areusch commented on pull request #9129: URL: https://github.com/apache/tvm/pull/9129#issuecomment-933956298
@Mousius

> I do empathise with this, but I don't think we should design a CI solution around the edge cases, by reducing the overall running jobs we can get to these faster when they do arise.

I kind of agree, but not sure 100% here. For example, suppose there are iterative test failures on ci-arm as well as test failures on ci-gpu, neither of which you have available locally. If you push to CI, you'll wind up using resources to rebuild on all platforms.

> There's two things this change fixes:
> 1. Machine availability - we keep overall machines free-er to start a job than they previously were as we fail out of them faster
> 2. Machine saturation - running multiple tasks on a single machine is going to result in n slow jobs, the fewer jobs you run the more compute you have free.
>
> I don't rely on CI for test results, but I can definitely feel the reluctance of waiting for CI to complete once you have a green tick given your change is then delayed to likely the next day each time.

I guess I am open to trying this, but I feel a bit like we should publicize this in the forum in case anyone else is attached to the current setup. My example came from an internal OctoML ask of me. I think I feel like this because I'm not sure we have hard metrics to consult.

> We should be very careful about considering the number of executors available as a metric as to how efficient CI is. When a Jenkins agent is under load from one set of branch builds it will have a negative effect on any other thing also running - so whilst we may never run out of executors on paper, this change would result in them being less loaded and thus more efficient at running CI jobs.

Ah! I agree this is the case right now, but I am sort of scheming to change this with the `xdist` work. Right now it is indeed possible to run > 1 PR on a node at once. With xdist, it'll be possible to use the entire node's resources on a single PR.
I haven't worked out the details yet, but would propose we do this for test-cases and then it will be possible to treat queue-depth as a measure of CI load :).
