areusch commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-933956298


   @Mousius 
   > I do empathise with this, but I don't think we should design a CI solution 
around the edge cases, by reducing the overall running jobs we can get to these 
faster when they do arise.
   
   I mostly agree, but I'm not 100% sure here. For example, suppose you are 
iterating on test failures on ci-arm as well as on ci-gpu, and you have neither 
platform available locally. Every push to CI then winds up using resources to 
rebuild on all platforms.
   
   > There's two things this change fixes:
   
   > 1. Machine availability - we keep overall machines free-er to start a job 
than they previously were as we fail out of them faster
   > 2. Machine saturation - running multiple tasks on a single machine is 
going to result in n slow jobs, the fewer jobs you run the more compute you 
have free.
   > I don't rely on CI for test results, but I can definitely feel the 
reluctance of waiting for CI to complete once you have a green tick given your 
change is then delayed to likely the next day each time.
   
   I guess I am open to trying this, but I think we should publicize it in the 
forum in case anyone else is attached to the current setup. My example came 
from an internal OctoML ask of me. I feel this way because I'm not sure we 
have hard metrics to consult.
   
   > We should be very careful about considering the number of executors 
available as a metric as to how efficient CI is. When a Jenkins agent is under 
load from one set of branch builds it will have a negative effect on any other 
thing also running - so whilst we may never run out of executors on paper, this 
change would result in them being less loaded and thus more efficient at 
running CI jobs.
   
   Ah! I agree this is the case right now, but I am scheming to change this 
with the `xdist` work. Right now it is indeed possible to run more than one PR 
on a node at once. With xdist, it'll be possible to use the entire node's 
resources on a single PR. I haven't worked out the details yet, but I would 
propose we do this for test cases; then it will be possible to treat queue 
depth as a measure of CI load :).
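   
   A rough sketch of the xdist idea, not a worked-out design: pytest-xdist adds 
an `-n` option, and `-n auto` starts one worker per available CPU, so a single 
PR's test run can occupy the whole node. The helper name `xdist_pytest_args` 
and the `tests/` path are hypothetical, made up for illustration.

```python
def xdist_pytest_args(test_paths, workers=None):
    """Build pytest arguments that spread one test run over several workers.

    `-n` is the pytest-xdist option for the worker count; `-n auto` asks
    xdist to start one worker per available CPU.
    """
    # Default to "auto" so a single PR can use all of the node's cores.
    n = "auto" if workers is None else str(int(workers))
    return ["-n", n, *test_paths]


# Usage (assumes pytest and pytest-xdist are installed; "tests/" is a
# placeholder path, not necessarily the real test directory):
#   import pytest
#   raise SystemExit(pytest.main(xdist_pytest_args(["tests/"])))
```

   If each job is sized to fill a node like this, nodes stop being shared 
between PRs, which is what would make queue depth a meaningful load signal.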
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

