areusch commented on pull request #9129: URL: https://github.com/apache/tvm/pull/9129#issuecomment-932996398
So originally the request was essentially about the integration tests, which we run in smaller sets (e.g. relay, topi, etc.). When a test in an early set fails, results from the later ones aren't reported. This change isn't quite the same, but the same argument applies to why you may not want fail-fast: for example, if a test fails in the `ci_arm` container, you may not know whether it's also failing in `ci_gpu`, or vice versa. I agree CI is not a personal testing environment, but it is sometimes the easiest way for developers to access cloud platforms they don't have locally, e.g. ARM or GPU.

@Mousius the comment you referenced is a bit more general, and I'm not sure this specific issue contributes much to CI taking a while to complete; you can monitor CI if you're anxious for the test results. One effort in progress is `xdist`, which should have a bigger impact without potentially making it harder to access a test platform you don't have locally. I'm not opposed to changing CI to improve developer productivity, but could you motivate this specific change a bit more? In practice this seems most likely to result in cancellation of GPU integration tests, but the [number of available GPU executors](https://ci.tlcpack.ai/label/GPU/load-statistics?type=hour) has not been 0 in the past month. Perhaps we should track that stat for a bit now that #9128 is in; I'm wondering if it may already have somewhat addressed this concern.

@jroesch your comment is a bit generic. I'd still like to see more rationale for cancelling the GPU unit tests when an ARM one fails.
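To make the fail-fast tradeoff concrete, here is a minimal sketch. The stage names and the `run_stages` helper are hypothetical illustrations, not TVM's actual Jenkins logic; the point is only that with independent stages, stopping at the first failure hides whether later stages would also have failed.

```python
def run_stages(stages, fail_fast):
    """Run (name, passed) stage pairs; return the names of observed failures.

    With fail_fast=True, execution stops at the first failure, so the
    status of every later stage remains unknown to the developer.
    """
    failures = []
    for name, passed in stages:
        if not passed:
            failures.append(name)
            if fail_fast:
                break  # later stages are cancelled; their results are lost
    return failures

# Toy scenario: both the ARM and GPU stages would fail.
stages = [("ci_arm", False), ("ci_gpu", False), ("ci_lint", True)]

print(run_stages(stages, fail_fast=True))   # only the ci_arm failure is reported
print(run_stages(stages, fail_fast=False))  # both failures are reported in one run
```

Under fail-fast, the developer learns about the `ci_gpu` failure only on the next CI run, after fixing `ci_arm`; without it, both failures surface in a single pass.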
