RuRo opened a new issue #18090: Aborted unix-gpu CI
URL: https://github.com/apache/incubator-mxnet/issues/18090

## Description
`unix-gpu` CI jobs currently behave inconsistently: the build is sometimes aborted and sometimes completes fine. I've seen this multiple times on different PRs.

Until today, I assumed this was caused by the limited number of available GPU executors, with the jobs being aborted either manually or by some automatic priority setup in Jenkins (maybe priority goes to CI/CD for master or something). However, I've noticed a few consistent oddities about these aborted jobs, so I want to confirm that the current behaviour is intentional.

1) In almost every case I've seen of `unix-gpu` getting aborted, all the build steps and all the other tests had completed, and only a single `Python 3: GPU` or `Python 3: GPU (TVM_OP OFF)` test step was aborted.

2) Normally, these steps seem to take around 1 hour to complete, but in the cases where they were aborted, the abort came after **3 hours**. Additionally, there is a large jump in the logs between the time of the last log message from the test and the first message of the shutdown triggered by the interrupt signal.

<details><pre><code>[2020-04-16T14:57:56.542Z] test_operator_gpu.test_np_diag ... ok (2.9642s)
[2020-04-16T14:57:56.797Z] test_operator_gpu.test_np_diag_indices_from ... ok (0.2669s)
[2020-04-16T14:58:00.957Z] test_operator_gpu.test_np_diagflat ... ok (3.5951s)
[2020-04-16T14:58:01.882Z] test_operator_gpu.test_np_diagonal ... ok (1.4132s)
[2020-04-16T14:58:04.397Z] test_operator_gpu.test_np_diff ... ok (2.0127s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dot ... ok (1.8446s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dsplit ... ok (0.0832s)
[2020-04-16T14:58:06.013Z] test_operator_gpu.test_np_dstack ... ok (0.0664s)
[2020-04-16T14:58:15.936Z] test_operator_gpu.test_np_ediff1d ... ok (8.6182s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_einsum ... ok (2.4994s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_empty ... ok (0.0161s)
[2020-04-16T17:33:48.723Z] Sending interrupt signal to process
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,990 - root - WARNING - Signal 15 received, cleaning up...
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,991 - root - WARNING - Cleaning up containers
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,155 - root - INFO - ☠: stopped container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - 🚽: removed container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - Cleaning up containers finished.
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - WARNING - done. Exiting with error.
[2020-04-16T17:33:57.549Z] script returned exit code 1
</code></pre></details>

These are consecutive log messages, but there is a gap of more than 2.5 hours between `test_operator_gpu.test_np_empty ... ok` at `14:58` and `Sending interrupt signal to process` at `17:33`.

## Occurrences
Here are the 2 examples from the screenshots:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18055/4/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18054/6/pipeline

and a random example, not from my PRs:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18081/1/pipeline

Can somebody clarify:
1) whether `unix-gpu` CI jobs getting aborted is intentional
2) if it is, whether there is something we can do to at the very least abort the tests faster, or maybe not fail these jobs at all and instead automatically reschedule only the aborted step (not the whole pipeline)
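
To check other builds for the same silent period, the timestamps in a saved console log can be compared pairwise. The following is only a minimal sketch: `find_gaps`, the 10-minute `threshold_s` default, and the sample `log` list are hypothetical names and values chosen for illustration, and the only assumption about the input is that each line starts with the bracketed UTC timestamp seen in the excerpt above.

```python
import re
from datetime import datetime

# Assumed format: each console line starts with a bracketed UTC timestamp,
# as in the excerpt above, e.g. "[2020-04-16T14:58:17.295Z] ...".
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\]")


def find_gaps(lines, threshold_s=600.0):
    """Yield (gap_seconds, previous_line, current_line) for every pair of
    consecutive timestamped lines that are more than threshold_s apart."""
    prev_time = prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # ignore lines without a leading timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if prev_time is not None:
            gap = (current - prev_time).total_seconds()
            if gap > threshold_s:
                yield gap, prev_line, line
        prev_time, prev_line = current, line


# Four lines copied from the log excerpt in this issue.
log = [
    "[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_einsum ... ok (2.4994s)",
    "[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_empty ... ok (0.0161s)",
    "[2020-04-16T17:33:48.723Z] Sending interrupt signal to process",
    "[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,990 - root - WARNING - Signal 15 received, cleaning up...",
]

for gap, before, after in find_gaps(log):
    print(f"{gap / 60:.1f} min of silence")
    print("  last line before:", before)
    print("  first line after:", after)
```

On these four lines the sketch reports a single gap of about 155.5 minutes, ending at `Sending interrupt signal to process`. It only locates the silent period; it says nothing about why the test stopped producing output.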
