RuRo opened a new issue #18090: Aborted unix-gpu CI
URL: https://github.com/apache/incubator-mxnet/issues/18090
 
 
   ## Description
   There is currently some weird behaviour with `unix-gpu` CI jobs, where the build sometimes gets aborted and other times completes fine. I've seen this multiple times on different PRs.
   
   Until today, I thought that this was caused by the limited number of available GPU executors, with jobs getting aborted manually or by some automatic priority setup in Jenkins (maybe priority goes to CI/CD for master or something).
   
   However, I've noticed a few consistently weird things about these aborted jobs, so I wanted to make sure that the current behaviour is intentional.
   
   1) In the cases I've seen where `unix-gpu` got aborted, it was almost always a situation where all the build steps and all the other tests had completed, but a single `Python 3: GPU` or `Python 3: GPU (TVM_OP OFF)` test step was aborted. <details>
   
![image](https://user-images.githubusercontent.com/3747318/79561369-eeab6c80-80b1-11ea-934c-b849e3d38259.png)
 
![image](https://user-images.githubusercontent.com/3747318/79561571-3b8f4300-80b2-11ea-87b8-83a5c9c77da0.png)
 </details>
   2) Normally, these steps seem to take around 1 hour to complete, but in the cases where they were aborted, the abort came after **3 hours**. Additionally, there is a weird jump in the logs between the time of the last log message from the test and the first message about shutting down due to the interrupt signal.<details><pre><code>[2020-04-16T14:57:56.542Z] 
test_operator_gpu.test_np_diag ... ok (2.9642s)
[2020-04-16T14:57:56.797Z] test_operator_gpu.test_np_diag_indices_from ... ok (0.2669s)
[2020-04-16T14:58:00.957Z] test_operator_gpu.test_np_diagflat ... ok (3.5951s)
[2020-04-16T14:58:01.882Z] test_operator_gpu.test_np_diagonal ... ok (1.4132s)
[2020-04-16T14:58:04.397Z] test_operator_gpu.test_np_diff ... ok (2.0127s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dot ... ok (1.8446s)
[2020-04-16T14:58:05.758Z] test_operator_gpu.test_np_dsplit ... ok (0.0832s)
[2020-04-16T14:58:06.013Z] test_operator_gpu.test_np_dstack ... ok (0.0664s)
[2020-04-16T14:58:15.936Z] test_operator_gpu.test_np_ediff1d ... ok (8.6182s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_einsum ... ok (2.4994s)
[2020-04-16T14:58:17.295Z] test_operator_gpu.test_np_empty ... ok (0.0161s)
[2020-04-16T17:33:48.723Z] Sending interrupt signal to process
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,990 - root - WARNING - Signal 15 received, cleaning up...
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:48,991 - root - WARNING - Cleaning up containers
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,155 - root - INFO - ☠: stopped container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - 🚽: removed container 89c8acd3217d
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - INFO - Cleaning up containers finished.
[2020-04-16T17:33:57.546Z] 2020-04-16 17:33:53,241 - root - WARNING - done. Exiting with error.
[2020-04-16T17:33:57.549Z] script returned exit code 1
</code></pre></details> These are consecutive log messages, but you can see a huge time skip between `test_operator_gpu.test_np_empty ... ok` at `14:58` and `Sending interrupt signal to process` at `17:33`.
   
   ## Occurrences
   Here are the 2 examples from the screenshots:
   
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18055/4/pipeline
   
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18054/6/pipeline
   and a random example, not from my PRs:
   
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18081/1/pipeline
   
   Can somebody clarify:
   1) whether `unix-gpu` CI jobs getting aborted is intentional, and
   2) if it is, whether there is something we can do to at least abort the tests faster, or perhaps not fail these jobs at all and instead automatically reschedule only the aborted step (not the whole pipeline)?
