RuRo edited a comment on issue #18090: Aborted unix-gpu CI
URL: https://github.com/apache/incubator-mxnet/issues/18090#issuecomment-616019010
 
 
   Okay! I think I was able to track down the source of this problem. I
noticed that the last 3 times the pipeline froze for me, the last output was
`test_operator_gpu.test_np_true_divide`. The next test that is supposed to run
after it is `test_operator_gpu.test_np_unary_bool_funcs`.
   
   After messing around with `test_operator_gpu.test_np_unary_bool_funcs` a
bit, I was able to reproduce the hanging behaviour locally. I ran the test in
a loop, printing something between runs, and after a while the process got
stuck and stopped printing.
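Instead of watching the output by hand, a stall in a loop like this can also be detected automatically. The following is a minimal sketch using only the Python standard library; `run_until_stall` and its parameters are hypothetical names, not part of MXNet or its test suite. It calls `step()` repeatedly in a background thread and reports how many calls completed before progress stopped for `stall_timeout` seconds.

```python
import threading
import time


def run_until_stall(step, stall_timeout=5.0, max_iters=None):
    """Call `step()` repeatedly in a background thread until it stalls.

    Hypothetical helper. Returns the number of completed calls if no call
    finished for `stall_timeout` seconds, or None if `max_iters` calls
    completed without a stall. With max_iters=None it runs until a stall.
    """
    completed = 0
    lock = threading.Lock()
    finished = threading.Event()

    def worker():
        nonlocal completed
        while max_iters is None or completed < max_iters:
            step()
            with lock:
                completed += 1
        finished.set()

    # Daemon thread: if step() never returns, the thread is abandoned
    # instead of blocking interpreter exit.
    threading.Thread(target=worker, daemon=True).start()

    last_count, last_change = -1, time.monotonic()
    while not finished.wait(timeout=0.1):
        with lock:
            count = completed
        now = time.monotonic()
        if count != last_count:
            last_count, last_change = count, now
        elif now - last_change > stall_timeout:
            return count  # no progress for stall_timeout seconds
    return None
```

`step` could, for instance, call the test function directly, assuming the test module is importable. Note that an exception raised inside `step` also stops progress and is reported as a stall. To see where the process is actually stuck, `faulthandler.dump_traceback_later()` from the standard library can dump the Python stacks of all threads after a timeout.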
   
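   It didn't happen every time, and setting the PRNG seeds didn't make it fail
deterministically, so the issue is probably thread-related (e.g. a race or
deadlock between threads). It also definitely doesn't happen with
`MXNET_ENGINE_TYPE=NaiveEngine`, which executes operators synchronously on the
calling thread; that points at the threaded engine as well.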
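To compare the default engine against `NaiveEngine` side by side, one option is to run the reproducer in a child process with a time limit, setting `MXNET_ENGINE_TYPE` only in the child's environment. This is a minimal sketch under those assumptions; `run_python_with_engine` is a hypothetical helper. Running in a fresh child process ensures the variable is already set when MXNet is imported there.

```python
import os
import subprocess
import sys


def run_python_with_engine(args, engine_type=None, timeout=60.0):
    """Run `python <args...>` in a child process with the current interpreter.

    Hypothetical helper. If `engine_type` is given (e.g. "NaiveEngine"), it
    is exported to the child as MXNET_ENGINE_TYPE. Returns the child's exit
    code, or None if it was still running after `timeout` seconds (a hang).
    """
    env = dict(os.environ)
    if engine_type is not None:
        env["MXNET_ENGINE_TYPE"] = engine_type
    try:
        # On timeout, subprocess.run kills the child, waits for it,
        # and then raises TimeoutExpired.
        result = subprocess.run([sys.executable, *args], env=env, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return result.returncode
```

Since a reproducer that loops forever never exits, a bounded variant (a fixed number of iterations, saved as a hypothetical `repro.py`) would be needed. It could then be run once with the default engine and once with `engine_type="NaiveEngine"`. Getting `None` in the first case and an exit code in the second would match the behaviour described above.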
   
   Once I had a reproducible setup, I spent some time bisecting the
`test_operator_gpu.test_np_unary_bool_funcs` test to find the offending
operator. In the end, I narrowed it down to the `np.empty_like` operator:
   
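   ```python
   import mxnet as mx
   import sys
   
   
   # Run everything on the default GPU.
   with mx.Context(mx.gpu()):
       while True:
           for _ in range(100):
               mx_data = mx.np.array(0)          # 0-d input array
               mx_x = mx.np.empty_like(mx_data)  # the suspected operator
   
           # Print a dot every 100 iterations; when the dots stop, it has hung.
           sys.stdout.buffer.write(b'.')
           sys.stdout.buffer.flush()
   ```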
   
   The code above runs the `empty_like` operator in a loop and prints a dot
once every 100 iterations; after ~20 seconds it gets stuck. I took a quick
look at how `empty_like` is implemented but wasn't able to figure out what
exactly is wrong with it. It looks like a really thin wrapper around
`CustomOp`, which is used from C++, so I am giving up for now.
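The bisection step described above can be sketched as a generic helper. This is a sketch under explicit assumptions, not what was actually run. `bisect_culprit` is a hypothetical name, exactly one candidate is assumed to trigger the hang, and `reproduces(subset)` is assumed to be a reliable check (for example, running only those operator calls in a loop under a timeout). Because the hang is intermittent, a `False` from `reproduces` is only trustworthy if the check ran long enough. A false negative sends the search into the wrong half.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def bisect_culprit(candidates: Sequence[T],
                   reproduces: Callable[[Sequence[T]], bool]) -> T:
    """Find the single item in `candidates` that triggers a failure.

    Hypothetical helper. `reproduces(subset)` must return True iff running
    only `subset` triggers the problem. Assumes exactly one culprit.
    """
    items = list(candidates)
    if not items or not reproduces(items):
        raise ValueError("the full candidate set does not reproduce the problem")
    while len(items) > 1:
        mid = len(items) // 2
        first = items[:mid]
        # Keep whichever half still reproduces; with a single culprit,
        # a clean first half means the culprit is in the second half.
        items = first if reproduces(first) else items[mid:]
    return items[0]
```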

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 