waytrue17 opened a new issue #19556:
URL: https://github.com/apache/incubator-mxnet/issues/19556


   ## Description
   Running the mxnet-horovod example
`incubator-mxnet/example/distributed_training-horovod/gluon_mnist.py` on
mxnet1.8-cuda11.0 with Python 3.7 ends in a segfault. The segfault occurs
after the example script has finished running.
   The same script works fine on mxnet1.8-cuda10.2 with Python 3.7 and on
mxnet1.8-cuda11.0 with Python 3.6.
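
Because the crash happens after the script's own output, it is easy to miss in logs. The exit status is one way to catch it. This is a minimal sketch, assuming a POSIX system where `subprocess` reports a process killed by a signal as a negative return code. The helper name `exited_with_segfault` is hypothetical and is not part of MXNet or Horovod.

```python
import signal
import subprocess


def exited_with_segfault(cmd):
    """Run cmd and report whether it was terminated by SIGSEGV (POSIX only)."""
    result = subprocess.run(cmd)
    # On POSIX, subprocess reports death-by-signal as the negated signal number.
    return result.returncode == -signal.SIGSEGV


# Possible usage (the path is the example script named in this issue):
# exited_with_segfault(
#     ["python3", "incubator-mxnet/example/distributed_training-horovod/gluon_mnist.py"]
# )
```

A check like this only tells you that the process died from SIGSEGV. It does not tell you where the crash came from.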
   
   ## To Reproduce
   ### Steps to reproduce
   1. Launch an EC2 p3.8x GPU instance with the DLAMI `ami-02440419a5afe47ab`
   2. Build mx1.8-cu110 from source
   3. Install Horovod: `python3 -m pip install horovod`
   4. Run `LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64:$LD_LIBRARY_PATH python3 incubator-mxnet/example/distributed_training-horovod/gluon_mnist.py` to reproduce the error
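
Since the failure depends on the Python, MXNet and CUDA combination, it may help to record the versions in use before running step 4. This is a minimal sketch. The helper `report_versions` is hypothetical, and it assumes the packages expose a `__version__` attribute, falling back to `"unknown"` if they do not.

```python
import importlib
import platform


def report_versions(modules=("mxnet", "horovod")):
    """Return Python and package versions; packages that fail to import map to None."""
    versions = {"python": platform.python_version()}
    for name in modules:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # A missing package or one with broken native libraries both end up here.
            versions[name] = None
    return versions


print(report_versions())
```

This does not report the CUDA toolkit version; that still has to be read from the build configuration or from the `LD_LIBRARY_PATH` used in step 4.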
   
   ## What have you tried to solve it?
   
   1. Backporting #19378 to v1.8.x resolved the issue.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


