eric-haibin-lin opened a new issue #18014: enabling mkldnn leads to segfault in 
bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014
 
 
   I am using the `bytepsimage/mxnet` docker image to test bytePS. However, I find 
that recent MXNet versions lead to a segfault. 
   
   Specifically, if I build the mxnet-cu100 variant at commit 
b6b1de092b2bbc6ab7207a98dcb1c08fe67ca14b, the following commands work:
   ```
   docker pull bytepsimage/mxnet
   
   nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/mxnet bash
   
   # now you are in docker environment
   export NVIDIA_VISIBLE_DEVICES=0,1,2,3  # gpus list
   export DMLC_WORKER_ID=0 # your worker id
   export DMLC_NUM_WORKER=1 # one worker
   export DMLC_ROLE=worker 
   
   # the following values do not matter for non-distributed jobs 
   export DMLC_NUM_SERVER=1 
   export DMLC_PS_ROOT_URI=10.0.0.1 
   export DMLC_PS_ROOT_PORT=1234 
   
   bpslaunch python3 /usr/local/byteps/example/mxnet/train_gluon_mnist_byteps.py
   
   ```
   
   However, if I use commit 2f6cdd383abbf46a37b84a5fad013726b5c62169, I get a 
segfault. 
   I used `source tools/staticbuild/build.sh mxnet-cu100 pip` to build the pip 
package. 
   
   If I use the latest nightly build, I also get a segfault. 
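
To confirm whether a given build actually has MKL-DNN enabled before comparing the two commits, a minimal sketch along these lines could be run inside the container. It assumes the `mxnet.runtime.Features` API shipped with recent MXNet 1.x builds and the feature name `MKLDNN`; the helper name `mkldnn_enabled` is only illustrative and is not part of MXNet or bytePS.

```python
import importlib


def mkldnn_enabled(feature="MKLDNN"):
    """Return True/False as reported by MXNet, or None if it cannot be determined."""
    try:
        # Assumption: the installed MXNet exposes the mxnet.runtime module.
        runtime = importlib.import_module("mxnet.runtime")
    except ImportError:
        return None  # MXNet is not installed in this environment
    try:
        return bool(runtime.Features().is_enabled(feature))
    except RuntimeError:
        return None  # this build does not recognise the feature name


if __name__ == "__main__":
    print("MKLDNN enabled:", mkldnn_enabled())
```

Running this against both the working and the failing pip package would show whether the two builds differ in this flag; it does not by itself explain the segfault.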
   
   @TaoLv any idea why?
   
   Related issue: https://github.com/bytedance/byteps/issues/222 
