eric-haibin-lin opened a new issue #18014: enabling mkldnn leads to segfault in bytePS
URL: https://github.com/apache/incubator-mxnet/issues/18014

I am using the `bytepsimage/mxnet` docker image to test bytePS. However, I find that recent MXNet versions lead to a segfault. Specifically, if I build the mxnet-cu100 variant with commit b6b1de092b2bbc6ab7207a98dcb1c08fe67ca14b, the following commands work:

```
docker pull bytepsimage/mxnet
nvidia-docker run -it --net=host --shm-size=32768m bytepsimage/mxnet bash

# now you are in docker environment
export NVIDIA_VISIBLE_DEVICES=0,1,2,3  # gpus list
export DMLC_WORKER_ID=0  # your worker id
export DMLC_NUM_WORKER=1  # one worker
export DMLC_ROLE=worker

# the following value does not matter for non-distributed jobs
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234

bpslaunch python3 /usr/local/byteps/example/mxnet/train_gluon_mnist_byteps.py
```

However, if I use commit 2f6cdd383abbf46a37b84a5fad013726b5c62169, it gives me a segfault. I used `source tools/staticbuild/build.sh mxnet-cu100 pip` to build the pip package. If I use the latest nightly build, it also gives me a segfault.

@TaoLv any idea why?

Related issue: https://github.com/bytedance/byteps/issues/222
