leezu edited a comment on issue #18772: URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755
Horovod includes the MXNet C++ headers and uses them to interact with the Engine: https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24

But C++ does not have a stable ABI, and your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this when building Horovod in the same container that is used for building the binary wheels?

I tried the following steps to compile in the container:

```
docker run --privileged --cap-add=NET_ADMIN --gpus=all -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
```

```
docker container list
docker container exec -it aa5253f2282f bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
yum install openmpi-devel
git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
cd horovod
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user
```
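As a rough way to test the ABI hypothesis, one could compare which libstdc++ string ABI each shared library appears to use. The sketch below is only a heuristic and is not taken from the Horovod or MXNet build scripts. It relies on the fact that libstdc++'s newer dual ABI puts `std::string` in the `std::__cxx11` namespace, so mangled symbol names contain `__cxx11`. The helper names `uses_cxx11_abi` and `abi_of_library` and the library paths in the comments are hypothetical. A library that exposes no `std::string` in its symbols will report `old-or-unknown` either way, and a match does not rule out other ABI differences.

```shell
#!/usr/bin/env bash
# Heuristic ABI check (a sketch, not an official MXNet or Horovod tool).

# uses_cxx11_abi: reads symbol names on stdin and succeeds (exit 0)
# if any of them mentions the std::__cxx11 namespace.
uses_cxx11_abi() {
    grep -q '__cxx11'
}

# abi_of_library: hypothetical helper that prints "new" when the dynamic
# symbol table of the given shared library references std::__cxx11, and
# "old-or-unknown" otherwise. Requires binutils' nm.
abi_of_library() {
    if nm -D "$1" 2>/dev/null | uses_cxx11_abi; then
        echo "new"
    else
        echo "old-or-unknown"
    fi
}

# Example usage; both paths are assumptions and depend on the installation:
#   abi_of_library /path/to/site-packages/mxnet/libmxnet.so
#   abi_of_library /path/to/site-packages/horovod/mxnet/mpi_lib.so
```

If the two libraries report different results, that would be consistent with an ABI mismatch, but it would not by itself prove that the mismatch causes the crash.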