leezu edited a comment on issue #18772:
URL: https://github.com/apache/incubator-mxnet/issues/18772#issuecomment-662781755


   Horovod includes the MXNet C++ headers and uses them to interact with the 
Engine: 
   
   
https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24
   
   However, C++ does not have a stable ABI, so your Horovod build may not use 
the same ABI as the MXNet binary wheel. Could this be the source of the crash? 
Have you tried reproducing this with Horovod built in the same container that 
is used to build the binary wheels?
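
One quick way to look for such a mismatch, assuming both libraries were built against libstdc++ (GCC), is to check whether their dynamic symbols mention the `std::__cxx11` namespace, which libstdc++ uses for its newer string and list ABI. The sketch below is only a heuristic: `abi_of_symbols` is a hypothetical helper name, and the library paths in the usage comment are placeholders.

```shell
# abi_of_symbols (hypothetical helper): reads symbol names on stdin and reports
# whether any of them mention the std::__cxx11 inline namespace.
abi_of_symbols() {
  if grep -q '__cxx11'; then
    echo "cxx11"
  else
    echo "pre-cxx11-or-unknown"
  fi
}

# Usage (paths are placeholders; run once per library and compare the answers):
#   nm -D /path/to/libmxnet.so | abi_of_symbols
#   nm -D /path/to/horovod_mxnet_extension.so | abi_of_symbols
```

Differing answers would suggest the two libraries were built with different `_GLIBCXX_USE_CXX11_ABI` settings. Identical answers do not rule out other ABI differences, such as compiler version or build flags.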
   
   I tried the following steps to compile in the container:
   
   ```
   docker run --gpus=all -it mxnetci/build.centos7_gpu_cu102 /bin/bash
   source /opt/rh/devtoolset-7/enable
   source /opt/rh/rh-python36/enable
   pip install pyyaml cffi
   yum install openmpi-devel
   git clone --recursive https://github.com/eric-haibin-lin/horovod.git
   cd horovod
   git checkout mx2
   git submodule update --init --recursive
   PATH=/usr/lib64/openmpi/bin/:$PATH HOROVOD_WITH_MXNET=1 \
     HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 \
     python setup.py install --user
   ```



