larroy opened a new issue #12994: Test failure and possible bug on GPU topology algorithm (test_device.test_device_pushpull)
URL: https://github.com/apache/incubator-mxnet/issues/12994

## Description
NVIDIA reported a failure in test_device.test_device_pushpull on a DGX-1V. I suspect there is a bug in the binary tree creation. I'm looking into this issue.

```
ERROR: test_device.test_device_pushpull
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/opt/mxnet/tests/python/gpu/test_device.py", line 74, in test_device_pushpull
    check_dense_pushpull('device')
  File "/opt/mxnet/tests/python/gpu/test_device.py", line 61, in check_dense_pushpull
    kv_device.push(cur_key, arr_list)
  File "/opt/mxnet/python/mxnet/kvstore.py", line 234, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/opt/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:44:02] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 2 using backtracking
```

## Environment info (Required)

```
What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.
```

Package used (Python/R/Scala/Julia):
(I'm using ...)

For Scala user, please provide:
1. Java version: (`java -version`)
2. Maven version: (`mvn -version`)
3. Scala runtime if applicable: (`scala -version`)

For R user, please provide R `sessionInfo()`:

## Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash:
(Paste the output of `git rev-parse HEAD` here.)

Build config:
(Paste the content of config.mk, or the build command.)

## Error Message:
(Paste the complete error message, including stack trace.)

```
[17:47:41] src/kvstore/././comm.h:743: only 14 out of 20 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv
[17:47:41] src/kvstore/././comm.h:752: v.vv.
[17:47:41] src/kvstore/././comm.h:752: vv.v.
[17:47:41] src/kvstore/././comm.h:752: vvv..
[17:47:41] src/kvstore/././comm.h:752: v....
[17:47:41] src/kvstore/././comm.h:743: only 18 out of 30 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv.
[17:47:41] src/kvstore/././comm.h:752: v.vv.v
[17:47:41] src/kvstore/././comm.h:752: vv.v..
[17:47:41] src/kvstore/././comm.h:752: vvv...
[17:47:41] src/kvstore/././comm.h:752: v....v
[17:47:41] src/kvstore/././comm.h:752: .v..v.
[17:47:41] src/kvstore/././comm.h:743: only 24 out of 42 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv..
[17:47:41] src/kvstore/././comm.h:752: v.vv.v.
[17:47:41] src/kvstore/././comm.h:752: vv.v..v
[17:47:41] src/kvstore/././comm.h:752: vvv....
[17:47:41] src/kvstore/././comm.h:752: v....vv
[17:47:41] src/kvstore/././comm.h:752: .v..v.v
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.
[17:47:41] src/kvstore/././comm.h:743: only 32 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[17:47:41] src/kvstore/././comm.h:752: .vvvv...
[17:47:41] src/kvstore/././comm.h:752: v.vv.v..
[17:47:41] src/kvstore/././comm.h:752: vv.v..v.
[17:47:41] src/kvstore/././comm.h:752: vvv....v
[17:47:41] src/kvstore/././comm.h:752: v....vvv
[17:47:41] src/kvstore/././comm.h:752: .v..v.vv
[17:47:41] src/kvstore/././comm.h:752: ..v.vv.v
[17:47:41] src/kvstore/././comm.h:752: ...vvvv.
[17:47:41] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[17:47:41] src/kvstore/././comm_tree.h:392: Using Tree
[17:47:41] src/kvstore/././comm_tree.h:489: Size 10 occurs 1 times
[17:47:41] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
[17:47:41] src/kvstore/././comm_tree.h:392: Using Tree
[17:47:41] src/kvstore/././comm_tree.h:489: Size 10 occurs 1 times
[17:47:41] src/kvstore/././comm_tree.h:381: Using Kernighan-Lin to generate trees
Traceback (most recent call last):
  File "test_device.py", line 82, in <module>
    test_device_pushpull()
  File "test_device.py", line 74, in test_device_pushpull
    check_dense_pushpull('device')
  File "test_device.py", line 61, in check_dense_pushpull
    kv_device.push(cur_key, arr_list)
  File "/opt/mxnet/python/mxnet/kvstore.py", line 234, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/opt/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:47:41] src/kvstore/./././gpu_topology.h:1040: No valid binary tree found from root 2 using backtracking

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7ffa6698659c]
[bt] (1) /usr/local/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7ffa66987918]
[bt] (2) /usr/local/lib/libmxnet.so(void mxnet::kvstore::ComputeTreesFromRoot<float>(std::vector<float, std::allocator<float> >*, int, int, float, bool, std::vector<unsigned long, std::allocator<unsigned long> >*, std::vector<unsigned long, std::allocator<unsigned long> >*)+0x1a65) [0x7ffa69a59ff5]
[bt] (3) /usr/local/lib/libmxnet.so(void mxnet::kvstore::ComputeTrees<float>(std::vector<float, std::allocator<float> > const&, int, float, bool, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >*)+0x5b5) [0x7ffa69a5a815]
[bt] (4) /usr/local/lib/libmxnet.so(mxnet::kvstore::CommDeviceTree::QueryTopology()+0x1609) [0x7ffa69a5d409]
[bt] (5) /usr/local/lib/libmxnet.so(mxnet::kvstore::CommDeviceTree::Reduce(int, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x137c) [0x7ffa69a5f0cc]
[bt] (6) /usr/local/lib/libmxnet.so(mxnet::kvstore::KVStoreLocal::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x1b9) [0x7ffa69a60ec9]
[bt] (7) /usr/local/lib/libmxnet.so(mxnet::kvstore::KVStoreLocal::Push(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0xc6) [0x7ffa69a02ee6]
[bt] (8) /usr/local/lib/libmxnet.so(MXKVStorePushEx+0x205) [0x7ffa6993d1d5]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7ffab18e9e20]
```

## Minimum reproducible example
(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)

## Steps to reproduce
(Paste the commands you ran that produced the error.)

1.
2.

## What have you tried to solve it?

1.
2.
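
The connectivity matrices that comm.h prints in the error log can be checked offline, which may help when comparing the topology against what the tree builder expects. The following is a minimal sketch, not MXNet code: the helper names (`parse_p2p_matrix`, `count_enabled_pairs`, `is_symmetric`) are hypothetical, and it assumes that `v` marks a GPU pair with direct access enabled and `.` marks a pair without it, which is consistent with the "32 out of 56" count in the log.

```python
# Hypothetical helpers, not part of MXNet. Assumption: in the rows printed by
# comm.h, 'v' means direct (P2P) access is enabled for that GPU pair and '.'
# means it is not.

def parse_p2p_matrix(rows):
    """Return one set of peer GPU indices per GPU (row)."""
    n = len(rows)
    for row in rows:
        if len(row) != n:
            raise ValueError("connectivity matrix must be square")
    return [{j for j, c in enumerate(row) if c == "v"} for row in rows]


def count_enabled_pairs(neighbors):
    """Count ordered (src, dst) pairs with direct access enabled."""
    return sum(len(peers) for peers in neighbors)


def is_symmetric(neighbors):
    """True if every enabled link i -> j also has j -> i enabled."""
    return all(i in neighbors[j]
               for i, peers in enumerate(neighbors)
               for j in peers)


# 8-GPU matrix copied from the log in this issue.
ROWS_8 = [
    ".vvvv...",
    "v.vv.v..",
    "vv.v..v.",
    "vvv....v",
    "v....vvv",
    ".v..v.vv",
    "..v.vv.v",
    "...vvvv.",
]

neighbors = parse_p2p_matrix(ROWS_8)
n = len(neighbors)
print(count_enabled_pairs(neighbors), "of", n * (n - 1), "ordered pairs enabled")
print("symmetric:", is_symmetric(neighbors))
for gpu, peers in enumerate(neighbors):
    print("GPU", gpu, "peers:", sorted(peers))
```

For the 8-GPU matrix this counts 32 enabled pairs out of 56, matching the warning from comm.h, and every GPU has exactly four peers. The same helpers can be applied to the 5-, 6- and 7-GPU matrices from the log to see which GPUs have fewer peers in the smaller configurations.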
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
