amithr1 opened a new issue #8695: Hangs training on P100
URL: https://github.com/apache/incubator-mxnet/issues/8695
 
 
   I am trying to train ImageNet using the default ResNet on a single node 
with up to 4 P100s. When I use the master branch, training hangs. When I 
attach gdb to the hung process, I see the following stack trace. If anyone has 
suggestions, I can debug the problem further. The problem happens with more 
than 2 GPUs. With 2 GPUs, I can run for several epochs. However, when I use 
4 GPUs, it hangs within the first epoch.
   
   (gdb) bt
   #0  0x00003fffac2cdd60 in pthread_cond_wait@@GLIBC_2.17 () at /lib64/libpthread.so.0
   #1  0x00003fff4777608c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () at /lib64/libstdc++.so.6
   #2  0x00003fff6a3e236c in std::condition_variable::wait<mxnet::engine::ThreadedEngine::WaitForVar(mxnet::Engine::VarHandle)::__lambda18>(std::unique_lock<std::mutex>&, mxnet::engine::ThreadedEngine::__lambda18) (this=0x3fff2c001198, __lock=..., __p=...) at /usr/include/c++/4.8.2/condition_variable:93
   #3  0x00003fff6a3e1d10 in mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) (this=0x3fff2c001150, var=0x3bff50a6a900) at src/engine/threaded_engine.cc:358
   #4  0x00003fff699b6cc8 in mxnet::NDArray::WaitToWrite() const (this=0x3bff49fa0cf0) at include/mxnet/./ndarray.h:330
   #5  0x00003fff69be4c88 in mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const (this=0x3bff49fa0cf0, data=0x3bff9c9862c0, size=32) at src/ndarray/ndarray.cc:1210
   #6  0x00003fff6a44d190 in MXNDArraySyncCopyToCPU(NDArrayHandle, void*, size_t) (handle=0x3bff49fa0cf0, data=0x3bff9c9862c0, size=32) at src/c_api/c_api.cc:253
   #7  0x00003fffabed7254 in  () at /lib64/libffi.so.6
   #8  0x00003fffabed5f50 in ffi_call () at /lib64/libffi.so.6
   #9  0x00003fffa5247b24 in _ctypes_callproc () at /usr/lib64/python2.7/lib-dynload/_ctypes.so
   #10 0x00003fffa523a6ac in PyCFuncPtr_call () at /usr/lib64/python2.7/lib-dynload/_ctypes.so
   #11 0x00003fffac361444 in PyObject_Call () at /lib64/libpython2.7.so.1.0
   #12 0x00003fffac4669f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
   #13 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
   #14 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
   #15 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
   #16 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
   #17 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
   #18 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
   #19 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
   #20 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
   #21 0x00003fffac468c70 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
   #22 0x00003fffac468c70 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
   #23 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
   #24 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
   #25 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
   #26 0x00003fffac4684f0 in PyEval_EvalFrameEx () at /lib64/libpython2.7.so.1.0
   #27 0x00003fffac46cb40 in PyEval_EvalCodeEx () at /lib64/libpython2.7.so.1.0
   #28 0x00003fffac46cc64 in PyEval_EvalCode () at /lib64/libpython2.7.so.1.0
   #29 0x00003fffac4a0528 in PyRun_FileExFlags () at /lib64/libpython2.7.so.1.0
   #30 0x00003fffac4a274c in PyRun_SimpleFileExFlags () at /lib64/libpython2.7.so.1.0
   #31 0x00003fffac4a2e9c in PyRun_AnyFileExFlags () at /lib64/libpython2.7.so.1.0
   #32 0x00003fffac4beb7c in Py_Main () at /lib64/libpython2.7.so.1.0
   #33 0x0000000010000738 in main ()
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
