guptaanshul201989 opened a new issue #18734:
URL: https://github.com/apache/incubator-mxnet/issues/18734


   I am trying to train a transformer seq-to-seq model on SageMaker (the 
script I am using works fine when I run it on a multi-GPU EC2 instance).
   
   When I start a training job on SageMaker, training progresses fine, but 
it logs a CUDA error:
   
   ```
   [03:28:04] src/engine/threaded_engine_perdevice.cc:101: Ignore CUDA Error
   [03:28:04] /root/pip_build/mxnet-build/3rdparty/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess: CUDA: initialization error
   Stack trace:
   [bt] (0) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6dfb0b) [0x7f9f2591cb0b]
   [bt] (1) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3898dd2) [0x7f9f28ad5dd2]
   [bt] (2) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38bc49e) [0x7f9f28af949e]
   [bt] (3) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38aee71) [0x7f9f28aebe71]
   [bt] (4) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a4a21) [0x7f9f28ae1a21]
   [bt] (5) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a5974) [0x7f9f28ae2974]
   [bt] (6) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f9f28d1ce1a]
   [bt] (7) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6e32ba) [0x7f9f259202ba]
   [bt] (8) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >::~vector()+0xc8) [0x7f9f25951818]
   ```
   
   
   I found that this error occurs when I initialize the DataLoader with 
multiprocessing workers (`num_workers > 0`). When I switch to 
`thread_pool=True` instead, I don't see this error.
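   This pattern is consistent with CUDA state not surviving a `fork()`: a CUDA context created in the parent process cannot be reused in a forked worker, whereas threads share the parent's context. A minimal sketch of the fork behaviour, using a plain dict as a hypothetical stand-in for the device context (no CUDA or MXNet required; the names here are illustrative, not MXNet internals):

   ```python
   import multiprocessing as mp

   # Hypothetical stand-in for per-process device state (e.g. a CUDA context),
   # initialized in the parent before workers are created.
   state = {"device_initialized": False}

   def init_device():
       state["device_initialized"] = True

   def worker(q):
       # A forked worker inherits a copy of the parent's state, including any
       # device handles -- which CUDA forbids reusing after fork().
       q.put(state["device_initialized"])

   init_device()

   # Fork-based workers (the multiprocessing DataLoader case on Linux).
   ctx = mp.get_context("fork")
   q = ctx.Queue()
   p = ctx.Process(target=worker, args=(q,))
   p.start()
   inherited = q.get()
   p.join()
   print("forked worker inherited device state:", inherited)  # True
   ```

   Because the forked child inherits the already-initialized context, any CUDA call it makes fails with an initialization error; thread-based workers (as with `thread_pool=True`) avoid the fork entirely and share the parent's context.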
   
   

