ThomasDelteil commented on issue #10042: [MXNET-86] Gluon dataloader crash on speech recognition training
URL: https://github.com/apache/incubator-mxnet/issues/10042#issuecomment-374316037
 
 
   When using `num_workers > 0`, I get the following segfault after a few hundred to a few thousand batches (the higher the number of workers, the sooner it occurs):
   
   I am using mxnet-cu90 1.1.0:
   
   ```
   Segmentation fault: 11

   Stack trace returned 10 entries:
   [bt] (0) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x276938) [0x7fe86492c938]
   [bt] (1) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28c53ae) [0x7fe866f7b3ae]
   [bt] (2) /lib64/libc.so.6(+0x353a0) [0x7fe8e4fe33a0]
   [bt] (3) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28c2703) [0x7fe866f78703]
   [bt] (4) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28c46d8) [0x7fe866f7a6d8]
   [bt] (5) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXNDArrayCreateFromSharedMem+0x5f5) [0x7fe866a4f4c5]
   [bt] (6) /home/ec2-user/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fe8d9299ec0]
   [bt] (7) /home/ec2-user/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7fe8d929987d]
   [bt] (8) /home/ec2-user/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7fe8d94ae82e]
   [bt] (9) /home/ec2-user/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x12265) [0x7fe8d94af265]
   *** Error in `/home/ec2-user/anaconda3/bin/python': malloc(): memory corruption: 0x00007fe8380111f0 ***
   ```
   
   I am running this notebook: https://github.com/ThomasDelteil/CNN_NLP_MXNet/blob/master/Crepe-Gluon.ipynb, with this line:
   `curr_loss = nd.mean(loss).asscalar()`
   changed to:
   `curr_loss = nd.mean(loss)`
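
   For reference, a minimal sketch of where that change sits in the training loop (names like `net`, `trainer`, `train_data`, and `softmax_cross_entropy` stand in for the notebook's objects, so treat them as assumptions):

   ```
   from mxnet import nd, autograd

   for data, label in train_data:
       with autograd.record():
           output = net(data)
           loss = softmax_cross_entropy(output, label)
       loss.backward()
       trainer.step(data.shape[0])

       # Original line: .asscalar() copies the result to the CPU and blocks
       # until the async engine has actually computed the loss, i.e. one
       # synchronization point per batch.
       # curr_loss = nd.mean(loss).asscalar()

       # Changed line: keeping an NDArray skips the blocking copy, so the
       # Python frontend can race ahead of the engine and the workers.
       curr_loss = nd.mean(loss)
   ```

   My guess, and it is only a guess, is that removing the per-batch synchronization lets batches queue up much faster, which would fit the crash appearing sooner with more workers.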
   
   Sometimes, though not always, the workers also fill /dev/shm to 100% after the segfault. I am running the code in JupyterLab.
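
   To watch the shared-memory usage I use something like this (a small standard-library sketch; the multi-worker loader hands batches back through shared memory, which is also where `MXNDArrayCreateFromSharedMem` in the trace above points):

   ```
   import shutil

   # Worker processes pass batches back via /dev/shm; segments leaked by
   # crashed workers accumulate there until it fills up.
   total, used, free = shutil.disk_usage('/dev/shm')
   print(f"/dev/shm: {used / total:.0%} used, {free / 2**20:.0f} MiB free")
   ```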
   
   Is this the same issue?
   
   Should I open a new one?
   
   The issue does not happen without multiprocessing, i.e. with `num_workers=0`.
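
   For completeness, the data-loading setup in question, sketched with a placeholder `dataset` and batch size (not the notebook's exact values):

   ```
   from mxnet import gluon

   # Crashes after a few hundred/thousand batches; more workers fail sooner.
   train_data = gluon.data.DataLoader(dataset, batch_size=128,
                                      shuffle=True, num_workers=8)

   # Runs cleanly: num_workers=0 loads everything in the main process.
   train_data = gluon.data.DataLoader(dataset, batch_size=128,
                                      shuffle=True, num_workers=0)
   ```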
