ssttevee opened a new issue #11914: CUDA an illegal memory access was 
encountered
URL: https://github.com/apache/incubator-mxnet/issues/11914
 
 
   ## Description
   I've found a flaky crash related to RNNs and CUDA, but I'm not sure 
exactly what is causing it. I noticed it while working with a larger 
encoder-decoder network, and I've reduced it to a single script of fewer than 
100 lines of code with a data generator.
   
   ## Environment info (Required)
   
   I'm using Python 3.6 in two different environments, Windows 10 and Ubuntu:
   ```
   ----------Python Info----------
   Version      : 3.6.2
   Compiler     : MSC v.1900 64 bit (AMD64)
   Build        : ('v3.6.2:5fd33b5', 'Jul  8 2017 04:57:36')
   Arch         : ('64bit', 'WindowsPE')
   ------------Pip Info-----------
   Version      : 9.0.1
   Directory    : E:\SL\Software\Python36\lib\site-packages\pip
   ----------MXNet Info-----------
   Version      : 1.2.0
   Directory    : E:\SL\Software\Python36\lib\site-packages\mxnet
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   Platform     : Windows-10-10.0.17134-SP0
   system       : Windows
   node         : STEVE-PC
   release      : 10
   version      : 10.0.17134
   ----------Hardware Info----------
   machine      : AMD64
   processor    : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
   Name
   Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
   
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0450 
sec, LOAD: 1.8009 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0405 sec, LOAD: 
0.1900 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0436 sec, LOAD: 
0.1333 sec.
   Timing for FashionMNIST: 
https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz,
 DNS: 0.0410 sec, LOAD: 0.1675 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0326 sec, LOAD: 
0.6613 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0283 sec, 
LOAD: 0.1286 sec.
   ```
   ```
   ----------Python Info----------
   Version      : 3.6.6
   Compiler     : GCC 5.4.0 20160609
   Build        : ('default', 'Jun 28 2018 04:42:43')
   Arch         : ('64bit', 'ELF')
   ------------Pip Info-----------
   Version      : 18.0
   Directory    : /usr/local/lib/python3.6/dist-packages/pip
   ----------MXNet Info-----------
   Version      : 1.2.1
   Directory    : /usr/local/lib/python3.6/dist-packages/mxnet
   Commit Hash   : 106391a1f0ee012b1ea38764d711e76774ce77e1
   ----------System Info----------
   Platform     : Linux-4.4.0-124-generic-x86_64-with-Ubuntu-16.04-xenial
   system       : Linux
   node         : steve-pc
   release      : 4.4.0-124-generic
   version      : #148-Ubuntu SMP Wed May 2 13:00:18 UTC 2018
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                16
   On-line CPU(s) list:   0-15
   Thread(s) per core:    2
   Core(s) per socket:    8
   Socket(s):             1
   NUMA node(s):          1
   Vendor ID:             AuthenticAMD
   CPU family:            23
   Model:                 1
   Model name:            AMD Ryzen 7 1700 Eight-Core Processor
   Stepping:              1
   CPU MHz:               1550.000
   CPU max MHz:           3000.0000
   CPU min MHz:           1550.0000
   BogoMIPS:              5999.21
   Virtualization:        AMD-V
   L1d cache:             32K
   L1i cache:             64K
   L2 cache:              512K
   L3 cache:              8192K
   NUMA node0 CPU(s):     0-15
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf 
eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes 
xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a 
misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb 
bpext perfctr_l2 mwaitx hw_pstate retpoline retpoline_amd vmmcall fsgsbase bmi1 
avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero 
ibpb arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid 
decodeassists pausefilter pfthreshold
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0114 
sec, LOAD: 0.7577 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0394 sec, LOAD: 
0.1501 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1848 sec, LOAD: 
0.1442 sec.
   Timing for FashionMNIST: 
https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz,
 DNS: 0.0323 sec, LOAD: 0.1578 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0224 sec, LOAD: 
0.5354 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0184 sec, 
LOAD: 0.0980 sec.
   ```
   
   I have a GTX 1070 installed in the Windows machine and a Titan V installed 
in the Ubuntu machine, with 32GB of RAM in both machines. The GTX 1070 and 
Titan V have 8GB and 12GB of VRAM respectively.
   
   ## Error Message:
   ```
   Traceback (most recent call last):
     File "asymmetric_encoder_decoder_test.py", line 84, in <module>
       avg_loss = mx.nd.sum(loss).asscalar() / args.batch_size
     File "C:\Python36\lib\site-packages\mxnet\ndarray\ndarray.py", line 1894, 
in asscalar
       return self.asnumpy()[0]
     File "C:\Python36\lib\site-packages\mxnet\ndarray\ndarray.py", line 1876, 
in asnumpy
       ctypes.c_size_t(data.size)))
     File "C:\Python36\lib\site-packages\mxnet\base.py", line 149, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [22:07:27] 
c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\pooled_storage_manager.h:77: 
CUDA: an illegal memory access was encountered
   ```
   ```
   Traceback (most recent call last):
     File "asymmetric_encoder_decoder_test.py", line 84, in <module>
       avg_loss = mx.nd.sum(loss).asscalar() / args.batch_size
     File "/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py", 
line 1894, in asscalar
       return self.asnumpy()[0]
     File "/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py", 
line 1876, in asnumpy
       ctypes.c_size_t(data.size)))
     File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 149, in 
check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [22:41:02] 
src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: an illegal 
memory access was encountered
   
   Stack trace returned 10 entries:
   [bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f9912) 
[0x7fa9237c5912]
   [bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f9ee8) 
[0x7fa9237c5ee8]
   [bt] (2) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f4b6cf) 
[0x7fa9264176cf]
   [bt] (3) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f4fcdc) 
[0x7fa92641bcdc]
   [bt] (4) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2af5369) 
[0x7fa925fc1369]
   [bt] (5) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2b0a5c8) 
[0x7fa925fd65c8]
   [bt] (6) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2b0a6eb) 
[0x7fa925fd66eb]
   [bt] (7) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2942714) 
[0x7fa925e0e714]
   [bt] (8) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x294897b) 
[0x7fa925e1497b]
   [bt] (9) 
/usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2948b9e) 
[0x7fa925e14b9e]
   ```
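
Both tracebacks end in `asscalar()`, which is a blocking call. MXNet queues operators asynchronously, so the operator that actually faults may have been launched earlier than the line shown. The following is a minimal sketch, not something that has been run against this crash, of how the script could be relaunched with execution forced to be synchronous. `MXNET_ENGINE_TYPE=NaiveEngine` and `CUDA_LAUNCH_BLOCKING=1` are existing MXNet and CUDA environment variables; the helper names `synchronous_env` and `run_synchronously` are hypothetical.

```python
import os
import subprocess
import sys


def synchronous_env(base=None):
    """Return a copy of the environment with synchronous execution enabled."""
    env = dict(os.environ if base is None else base)
    # Run MXNet operators one at a time instead of on the threaded engine.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # Make every CUDA kernel launch block until the kernel has finished.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_synchronously(argv):
    """Run a command with the synchronous environment and return its exit code."""
    return subprocess.run(argv, env=synchronous_env()).returncode
```

For example, `run_synchronously([sys.executable, "asymmetric_encoder_decoder_test.py", "--data_length=10000", "--batch_size=6"])` would rerun the Windows configuration; the resulting traceback may then point closer to the faulting operator.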
   
   ## Minimum reproducible example
   I'm pretty sure this could be reduced further, but given the conditions of 
when the crash happens, I don't want to mess with it too much: 
https://gist.githubusercontent.com/ssttevee/1476b2b680df4eb70e02ed04cb643e96
   
   ## Steps to reproduce
   
   1. `wget 
https://gist.githubusercontent.com/ssttevee/1476b2b680df4eb70e02ed04cb643e96/raw/c1ac08912d995f8c25f453266d2f457139db95b1/asymmetric_encoder_decoder_test.py`
   2. `python3.6 asymmetric_encoder_decoder_test.py --data_length=10000 
--batch_size=6`
   
   You might have to fiddle around with the batch size to make it crash. On 
my Windows machine, it crashes with `--data_length=10000 --batch_size=6`. On my 
Ubuntu machine, it crashes with `--data_length=1000 --batch_size=64`. Those 
are not the only parameters that crash the program; batch sizes within roughly 
±2 of those values also cause it to crash.
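
Since the crashing batch size varies by machine, a small sweep can save some manual fiddling. This is a minimal sketch that assumes the script from step 1 is in the current directory and that a crash shows up as a non-zero exit code; `build_commands` and `sweep` are hypothetical helper names, and the `spread` of 2 mirrors the roughly ±2 range mentioned above.

```python
import subprocess
import sys

SCRIPT = "asymmetric_encoder_decoder_test.py"  # file downloaded in step 1


def build_commands(data_length, center_batch_size, spread=2, python=sys.executable):
    """Return one argv list per batch size in [center - spread, center + spread]."""
    first = max(1, center_batch_size - spread)
    last = center_batch_size + spread
    return [
        [python, SCRIPT, f"--data_length={data_length}", f"--batch_size={batch_size}"]
        for batch_size in range(first, last + 1)
    ]


def sweep(data_length, center_batch_size, spread=2, run=subprocess.run):
    """Run every command and map each batch size to the process exit code."""
    results = {}
    for argv in build_commands(data_length, center_batch_size, spread):
        batch_size = int(argv[-1].split("=")[1])
        results[batch_size] = run(argv).returncode
    return results
```

Calling `sweep(10000, 6)` would try batch sizes 4 through 8 with the Windows settings, and `sweep(1000, 64)` would try 62 through 66 with the Ubuntu settings. Because the crash is flaky, a zero exit code for one batch size does not prove that size is safe.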
   
   ## What have you tried to solve it?
   
   1. Reinstalled various versions of the NVIDIA drivers and CUDA toolkit
   2. Ran it in two different environments
   

