ssttevee opened a new issue #11914: CUDA an illegal memory access was encountered
URL: https://github.com/apache/incubator-mxnet/issues/11914

## Description

I've found a flaky crash related to RNNs and CUDA, but I'm not sure exactly what is causing it. I noticed it while working with a larger encoder-decoder network, and I've reduced it to a single script of fewer than 100 lines of code, with a data generator.

## Environment info (Required)

I'm using python3.6 on 2 different environments, Windows 10 and Ubuntu:

```
----------Python Info----------
Version : 3.6.2
Compiler : MSC v.1900 64 bit (AMD64)
Build : ('v3.6.2:5fd33b5', 'Jul 8 2017 04:57:36')
Arch : ('64bit', 'WindowsPE')
------------Pip Info-----------
Version : 9.0.1
Directory : E:\SL\Software\Python36\lib\site-packages\pip
----------MXNet Info-----------
Version : 1.2.0
Directory : E:\SL\Software\Python36\lib\site-packages\mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Windows-10-10.0.17134-SP0
system : Windows
node : STEVE-PC
release : 10
version : 10.0.17134
----------Hardware Info----------
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
Name
Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0450 sec, LOAD: 1.8009 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0405 sec, LOAD: 0.1900 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0436 sec, LOAD: 0.1333 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0410 sec, LOAD: 0.1675 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0326 sec, LOAD: 0.6613 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0283 sec, LOAD: 0.1286 sec.
```

```
----------Python Info----------
Version : 3.6.6
Compiler : GCC 5.4.0 20160609
Build : ('default', 'Jun 28 2018 04:42:43')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 18.0
Directory : /usr/local/lib/python3.6/dist-packages/pip
----------MXNet Info-----------
Version : 1.2.1
Directory : /usr/local/lib/python3.6/dist-packages/mxnet
Commit Hash : 106391a1f0ee012b1ea38764d711e76774ce77e1
----------System Info----------
Platform : Linux-4.4.0-124-generic-x86_64-with-Ubuntu-16.04-xenial
system : Linux
node : steve-pc
release : 4.4.0-124-generic
version : #148-Ubuntu SMP Wed May 2 13:00:18 UTC 2018
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen 7 1700 Eight-Core Processor
Stepping: 1
CPU MHz: 1550.000
CPU max MHz: 3000.0000
CPU min MHz: 1550.0000
BogoMIPS: 5999.21
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate retpoline retpoline_amd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero ibpb arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0114 sec, LOAD: 0.7577 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0394 sec, LOAD: 0.1501 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1848 sec, LOAD: 0.1442 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0323 sec, LOAD: 0.1578 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0224 sec, LOAD: 0.5354 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0184 sec, LOAD: 0.0980 sec.
```

I have a GTX 1070 installed in the Windows machine and a Titan V installed in the Ubuntu machine, with 32GB of RAM in both machines. The GTX 1070 and Titan V have 8GB and 12GB of VRAM respectively.

## Error Message:

```
Traceback (most recent call last):
  File "asymmetric_encoder_decoder_test.py", line 84, in <module>
    avg_loss = mx.nd.sum(loss).asscalar() / args.batch_size
  File "C:\Python36\lib\site-packages\mxnet\ndarray\ndarray.py", line 1894, in asscalar
    return self.asnumpy()[0]
  File "C:\Python36\lib\site-packages\mxnet\ndarray\ndarray.py", line 1876, in asnumpy
    ctypes.c_size_t(data.size)))
  File "C:\Python36\lib\site-packages\mxnet\base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:07:27] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\pooled_storage_manager.h:77: CUDA: an illegal memory access was encountered
```

```
Traceback (most recent call last):
  File "asymmetric_encoder_decoder_test.py", line 84, in <module>
    avg_loss = mx.nd.sum(loss).asscalar() / args.batch_size
  File "/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py", line 1894, in asscalar
    return self.asnumpy()[0]
  File "/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py", line 1876, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:41:02] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: an illegal memory access was encountered

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f9912) [0x7fa9237c5912]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f9ee8) [0x7fa9237c5ee8]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f4b6cf) [0x7fa9264176cf]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2f4fcdc) [0x7fa92641bcdc]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2af5369) [0x7fa925fc1369]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2b0a5c8) [0x7fa925fd65c8]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2b0a6eb) [0x7fa925fd66eb]
[bt] (7) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2942714) [0x7fa925e0e714]
[bt] (8) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x294897b) [0x7fa925e1497b]
[bt] (9) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x2948b9e) [0x7fa925e14b9e]
```

## Minimum reproducible example

I'm pretty sure this could be reduced further, but given the conditions under which the crash happens, I don't want to mess with it too much:

https://gist.githubusercontent.com/ssttevee/1476b2b680df4eb70e02ed04cb643e96

## Steps to reproduce

1. `wget https://gist.githubusercontent.com/ssttevee/1476b2b680df4eb70e02ed04cb643e96/raw/c1ac08912d995f8c25f453266d2f457139db95b1/asymmetric_encoder_decoder_test.py`
2. `python3.6 asymmetric_encoder_decoder_test.py --data_length=10000 --batch_size=6`

You might have to fiddle around with the batch size to cause it to crash. On my Windows machine, it crashes with `--data_length=10000 --batch_size=6`. On my Ubuntu machine, it crashes with `--data_length=1000 --batch_size=64`.
Those parameters are not the only ones that crash the program; a batch size within about ±2 of those values also causes it to crash.

## What have you tried to solve it?

1. Reinstalled various versions of the NVIDIA drivers and CUDA toolkit
2. Ran it on two different environments
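
Since the steps to reproduce involve fiddling with the batch size, that search can be automated. The following is only a minimal sketch: the helper names `build_commands` and `sweep` are hypothetical, the script is assumed to sit in the current directory, and a non-zero exit code is assumed to mean the run crashed (it does not distinguish the CUDA error from any other failure).

```python
import subprocess
import sys

# Script name taken from step 1 of the steps to reproduce.
SCRIPT = "asymmetric_encoder_decoder_test.py"


def build_commands(data_length, center, spread=2, python=sys.executable):
    """Return one argv list per batch size in [center - spread, center + spread]."""
    sizes = range(max(1, center - spread), center + spread + 1)
    return [
        [python, SCRIPT, f"--data_length={data_length}", f"--batch_size={size}"]
        for size in sizes
    ]


def sweep(data_length, center, spread=2):
    """Run every command and return the batch sizes whose run exited non-zero."""
    failed = []
    for cmd in build_commands(data_length, center, spread):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # The last argument has the form --batch_size=N.
            failed.append(int(cmd[-1].split("=")[1]))
    return failed
```

For example, `sweep(10000, 6)` would try batch sizes 4 through 8 with `--data_length=10000`, and `sweep(1000, 64)` would try 62 through 66. Because the crash is flaky, a batch size missing from the returned list is not proof that it is safe.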
