DickJC123 opened a new issue #15034: Flakey test: test_operator_gpu.py:test_rnntanh_bidirectional
URL: https://github.com/apache/incubator-mxnet/issues/15034

## Description

The failures are seen infrequently and nondeterministically (but within 1000 trials) when the test is run on an NVIDIA P40 GPU. Based on initial investigation, the problem is introduced by this commit:

```
1c49e40fd 2019-04-13 Hao Li Change RNN OP to stateful (#14476)
```

... which is not too surprising given the sizeable refactoring of the RNN code in that commit. Because the P40 has far fewer compute resources than the P100 and V100, I suspect a timing-related issue. No failures are seen on the P100 or V100, nor on the P40 with a checkout of the commit prior to 1c49e40fd.

Looking over that commit, I see changes in how the various 'spaces' are handled in the GPU case. Maybe the commit author @lihaofd can chime in on the need for, and motivation behind, these changes. Prior to the commit, the 'workspace' (sized by cudnnGetRNNWorkspaceSize) was allocated from MXNet's TempSpace; with the commit, it becomes a per-instance permanent allocation. Conversely, prior to the commit, the dropout state space was a per-instance permanent allocation, while with the commit it became managed by the MXNet context resources (and swapped in/out across the various instance uses). While I understand that MXNet is set up to manage the dropout state, is there any other motivation to make this switch?

When the test fails, the non-fused model output is random garbage. Supporting the notion that a race condition exists, the test failures go away when a waitall() is inserted in the test_operator.py function check_rnn_consistency:

```
dy = mx.random.uniform(shape=mod1.get_outputs()[0].shape)
mod1.backward(out_grads=[dy])
mx.nd.waitall()
mod2.backward(out_grads=[dy])
```

@ptrendx @eric-haibin-lin, I'd like to see this resolved by the 1.5 code freeze.
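For readers unfamiliar with the test, here is a minimal sketch of the fused-vs-stacked comparison it performs. This is paraphrased from tests/python/unittest/test_operator.py; the exact shapes, layer count, and prefixes below are illustrative assumptions, not copied from the test:

```
import mxnet as mx

# Illustrative dimensions: T = seq length, N = batch, I = input dim, H = hidden dim.
T, N, I, H = 5, 20, 800, 800

# Fused model: a single cuDNN-backed bidirectional tanh-RNN operator.
fused = mx.rnn.FusedRNNCell(H, num_layers=2, mode='rnn_tanh',
                            bidirectional=True)

# Unfused model: the same network stacked from elementary RNN cells.
stack = mx.rnn.SequentialRNNCell()
for i in range(2):
    stack.add(mx.rnn.BidirectionalCell(
        mx.rnn.RNNCell(H, activation='tanh', prefix='l%d_' % i),
        mx.rnn.RNNCell(H, activation='tanh', prefix='r%d_' % i),
        output_prefix='bi_rnntanh_%d_' % i))

# check_rnn_consistency then (roughly) binds both cells as modules over the
# same random input, shares the fused parameters with the stacked model,
# runs forward/backward on each, and asserts that the outputs and input
# gradients agree within tolerance.
```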
## Environment info (Required)

```
----------Python Info----------
Version      : 3.5.2
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Nov 12 2018 13:43:14')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.1.1
Directory    : /usr/local/lib/python3.5/dist-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /opt/mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-36-generic-x86_64-with-Ubuntu-16.04-xenial
system       : Linux
node         : 636ace361501
release      : 4.4.0-36-generic
version      : #55~14.04.1-Ubuntu SMP Fri Aug 12 11:49:30 UTC 2016
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
Stepping:              2
CPU MHz:               2902.046
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              5201.48
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
```

Package used (Python/R/Scala/Julia): (I'm using ...)

For Scala user, please provide:
1. Java version: (`java -version`)
2. Maven version: (`mvn -version`)
3. Scala runtime if applicable: (`scala -version`)

For R user, please provide R `sessionInfo()`:

## Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash:
(Paste the output of `git rev-parse HEAD` here.)

Build config:
(Paste the content of config.mk, or the build command.)

## Error Message:

Two example outputs shown:

```
======================================================================
FAIL: test_operator_gpu.test_rnntanh_bidirectional
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 182, in test_rnntanh_bidirectional
    check_rnn_consistency(fused, stack, T, N, I, H, 'add')
  File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 69, in check_rnn_consistency
    assert_allclose(mod1.get_input_grads()[0].asnumpy(), mod2.get_input_grads()[0].asnumpy(), rtol=rtol, atol=atol)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 1395, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 778, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.01, atol=0.0001

(mismatch 99.9725%)
 x: array([[[ 0.122356,  0.663351,  0.721616, ...,  0.300692,  0.809006,
          0.190476],
        [ 0.063241,  0.969914,  0.543127, ...,  0.040564,  0.695362,...
 y: array([[[-0.036576,  0.022715, -0.00182 , ...,  0.014202, -0.042219,
          0.026592],
        [-0.008018,  0.020907,  0.006875, ..., -0.017068, -0.04306 ,...
```

Second failure (note that the difference is again in the second, unfused rnn model only):

```
======================================================================
FAIL: test_operator_gpu.test_rnntanh_bidirectional
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 110, in test_new
    orig_test(*args, **kwargs)
  File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 182, in test_rnntanh_bidirectional
    check_rnn_consistency(fused, stack, T, N, I, H, 'add')
  File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 69, in check_rnn_consistency
    assert_allclose(mod1.get_input_grads()[0].asnumpy(), mod2.get_input_grads()[0].asnumpy(), rtol=rtol, atol=atol)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 1395, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 778, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.01, atol=0.0001

(mismatch 99.96375%)
 x: array([[[ 0.122356,  0.663351,  0.721616, ...,  0.300692,  0.809006,
          0.190476],
        [ 0.063241,  0.969914,  0.543127, ...,  0.040564,  0.695362,...
 y: array([[[  8.044541e-03,   5.417631e-02,   3.945356e-02, ...,
          -4.552861e-03,  -1.618103e-02,   7.161065e-02],
        [ -8.904358e-03,   3.195144e-02,   2.073918e-02, ...,...
```

## Steps to reproduce

```
MXNET_TEST_COUNT=1000 MXNET_TEST_SEED=42 nosetests --verbose -s --logging-level=DEBUG tests/python/gpu/test_operator_gpu.py:test_rnntanh_bidirectional
```

## What have you tried to solve it?

See above discussion. A sketch of a seed-sweeping reproduction loop follows.
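To estimate the failure rate more systematically, one can sweep seeds from a small driver script. The following is a minimal sketch, assuming an MXNet source checkout as the working directory and nosetests on the PATH; it simply re-invokes the reproduction command above with varying MXNET_TEST_SEED values:

```
import os
import subprocess

# Run the flaky test repeatedly, each batch under a different seed, and
# count how many batches fail. Purely illustrative; adjust counts to taste.
failures = 0
for seed in range(20):
    env = dict(os.environ,
               MXNET_TEST_COUNT='50',
               MXNET_TEST_SEED=str(seed))
    ret = subprocess.call(
        ['nosetests', '--verbose', '-s',
         'tests/python/gpu/test_operator_gpu.py:test_rnntanh_bidirectional'],
        env=env)
    if ret != 0:
        failures += 1
print('%d of 20 seed batches failed' % failures)
```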
