DickJC123 opened a new issue #15034: Flakey test: 
test_operator_gpu.py:test_rnntanh_bidirectional
URL: https://github.com/apache/incubator-mxnet/issues/15034
 
 
   ## Description
   
   The failures are seen infrequently and nondeterministically (but within 1000 trials) when run on an NVIDIA P40 GPU.  Based on initial investigation, the problem was introduced by this commit:
   ```
   1c49e40fd  2019-04-13  Hao Li                 Change RNN OP to stateful 
(#14476)
   ```
   ... which is not too surprising given the sizeable refactoring of the RNN code in that commit.
   Because the P40 has far fewer compute resources than the P100 and V100, I suspect a timing-related issue.  No failures are seen on P100 or V100, nor on the P40 at the commit prior to 1c49e40fd.  Looking over that commit, I see changes in how the various 'spaces' are handled in the GPU case.  Maybe the commit author @lihaofd can chime in on the need/motivation for these changes:
   
   Prior to the commit, the 'workspace' (whose size is given by cudnnGetRNNWorkspaceSize) was allocated from MXNet's TempSpace.  With the commit, it became a per-instance permanent allocation.
   
   Also, prior to the commit, the dropout state space was a per-instance permanent allocation, while with the commit it became managed by the MXNet context resources (and swapped in/out across uses by different instances).  While I understand that MXNet is set up to manage the dropout state, is there any other motivation for making this switch?
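   For reference, the trade-off between the two allocation strategies can be sketched roughly as follows (hypothetical Python, not MXNet's actual C++/cuDNN code; `TempSpacePool` and `StatefulRNNOp` are illustrative names):

```python
class TempSpacePool:
    """Pooled scratch buffer shared by all operator calls (pre-commit style)."""
    def __init__(self):
        self._buf = bytearray()

    def request(self, nbytes):
        # Grow the shared buffer on demand; every caller reuses the same space,
        # so concurrent pending uses must be synchronized.
        if len(self._buf) < nbytes:
            self._buf = bytearray(nbytes)
        return memoryview(self._buf)[:nbytes]


class StatefulRNNOp:
    """Per-instance permanent workspace (post-commit style)."""
    def __init__(self, workspace_bytes):
        # Allocated once and kept for the lifetime of the operator instance.
        self._workspace = bytearray(workspace_bytes)

    def workspace(self):
        return memoryview(self._workspace)
```

   The pooled scheme reuses memory across operators but requires that no two in-flight operations touch the shared buffer concurrently; the per-instance scheme avoids that sharing at the cost of holding the memory for the operator's lifetime.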
   
   When the test fails, the non-fused model's output is random garbage.  Supporting the notion that a race condition exists, the failures go away when a waitall() is inserted into the test_operator.py function check_rnn_consistency:
   ```
       dy = mx.random.uniform(shape=mod1.get_outputs()[0].shape)
       mod1.backward(out_grads=[dy])
       mx.nd.waitall()
       mod2.backward(out_grads=[dy])
   ```
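   A minimal sketch (hypothetical, not MXNet internals) of why such a synchronization point can mask a race: if the second computation starts before the first finishes writing a shared buffer, it reads garbage; waiting first (analogous to mx.nd.waitall() draining pending work) makes the read safe.

```python
import threading

shared = [0] * 4            # buffer shared by both "computations"
result = []

def producer():
    for i in range(4):
        shared[i] = i + 1   # first computation fills the buffer

t = threading.Thread(target=producer)
t.start()
t.join()                    # the waitall() analogue: wait for pending work
result.extend(shared)       # second computation reads only after the join

print(result)               # [1, 2, 3, 4] -- deterministic with the join
```

   Without the join, the read could observe a partially written (or untouched) buffer, which matches the symptom of the unfused model's gradients coming out as garbage only intermittently.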
   @ptrendx @eric-haibin-lin , I'd like to see this resolved by the 1.5 code 
freeze.
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.5.2
   Compiler     : GCC 5.4.0 20160609
   Build        : ('default', 'Nov 12 2018 13:43:14')
   Arch         : ('64bit', 'ELF')
   ------------Pip Info-----------
   Version      : 19.1.1
   Directory    : /usr/local/lib/python3.5/dist-packages/pip
   ----------MXNet Info-----------
   Version      : 1.5.0
   Directory    : /opt/mxnet/python/mxnet
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   Platform     : Linux-4.4.0-36-generic-x86_64-with-Ubuntu-16.04-xenial
   system       : Linux
   node         : 636ace361501
   release      : 4.4.0-36-generic
   version      : #55~14.04.1-Ubuntu SMP Fri Aug 12 11:49:30 UTC 2016
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                40
   On-line CPU(s) list:   0-39
   Thread(s) per core:    2
   Core(s) per socket:    10
   Socket(s):             2
   NUMA node(s):          2
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 63
   Model name:            Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
   Stepping:              2
   CPU MHz:               2902.046
   CPU max MHz:           3300.0000
   CPU min MHz:           1200.0000
   BogoMIPS:              5201.48
   Virtualization:        VT-x
   Hypervisor vendor:     vertical
   Virtualization type:   full
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              256K
   L3 cache:              25600K
   NUMA node0 CPU(s):     0-9,20-29
   NUMA node1 CPU(s):     10-19,30-39
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi 
flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm 
xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
   
   ```
   
   Package used (Python/R/Scala/Julia):
   (I'm using ...)
   
   For Scala user, please provide:
   1. Java version: (`java -version`)
   2. Maven version: (`mvn -version`)
   3. Scala runtime if applicable: (`scala -version`)
   
   For R user, please provide R `sessionInfo()`:
   
   ## Build info (Required if built from source)
   
   Compiler (gcc/clang/mingw/visual studio):
   
   MXNet commit hash:
   (Paste the output of `git rev-parse HEAD` here.)
   
   Build config:
   (Paste the content of config.mk, or the build command.)
   
   ## Error Message:
   Two example outputs shown:
   ```
   ======================================================================
   FAIL: test_operator_gpu.test_rnntanh_bidirectional
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in 
runTest
       self.test(*self.arg)
     File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in 
newfunc
       return func(*arg, **kw)
     File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 177, in 
test_new
       orig_test(*args, **kwargs)
     File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 110, in 
test_new
       orig_test(*args, **kwargs)
     File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 182, 
in test_rnntanh_bidirectional
       check_rnn_consistency(fused, stack, T, N, I, H, 'add')
     File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 69, 
in check_rnn_consistency
       assert_allclose(mod1.get_input_grads()[0].asnumpy(), 
mod2.get_input_grads()[0].asnumpy(), rtol=rtol, atol=atol)
     File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 
1395, in assert_allclose
       verbose=verbose, header=header, equal_nan=equal_nan)
     File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 
778, in assert_array_compare
       raise AssertionError(msg)
   AssertionError: 
   Not equal to tolerance rtol=0.01, atol=0.0001
   
   (mismatch 99.9725%)
    x: array([[[ 0.122356,  0.663351,  0.721616, ...,  0.300692,  0.809006,
             0.190476],
           [ 0.063241,  0.969914,  0.543127, ...,  0.040564,  0.695362,...
    y: array([[[-0.036576,  0.022715, -0.00182 , ...,  0.014202, -0.042219,
             0.026592],
           [-0.008018,  0.020907,  0.006875, ..., -0.017068, -0.04306 ,...
   -
   ```
   Second failure: note the difference is in the second (unfused) RNN model only.
   ```
   ======================================================================
   FAIL: test_operator_gpu.test_rnntanh_bidirectional
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in 
runTest
       self.test(*self.arg)
     File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in 
newfunc
       return func(*arg, **kw)
     File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 177, in 
test_new
       orig_test(*args, **kwargs)
     File "/opt/mxnet/tests/python/gpu/../unittest/common.py", line 110, in 
test_new
       orig_test(*args, **kwargs)
     File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 182, 
in test_rnntanh_bidirectional
       check_rnn_consistency(fused, stack, T, N, I, H, 'add')
     File "/opt/mxnet/tests/python/gpu/../unittest/test_operator.py", line 69, 
in check_rnn_consistency
       assert_allclose(mod1.get_input_grads()[0].asnumpy(), 
mod2.get_input_grads()[0].asnumpy(), rtol=rtol, atol=atol)
     File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 
1395, in assert_allclose
       verbose=verbose, header=header, equal_nan=equal_nan)
     File "/usr/local/lib/python3.5/dist-packages/numpy/testing/utils.py", line 
778, in assert_array_compare
       raise AssertionError(msg)
   AssertionError: 
   Not equal to tolerance rtol=0.01, atol=0.0001
   
   (mismatch 99.96375%)
    x: array([[[ 0.122356,  0.663351,  0.721616, ...,  0.300692,  0.809006,
             0.190476],
           [ 0.063241,  0.969914,  0.543127, ...,  0.040564,  0.695362,...
    y: array([[[  8.044541e-03,   5.417631e-02,   3.945356e-02, ...,
             -4.552861e-03,  -1.618103e-02,   7.161065e-02],
           [ -8.904358e-03,   3.195144e-02,   2.073918e-02, ...,...
   ```
   ## Steps to reproduce
   ```
   MXNET_TEST_COUNT=1000 MXNET_TEST_SEED=42 nosetests --verbose -s --logging-level=DEBUG tests/python/gpu/test_operator_gpu.py:test_rnntanh_bidirectional
   ```
   
   ## What have you tried to solve it?
   
   See above discussion.
