DickJC123 opened a new issue #18747:
URL: https://github.com/apache/incubator-mxnet/issues/18747


   ## Description
   This is a problem I ran into during the development of PR 
https://github.com/apache/incubator-mxnet/pull/18694, and **I have included a 
fix** in commit 
https://github.com/apache/incubator-mxnet/pull/18694/commits/95bfe3a642f07ffd0c78d965b7f590cee75a44fd.
   
   An example invocation of a test that is decorated with @retry(3) and that 
fails on its first attempt (succeeding on its 2nd) is:
   ```
   MXNET_TEST_SEED=757747955 pytest --verbose -s --log-cli-level=DEBUG tests/python/gpu/test_operator_gpu.py::test_np_mixedType_unary_funcs[float16-4-rint-None--5.0-5.0]
   ```
   I've posted the error message showing the segfault below.
   
   The problem seems to stem from the current `retry()` implementation, which 
stores any caught exception in a variable `err` and keeps it alive while it 
makes further retry attempts of the test.  I believe the segfault is triggered 
when the `err` object is finally garbage collected (does the exception hold 
stack trace pointers that are now stale?).  The fix is to not retain the 
exception past the iteration that generated it.
   
   In coming up with the above explanation, I determined that retaining only 
the exception string also avoids the segfault and so would work as a fix.
   So before:
   ```
   err = e
   ...
   raise err
   ```
   could become:
   ```
   err_msg = str(e)
   ...
   raise AssertionError(err_msg)
   ```
   I prefer to stick with the initial fix in the PR, which doesn't regenerate 
the exception.
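
To illustrate the idea of not retaining the exception past the iteration that generated it, here is a minimal sketch. It is not the code from the PR: the decorator body and the `flaky_test` example are assumptions written for illustration, and only the `retry(n)` name comes from the description above.

```python
import functools


def retry(n):
    """Retry the wrapped test up to n times (illustrative sketch only)."""
    assert n > 0

    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            for attempt in range(n):
                try:
                    return f(*args, **kwargs)
                except AssertionError:
                    # Re-raise only on the final attempt. Earlier failures are
                    # dropped here, so no exception object (or its traceback)
                    # is kept alive across iterations.
                    if attempt == n - 1:
                        raise
        return wrapper
    return decorate


attempts = []


@retry(3)
def flaky_test():
    # Hypothetical test: fails on its first attempt, passes on its second.
    attempts.append(1)
    assert len(attempts) >= 2
```

Because the bare `raise` re-raises the exception currently being handled, the last failure propagates with its original traceback, and nothing from earlier attempts is stored in a variable that outlives the `except` block.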
   
   ### Error Message
   ```
   
   -------------------------------- live log call --------------------------------
   INFO     common:common.py:221 Setting test np/mx/python random seeds, use 
MXNET_TEST_SEED=757747955 to reproduce.
   rint float16 (2, 2, 2, 2)
   
   *** Maximum errors for vector of size 16:  rtol=0.001, atol=1e-05
   
     1: Error 99864.382812  Location of error: (0, 1, 1, 1), a=-1.00000000, 
b=-0.00000000
   rint float16 (3, 3, 3, 2)
   rint float16 (1, 0, 2)
   PASSEDFatal Python error: Segmentation fault
   
   Current thread 0x00007f393667f740 (most recent call first):
     File "/opt/mxnet/python/mxnet/ndarray/ndarray.py", line 2570 in asnumpy
     File "/opt/mxnet/python/mxnet/numpy/multiarray.py", line 1251 in __repr__
     File "/usr/local/lib/python3.6/dist-packages/_pytest/_io/saferepr.py", 
line 56 in repr_instance
     File "/usr/lib/python3.6/reprlib.py", line 65 in repr1
     File "/usr/lib/python3.6/reprlib.py", line 55 in repr
     File "/usr/local/lib/python3.6/dist-packages/_pytest/_io/saferepr.py", 
line 47 in repr
     File "/usr/local/lib/python3.6/dist-packages/_pytest/_io/saferepr.py", 
line 82 in saferepr
     File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 
689 in repr_args
     File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 
780 in repr_traceback_entry
     File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 
821 in repr_traceback
     File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 
877 in repr_excinfo
     File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 
631 in getrepr
     File "/usr/local/lib/python3.6/dist-packages/_pytest/nodes.py", line 326 
in _repr_failure_py
     File "/usr/local/lib/python3.6/dist-packages/_pytest/reports.py", line 296 
in from_item_and_call
     File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 253 
in pytest_runtest_makereport
     File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 
in _multicall
     File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 
in <lambda>
     File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 
in _hookexec
     File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in 
__call__
     File 
"/usr/local/lib/python3.6/dist-packages/flaky/flaky_pytest_plugin.py", line 132 
in call_and_report
     File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 100 
in runtestprotocol
     File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 84 
in pytest_runtest_protocol
     File 
"/usr/local/lib/python3.6/dist-packages/flaky/flaky_pytest_plugin.py", line 92 
in pytest_runtest_protocol
     File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 
in _multicall
     File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 
in <lambda>
     File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 
in _hookexec
     File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in 
__call__
     File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 271 in 
pytest_runtestloop
     File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 
in _multicall
     File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 
in <lambda>
     File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 
in _hookexec
     File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in 
__call__
     File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 247 in 
_main
     File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 197 in 
wrap_session
     File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 240 in 
pytest_cmdline_main
     File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 
in _multicall
     File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 
in <lambda>
     File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 
in _hookexec
     File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in 
__call__
     File "/usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py", 
line 93 in main
     File "/usr/local/bin/pytest", line 8 in <module>
   Segmentation fault (core dumped)
   
   ```
   



