DickJC123 opened a new issue #18747: URL: https://github.com/apache/incubator-mxnet/issues/18747
## Description

This is a problem I ran into during the development of PR https://github.com/apache/incubator-mxnet/pull/18694, and **I have included a fix** in commit https://github.com/apache/incubator-mxnet/pull/18694/commits/95bfe3a642f07ffd0c78d965b7f590cee75a44fd.

An example invocation of a test that is decorated with `@retry(3)` and that fails on its first attempt (succeeding on its 2nd) is:

```
MXNET_TEST_SEED=757747955 pytest --verbose -s --log-cli-level=DEBUG tests/python/gpu/test_operator_gpu.py::test_np_mixedType_unary_funcs[float16-4-rint-None--5.0-5.0]
```

I've posted the error message showing the segfault below.

The problem seems to center on the fact that the current `retry()` implementation stores any exception it catches in a variable `err`, which it retains while it pursues further retry attempts of the test. I believe that the segfault is triggered when the `err` object is finally garbage collected (does the exception have stack trace pointers that are now stale?). The fix is to not retain the exception past the iteration that generated it.

In coming up with the above explanation, I determined that retaining only the exception string also avoids the segfault and so would work as a fix. So before:

```
err = e
...
raise err
```

could become:

```
err_msg = str(e)
...
raise AssertionError(err_msg)
```

I prefer to stick with the initial fix in the PR, which doesn't regenerate the exception.

### Error Message

```
--------------------------------------------------------------------------------------- live log call ------------------------------------------------------------------------------[0/18716]
INFO common:common.py:221 Setting test np/mx/python random seeds, use MXNET_TEST_SEED=757747955 to reproduce.
rint float16 (2, 2, 2, 2)
*** Maximum errors for vector of size 16: rtol=0.001, atol=1e-05
1: Error 99864.382812 Location of error: (0, 1, 1, 1), a=-1.00000000, b=-0.00000000
rint float16 (3, 3, 3, 2)
rint float16 (1, 0, 2)
PASSEDFatal Python error: Segmentation fault

Current thread 0x00007f393667f740 (most recent call first):
  File "/opt/mxnet/python/mxnet/ndarray/ndarray.py", line 2570 in asnumpy
  File "/opt/mxnet/python/mxnet/numpy/multiarray.py", line 1251 in __repr__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/_io/saferepr.py", line 56 in repr_instance
  File "/usr/lib/python3.6/reprlib.py", line 65 in repr1
  File "/usr/lib/python3.6/reprlib.py", line 55 in repr
  File "/usr/local/lib/python3.6/dist-packages/_pytest/_io/saferepr.py", line 47 in repr
  File "/usr/local/lib/python3.6/dist-packages/_pytest/_io/saferepr.py", line 82 in saferepr
  File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 689 in repr_args
  File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 780 in repr_traceback_entry
  File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 821 in repr_traceback
  File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 877 in repr_excinfo
  File "/usr/local/lib/python3.6/dist-packages/_pytest/_code/code.py", line 631 in getrepr
  File "/usr/local/lib/python3.6/dist-packages/_pytest/nodes.py", line 326 in _repr_failure_py
  File "/usr/local/lib/python3.6/dist-packages/_pytest/reports.py", line 296 in from_item_and_call
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 253 in pytest_runtest_makereport
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/flaky/flaky_pytest_plugin.py", line 132 in call_and_report
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 100 in runtestprotocol
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 84 in pytest_runtest_protocol
  File "/usr/local/lib/python3.6/dist-packages/flaky/flaky_pytest_plugin.py", line 92 in pytest_runtest_protocol
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 271 in pytest_runtestloop
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 247 in _main
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 197 in wrap_session
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 240 in pytest_cmdline_main
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py", line 93 in main
  File "/usr/local/bin/pytest", line 8 in <module>
Segmentation fault (core dumped)
```

## To Reproduce

(If you developed
your own code, please provide a short script that reproduces the error. For existing examples, please provide a link.)

### Steps to reproduce

(Paste the commands you ran that produced the error.)

1.
2.

## What have you tried to solve it?

1.
2.

## Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

```
curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org