mseth10 opened a new issue #19360:
URL: https://github.com/apache/incubator-mxnet/issues/19360


   ## Description
   The nightly CD pipeline fails for CUDA 11.0 while testing the MXNet binaries 
with `pytest`. All tests pass; the error is thrown during cleanup, after 
`pytest` has finished running a test module. The error was first recorded 
when commit 
https://github.com/apache/incubator-mxnet/commit/480d027b85d3feff6fecec70be55eb244ddff289
 was merged, which dropped `pytest`'s `teardown` function. Before this 
commit, the CD pipeline ran successfully for all flavors.
   
   This error is specific to CUDA 11.0 and is not observed for CUDA 10.0 and 
10.1 as can be seen here:
   
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1848/pipeline/361/
   
   ### Error Message
   ```
   Stack trace:
   Stack trace:
     /usr/lib64/libcudnn_ops_infer.so.8 (                                       
    + 0x15cb68f)  [0x7f7f4ce3e68f]
     /usr/lib64/libcudnn_ops_infer.so.8 ( cudnnDestroy                          
    + 0x6f  )  [0x7f7f4ba78ddf]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( 
mshadow::Stream<mshadow::gpu>::DestroyDnnHandle()  + 0x2c  )  [0x7f81869a29ec]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( void 
mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)  + 0x13b )  
[0x7f81869a2c3b]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( void 
mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context,
 bool, 
mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*,
 std::shared_ptr<dmlc::ManualEvent> const&)  + 0x1bb )  [0x7f81869b83ab]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( 
std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), 
mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, 
bool)::{lambda()#4}::operator()() 
const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data
 const&, std::shared_ptr<dmlc::ManualEvent>&&)  + 0x36  )  [0x7f81869b86f6]
     /work/mxnet/python/mxnet/../../lib/libmxnet.so ( 
std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void 
(std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > 
>::_M_run()  + 0x32  )  [0x7f81869b3db2]
   /work/runtime_functions.sh: line 747:     6 Segmentation fault      (core 
dumped) pytest -m 'serial' -s --durations=50 --verbose 
tests/python/gpu/test_gluon_gpu.py
   2020-10-16 07:44:31,682 - root - INFO - Waiting for status of container 
a8b282e29adf for 600 s.
   2020-10-16 07:44:31,853 - root - INFO - Container exit status: {'Error': 
None, 'StatusCode': 139}
   2020-10-16 07:44:31,854 - root - ERROR - Container exited with an error 😞
   2020-10-16 07:44:31,854 - root - INFO - Executed command for reproduction:
   
   ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker 
--platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m 
/work/runtime_functions.sh cd_unittest_ubuntu cu110
   ```
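
The `StatusCode: 139` reported for the container matches the `Segmentation fault` line in the log: by shell convention, an exit status above 128 means the process was terminated by signal number `status - 128`, and 139 - 128 = 11, which is `SIGSEGV`. A minimal sketch of that arithmetic follows; `describe_exit_status` is a hypothetical helper written for illustration and is not part of MXNet or its CI scripts.

```python
import signal


def describe_exit_status(status_code: int) -> str:
    """Interpret an exit status using the shell's 128 + signal convention.

    Hypothetical helper for illustration only. A process may also call
    exit() with a value above 128 itself, so this is a heuristic.
    """
    if status_code > 128:
        signum = status_code - 128
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            return f"terminated by unknown signal {signum}"
    return f"exited with code {status_code}"


print(describe_exit_status(139))  # terminated by SIGSEGV
```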
   
   ### Steps to reproduce
   I was able to reproduce the error by following these steps on an AWS 
Ubuntu 18.04 Deep Learning Base AMI:
   ```
   alias python=python3
   
   git clone --recursive https://github.com/apache/incubator-mxnet.git
   cd incubator-mxnet
   pip3 install -r ci/requirements.txt --user
   
   sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" \
     -o /usr/local/bin/docker-compose
   sudo chmod +x /usr/local/bin/docker-compose
   sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
   
   python ci/build.py -e BRANCH=null --docker-registry mxnetci \
     --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m \
     /work/runtime_functions.sh build_static_libmxnet cu110
   
   python ci/build.py -e BRANCH=null --docker-registry mxnetci --nvidiadocker \
     --platform centos7_gpu_cu110 --docker-build-retries 3 --shm-size 500m \
     /work/runtime_functions.sh cd_unittest_ubuntu cu110
   ```
   
   ## What have you tried to solve it?
   
   1. The script above takes a long time because it runs many tests. To 
shorten the reproduction, I restricted it to the single failing module, 
`tests/python/gpu/test_gluon_gpu.py`, and disabled the remaining tests. Here 
is the diff:
   ```
   diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
   index 40405b961..6992caa36 100755
   --- a/ci/docker/runtime_functions.sh
   +++ b/ci/docker/runtime_functions.sh
   @@ -756,7 +756,9 @@ cd_unittest_ubuntu() {
        export DMLC_LOG_STACK_TRACE_DEPTH=10
    
        local mxnet_variant=${1:?"This function requires a mxnet variant as the first argument"}
   +    pytest -m 'serial' -s --durations=50 --verbose tests/python/gpu/test_gluon_gpu.py
    
   +    : '
        OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -m 'not serial' -n 4 --durations=50 --verbose tests/python/unittest
        pytest -m 'serial' --durations=50 --verbose tests/python/unittest
    
   @@ -782,6 +784,7 @@ cd_unittest_ubuntu() {
        if [[ ${mxnet_variant} = *mkl ]]; then
            OMP_NUM_THREADS=$(expr $(nproc) / 4) pytest -n 4 --durations=50 --verbose tests/python/mkl
        fi
   +    '
    }
   ```
   2. I put a print statement before the `waitall` 
[call](https://github.com/apache/incubator-mxnet/blob/d0ceecbb3e4f2154a7783cba8f6e152b8c9003b1/conftest.py#L68)
 to check whether it is executed, and observed that it runs after the 
module ends, as expected.
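
The check in step 2 can be sketched as follows. This is not the code from `conftest.py`. It assumes the `waitall` call sits in the teardown half of a module-scoped `yield` fixture, which the issue text does not show. The `waitall` function here is a hypothetical stand-in, so the snippet runs without MXNet or `pytest`, and the generator is driven by hand to mimic how `pytest` resumes a fixture after a module's tests finish.

```python
import sys

calls = []  # records the order of events for illustration


def waitall():
    """Hypothetical stand-in for MXNet's waitall; only records that it ran."""
    calls.append("waitall")


def module_scope_waitall():
    """Generator shaped like a module-scoped pytest fixture.

    In a real conftest.py it would be decorated with
    @pytest.fixture(scope="module", autouse=True).
    """
    yield  # pytest runs the module's tests while the fixture is suspended here
    # Write to stderr and flush so the marker is emitted before any later crash.
    print("module finished, calling waitall", file=sys.stderr, flush=True)
    calls.append("marker printed")
    waitall()


# Drive the generator by hand, the way pytest drives a yield fixture.
fixture = module_scope_waitall()
next(fixture)                # setup: runs up to the yield
calls.append("tests ran")    # stands in for the module's tests
next(fixture, None)          # teardown: runs the code after the yield
print(calls)
```

Flushing the marker matters in this scenario because the process later dies with a segmentation fault, and buffered output that has not been flushed by then can be lost.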
   
   ## Environment
   
   ***We recommend using our script for collecting the diagnostic information 
with the following command***
   `curl --retry 10 -s 
https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
 | python3`
   
   <details>
   <summary>Environment Information</summary>
   
   ```
   ----------Python Info----------
   Version      : 3.6.9
   Compiler     : GCC 8.4.0
   Build        : ('default', 'Oct  8 2020 12:12:24')
   Arch         : ('64bit', 'ELF')
   ------------Pip Info-----------
   Version      : 20.2.3
   Directory    : /usr/local/lib/python3.6/dist-packages/pip
   ----------MXNet Info-----------
   No MXNet installed.
   ----------System Info----------
   Platform     : Linux-5.4.0-1028-aws-x86_64-with-Ubuntu-18.04-bionic
   system       : Linux
   node         : ip-172-31-5-167
   release      : 5.4.0-1028-aws
   version      : #29~18.04.1-Ubuntu SMP Tue Oct 6 17:14:23 UTC 2020
   ----------Hardware Info----------
   machine      : x86_64
   processor    : x86_64
   Architecture:        x86_64
   CPU op-mode(s):      32-bit, 64-bit
   Byte Order:          Little Endian
   CPU(s):              64
   On-line CPU(s) list: 0-63
   Thread(s) per core:  2
   Core(s) per socket:  16
   Socket(s):           2
   NUMA node(s):        2
   Vendor ID:           GenuineIntel
   CPU family:          6
   Model:               85
   Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
   Stepping:            7
   CPU MHz:             3109.947
   BogoMIPS:            5000.00
   Hypervisor vendor:   KVM
   Virtualization type: full
   L1d cache:           32K
   L1i cache:           32K
   L2 cache:            1024K
   L3 cache:            36608K
   NUMA node0 CPU(s):   0-15,32-47
   NUMA node1 CPU(s):   16-31,48-63
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm 
constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf 
tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe 
popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 
3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms 
invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw 
avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0088 
sec, LOAD: 0.6286 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1193 sec, LOAD: 
0.1101 sec.
   Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: 
CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS 
finished in 0.055264949798583984 sec.
   Timing for FashionMNIST: 
https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz,
 DNS: 0.0012 sec, LOAD: 0.1100 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0014 sec, LOAD: 
0.3008 sec.
   Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: 
Forbidden, DNS finished in 0.0010542869567871094 sec.
   ----------Environment----------
   ```
   
   </details>
   @leezu @TristonC

