RuRo edited a comment on issue #18090: Aborted unix-gpu CI
URL: 
https://github.com/apache/incubator-mxnet/issues/18090#issuecomment-616029228
 
 
   Here is my environment:
   
   <details>
   
   ```
   ----------Python Info----------
   Version      : 3.8.2
   Compiler     : GCC 9.2.1 20200130
   Build        : ('default', 'Feb 26 2020 22:21:03')
   Arch         : ('64bit', 'ELF')
   ------------Pip Info-----------
   Version      : 20.0.2
   Directory    : /home/ruro/.local/lib/python3.8/site-packages/pip
   ----------MXNet Info-----------
   Version      : 2.0.0
   Directory    : /usr/lib/python3.8/site-packages/mxnet
   Num GPUs     : 1
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   Platform     : Linux-5.4.31-1-MANJARO-x86_64-with-glibc2.2.5
   system       : Linux
   node         : ruro-laptop
   release      : 5.4.31-1-MANJARO
   version      : #1 SMP PREEMPT Wed Apr 8 10:25:32 UTC 2020
   ----------Hardware Info----------
   machine      : x86_64
   processor    : 
   Architecture:                    x86_64
   CPU op-mode(s):                  32-bit, 64-bit
   Byte Order:                      Little Endian
   Address sizes:                   39 bits physical, 48 bits virtual
   CPU(s):                          8
   On-line CPU(s) list:             0-7
   Thread(s) per core:              2
   Core(s) per socket:              4
   Socket(s):                       1
   NUMA node(s):                    1
   Vendor ID:                       GenuineIntel
   CPU family:                      6
   Model:                           158
   Model name:                      Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
   Stepping:                        9
   CPU MHz:                         3300.060
   CPU max MHz:                     3800.0000
   CPU min MHz:                     800.0000
   BogoMIPS:                        5602.18
   Virtualization:                  VT-x
   L1d cache:                       128 KiB
   L1i cache:                       128 KiB
   L2 cache:                        1 MiB
   L3 cache:                        6 MiB
   NUMA node0 CPU(s):               0-7
   Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
   Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
   Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
   Vulnerability Meltdown:          Mitigation; PTI
   Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
   Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
   Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
   Vulnerability Tsx async abort:   Not affected
   Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
   ```
   
   </details>
   
   My MXNet install was built from source from the latest master a couple of days ago. The CUDA version is 10.2 and the GPU is a GTX 1050.
   
   The time it takes for the lockup to happen can vary quite a lot. I'd recommend starting a couple of copies of this script and leaving them running for a minute or so to be absolutely sure.
   
   Interrupting the script before it hangs indeed sometimes results in a 
`Segmentation Fault`. After the script hangs, it seems to ignore `SIGINT` 
completely and has to be killed by `SIGTERM` or something like that.
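
For anyone who wants to automate the reproduction steps above, here is a rough sketch of a harness that starts several copies of a command, waits for a while, and then escalates through `SIGINT`, `SIGTERM` and `SIGKILL` for any copy that is still running. The function name `run_copies`, the number of copies, and the timeouts are all assumptions made for illustration, not something taken from the issue, and the sketch is Unix-only because it relies on `signal.SIGKILL`.

```python
import signal
import subprocess
import sys
import time

# Signals tried in order on a process that is still alive after the deadline.
ESCALATION = (
    ("SIGINT", signal.SIGINT),
    ("SIGTERM", signal.SIGTERM),
    ("SIGKILL", signal.SIGKILL),
)


def run_copies(cmd, copies=4, timeout=60.0, grace=5.0):
    """Run `copies` instances of `cmd` in parallel and report hangs.

    Returns a list of (index, status, returncode) tuples. A process that is
    still running `timeout` seconds after launch is treated as hung.
    """
    procs = [subprocess.Popen(cmd) for _ in range(copies)]
    deadline = time.monotonic() + timeout
    results = []
    for index, proc in enumerate(procs):
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
            results.append((index, "exited", proc.returncode))
            continue
        except subprocess.TimeoutExpired:
            pass
        # Still alive: send progressively stronger signals until it stops.
        for name, sig in ESCALATION:
            proc.send_signal(sig)
            try:
                proc.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                continue
            results.append((index, "hung, stopped by " + name, proc.returncode))
            break
        else:
            results.append((index, "hung, not stopped", None))
    return results


if __name__ == "__main__":
    # Everything after the script name is the command to run in parallel.
    for row in run_copies(sys.argv[1:]):
        print(row)
```

Saved under a hypothetical name such as `hang_harness.py`, it could be invoked as `python hang_harness.py python test.py`. A copy that ignores `SIGINT` but dies on `SIGTERM` would show up as `hung, stopped by SIGTERM`.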
   
   Also, here is the feature list for my MXNet install:
   <details>
   
   ```
   [✔ CUDA,
    ✔ CUDNN,
    ✔ NCCL,
    ✔ CUDA_RTC,
    ✖ TENSORRT,
    ✔ CPU_SSE,
    ✔ CPU_SSE2,
    ✔ CPU_SSE3,
    ✔ CPU_SSE4_1,
    ✔ CPU_SSE4_2,
    ✖ CPU_SSE4A,
    ✔ CPU_AVX,
    ✔ CPU_AVX2,
    ✔ OPENMP,
    ✖ SSE,
    ✔ F16C,
    ✖ JEMALLOC,
    ✖ BLAS_OPEN,
    ✖ BLAS_ATLAS,
    ✔ BLAS_MKL,
    ✖ BLAS_APPLE,
    ✔ LAPACK,
    ✔ MKLDNN,
    ✖ OPENCV,
    ✖ CAFFE,
    ✖ PROFILER,
    ✔ DIST_KVSTORE,
    ✖ CXX14,
    ✖ INT64_TENSOR_SIZE,
    ✔ SIGNAL_HANDLER,
    ✖ DEBUG,
    ✖ TVM_OP]
   ```
   
   </details>
   
   Update: some further testing revealed that
   1) It's not a GPU-specific problem: I removed `with mx.Context(mx.gpu()):` and the problem persists.
   2) Running the test with `MXNET_ENABLE_CYTHON=0 python test.py` fixes the issue.
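
To compare the two configurations side by side, something along these lines might help. It is only a sketch: the helper name `run_with_cython_setting` and the 60 second timeout are assumptions, and the baseline run simply leaves `MXNET_ENABLE_CYTHON` unset rather than guessing at another value.

```python
import os
import subprocess
import sys


def run_with_cython_setting(cmd, disable_cython, timeout=60.0):
    """Run `cmd` once and return its exit code, or None if it timed out.

    With disable_cython=True the child sees MXNET_ENABLE_CYTHON=0; otherwise
    the variable is removed so the child runs with its default behaviour.
    """
    env = dict(os.environ)
    if disable_cython:
        env["MXNET_ENABLE_CYTHON"] = "0"
    else:
        env.pop("MXNET_ENABLE_CYTHON", None)
    try:
        # subprocess.run kills the child itself when the timeout expires.
        return subprocess.run(cmd, env=env, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    command = sys.argv[1:]
    print("default:", run_with_cython_setting(command, disable_cython=False))
    print("cython disabled:", run_with_cython_setting(command, disable_cython=True))
```

A return value of `None` means the run did not finish within the timeout, which is how a hang would show up here. Since the lockup is intermittent, a single clean run of the default configuration does not prove much, so it is worth repeating the comparison several times.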

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
