(Brief description of the problem in no more than 2 sentences.)
My C++ program sometimes core dumps inside libmxnet.so when the model is large 
(about 200 MB);
it does not core dump with a small model.
## Environment info (Required)
iMac, macOS 10.13.6
## Build info (Required if built from source)
git diff make/config.mk
@@ -82,7 +82,7 @@ USE_NCCL_PATH = NONE
 # whether use opencv during compilation
 # you can disable it, however, you will not able to use
 # imbin iterator
-USE_OPENCV = 1
+USE_OPENCV = 0

 #whether use libjpeg-turbo for image decode without OpenCV wrapper
 USE_LIBJPEG_TURBO = 0
@@ -90,7 +90,7 @@ USE_LIBJPEG_TURBO = 0
 USE_LIBJPEG_TURBO_PATH = NONE

 # use openmp for parallelization
-USE_OPENMP = 1
+USE_OPENMP = 0

## Error Message:
(Paste the complete error message, including stack trace.)
lldb main -c /cores/core.97762
(lldb) target create "main" --core "/cores/core.97762"
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy.py", line 52, in <module>
    import weakref
  File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/weakref.py", line 14, in <module>
    from _weakref import (
ImportError: cannot import name _remove_dead_weakref
Core file '/cores/core.97762' (x86_64) was loaded.
(lldb) bt
warning: could not execute support code to read Objective-C class data in the process. This may reduce the quality of type information available.
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff64046589 libsystem_pthread.dylib`_pthread_cond_wait + 732
    frame #2: 0x00007fff61c81cb0 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
    frame #3: 0x000000010d6bc364 libmxnet.so`mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) + 596
    frame #4: 0x000000010d7cd49a libmxnet.so`mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const + 954
    frame #5: 0x000000010d6ad0d4 libmxnet.so`MXPredGetOutput + 340
    frame #6: 0x000000010c1cac30 main`Infer(pred_hnd=0x00007fcba2f00000, image_data=size=1, data=size=1) at face_predict.cpp:296
    frame #7: 0x000000010c120e99 main`process_camera(model_path="../models/ncnn", camera=0x00007ffee3af5170, output_folder="./output/192.168.150.244", mainThread=true) at main.cpp:278
    frame #8: 0x000000010c125f42 main`main(argc=4, argv=0x00007ffee3af57b0) at main.cpp:484
    frame #9: 0x00007fff63d2d015 libdyld.dylib`start + 1
(lldb) thread list
Process 0 stopped
* thread #1: tid = 0x0000, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #2: tid = 0x0001, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #3: tid = 0x0002, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #4: tid = 0x0003, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #5: tid = 0x0004, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #6: tid = 0x0005, 0x000000010c589a4a libmxnet.so`void mxnet::op::BatchNormForwardImpl<mshadow::cpu, float, float>(mshadow::Stream<mshadow::cpu>*, mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 1002, stop reason = signal SIGSTOP
  thread #7: tid = 0x0006, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #8: tid = 0x0007, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #9: tid = 0x0008, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #10: tid = 0x0009, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
  thread #11: tid = 0x000a, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
  thread #12: tid = 0x000b, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
  thread #13: tid = 0x000c, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
## Minimum reproducible example
There is no obvious condition that causes the core dump.
I manually sent a SIGSTOP signal to my main program, and it stopped as usual.
I am curious why the core dump shows SIGSTOP as the stop reason rather than a 
segmentation fault, an abort, or some other signal.
At first I compiled the MXNet master branch. Then I switched to the release tag 
'1.2.1.rc1', and the same thing happens.
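
For anyone trying to reproduce the manual SIGSTOP test described above, here is a minimal POSIX sketch of what that signal normally does. It is not taken from my program and does not touch MXNet: it forks an idle child, sends it SIGSTOP, and checks via `waitpid` that the child is reported as stopped rather than terminated. The function name `child_stops_on_sigstop` is hypothetical, chosen only for this illustration.

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of MXNet or of the program in this report).
// Returns true if a child that receives SIGSTOP is reported by waitpid as
// stopped by that signal, i.e. it was suspended and not terminated.
bool child_stops_on_sigstop() {
    pid_t pid = fork();
    if (pid < 0) {
        return false;  // fork failed
    }
    if (pid == 0) {
        // Child: idle until the parent kills it.
        for (;;) {
            pause();
        }
    }

    // Parent: stop the child. SIGSTOP cannot be caught or ignored.
    if (kill(pid, SIGSTOP) != 0) {
        kill(pid, SIGKILL);
        waitpid(pid, nullptr, 0);
        return false;
    }

    // WUNTRACED makes waitpid also report children that have stopped.
    int status = 0;
    pid_t r = waitpid(pid, &status, WUNTRACED);
    bool stopped = (r == pid) && WIFSTOPPED(status) &&
                   WSTOPSIG(status) == SIGSTOP;

    // Clean up: SIGKILL terminates the child even while it is stopped.
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return stopped;
}
```

Calling `child_stops_on_sigstop()` from a small `main` should return true on a POSIX system, which matches what I saw when stopping my program by hand: the process is suspended, not crashed.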



[ Full content available at: 
https://github.com/apache/incubator-mxnet/issues/12438 ]