loadwiki opened a new issue #12438: core dump in macosx using big model
URL: https://github.com/apache/incubator-mxnet/issues/12438

My C++ program sometimes dumps core in libmxnet.so when the model is as large as 200 MB; there is no core dump with a small model.

## Environment info (Required)

iMac, macOS 10.13.6

## Build info (Required if built from source)

git diff make/config.mk

    @@ -82,7 +82,7 @@ USE_NCCL_PATH = NONE
     # whether use opencv during compilation
     # you can disable it, however, you will not able to use
     # imbin iterator
    -USE_OPENCV = 1
    +USE_OPENCV = 0
     #whether use libjpeg-turbo for image decode without OpenCV wrapper
     USE_LIBJPEG_TURBO = 0
    @@ -90,7 +90,7 @@ USE_LIBJPEG_TURBO = 0
     USE_LIBJPEG_TURBO_PATH = NONE
     # use openmp for parallelization
    -USE_OPENMP = 1
    +USE_OPENMP = 0

## Error Message:

    lldb main -c /cores/core.97762
    (lldb) target create "main" --core "/cores/core.97762"
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy.py", line 52, in <module>
        import weakref
      File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/weakref.py", line 14, in <module>
        from _weakref import (
    ImportError: cannot import name _remove_dead_weakref
    Core file '/cores/core.97762' (x86_64) was loaded.
    (lldb) bt
    warning: could not execute support code to read Objective-C class data in the process. This may reduce the quality of type information available.
    * thread #1, stop reason = signal SIGSTOP
      * frame #0: 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10
        frame #1: 0x00007fff64046589 libsystem_pthread.dylib`_pthread_cond_wait + 732
        frame #2: 0x00007fff61c81cb0 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
        frame #3: 0x000000010d6bc364 libmxnet.so`mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) + 596
        frame #4: 0x000000010d7cd49a libmxnet.so`mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const + 954
        frame #5: 0x000000010d6ad0d4 libmxnet.so`MXPredGetOutput + 340
        frame #6: 0x000000010c1cac30 main`Infer(pred_hnd=0x00007fcba2f00000, image_data=size=1, data=size=1) at face_predict.cpp:296
        frame #7: 0x000000010c120e99 main`process_camera(model_path="../models/ncnn", camera=0x00007ffee3af5170, output_folder="./output/192.168.150.244", mainThread=true) at main.cpp:278
        frame #8: 0x000000010c125f42 main`main(argc=4, argv=0x00007ffee3af57b0) at main.cpp:484
        frame #9: 0x00007fff63d2d015 libdyld.dylib`start + 1
    (lldb) thread list
    Process 0 stopped
    * thread #1: tid = 0x0000, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
      thread #2: tid = 0x0001, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
      thread #3: tid = 0x0002, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
      thread #4: tid = 0x0003, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
      thread #5: tid = 0x0004, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
      thread #6: tid = 0x0005, 0x000000010c589a4a libmxnet.so`void mxnet::op::BatchNormForwardImpl<mshadow::cpu, float, float>(mshadow::Stream<mshadow::cpu>*, mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 1002, stop reason = signal SIGSTOP
      thread #7: tid = 0x0006, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
      thread #8: tid = 0x0007, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
      thread #9: tid = 0x0008, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
      thread #10: tid = 0x0009, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
      thread #11: tid = 0x000a, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
      thread #12: tid = 0x000b, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
      thread #13: tid = 0x000c, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP

## Minimum reproducible example

There is no obvious condition that causes the core dump. If I manually send a SIGSTOP signal to my main program, it stops as usual. I'm curious why the core dump shows SIGSTOP as the stop reason rather than a segmentation fault, an abort, or some other signal.

At first I compiled the mxnet master branch. Then I switched to the release tag '1.2.1.rc1', and the same thing happens.
