My C++ program sometimes dumps core in libmxnet.so when the model is as large
as 200 MB;
there is no core dump with a small model.
## Environment info (Required)
iMac, macOS 10.13.6
## Build info (Required if built from source)
git diff make/config.mk
@@ -82,7 +82,7 @@ USE_NCCL_PATH = NONE
# whether use opencv during compilation
# you can disable it, however, you will not able to use
# imbin iterator
-USE_OPENCV = 1
+USE_OPENCV = 0
#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
@@ -90,7 +90,7 @@ USE_LIBJPEG_TURBO = 0
USE_LIBJPEG_TURBO_PATH = NONE
# use openmp for parallelization
-USE_OPENMP = 1
+USE_OPENMP = 0
## Error Message:
lldb main -c /cores/core.97762
(lldb) target create "main" --core "/cores/core.97762"
Traceback (most recent call last):
File "<input>", line 1, in <module>
File
"/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy.py",
line 52, in <module>
import weakref
File
"/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/weakref.py",
line 14, in <module>
from _weakref import (
ImportError: cannot import name _remove_dead_weakref
Core file '/cores/core.97762' (x86_64) was loaded.
(lldb) bt
warning: could not execute support code to read Objective-C class data in the
process. This may reduce the quality of type information available.
* thread #1, stop reason = signal SIGSTOP
* frame #0: 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007fff64046589 libsystem_pthread.dylib`_pthread_cond_wait +
732
frame #2: 0x00007fff61c81cb0
libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
+ 18
frame #3: 0x000000010d6bc364
libmxnet.so`mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) + 596
frame #4: 0x000000010d7cd49a
libmxnet.so`mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const + 954
frame #5: 0x000000010d6ad0d4 libmxnet.so`MXPredGetOutput + 340
frame #6: 0x000000010c1cac30 main`Infer(pred_hnd=0x00007fcba2f00000,
image_data=size=1, data=size=1) at face_predict.cpp:296
frame #7: 0x000000010c120e99
main`process_camera(model_path="../models/ncnn", camera=0x00007ffee3af5170,
output_folder="./output/192.168.150.244", mainThread=true) at main.cpp:278
frame #8: 0x000000010c125f42 main`main(argc=4, argv=0x00007ffee3af57b0) at
main.cpp:484
frame #9: 0x00007fff63d2d015 libdyld.dylib`start + 1
(lldb) thread list
Process 0 stopped
* thread #1: tid = 0x0000, 0x00007fff63e7da16
libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
thread #2: tid = 0x0001, 0x00007fff63e7da16
libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
thread #3: tid = 0x0002, 0x00007fff63e7da16
libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
thread #4: tid = 0x0003, 0x00007fff63e7da16
libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
thread #5: tid = 0x0004, 0x00007fff63e7da16
libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
thread #6: tid = 0x0005, 0x000000010c589a4a libmxnet.so`void
mxnet::op::BatchNormForwardImpl<mshadow::cpu, float,
float>(mshadow::Stream<mshadow::cpu>*, mxnet::OpContext const&,
mxnet::op::BatchNormParam const&, std::__1::vector<mxnet::TBlob,
std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType,
std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob,
std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::TBlob,
std::__1::allocator<mxnet::TBlob> > const&) + 1002, stop reason = signal SIGSTOP
thread #7: tid = 0x0006, 0x00007fff63e7da16
libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
thread #8: tid = 0x0007, 0x00007fff63e7da16
libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
thread #9: tid = 0x0008, 0x00007fff63e7da16
libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
thread #10: tid = 0x0009, 0x00007fff63e7da16
libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
thread #11: tid = 0x000a, 0x00007fff63e7e28a
libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
thread #12: tid = 0x000b, 0x00007fff63e7e28a
libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
thread #13: tid = 0x000c, 0x00007fff63e7e28a
libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
## Minimum reproducible example
There is no obvious condition that triggers the core dump.
If I manually send a SIGSTOP signal to my main program, it simply stops as usual.
I'm curious why the core dump shows SIGSTOP as the stop reason rather than a
segmentation fault, an abort, or some other signal.
I first compiled the mxnet master branch. I then switched to the release tag
'1.2.1.rc1', and the same thing happens.
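
To illustrate why the SIGSTOP stop reason is puzzling: POSIX lets a process install handlers for SIGSEGV and SIGABRT, but not for SIGSTOP, so a handler in the main program could log the first two but could never observe a SIGSTOP. The sketch below only demonstrates that difference. The helper name `probe_handler` is hypothetical; it is not part of MXNet or of the program in the backtrace, and it says nothing about the actual cause of the core dump.

```cpp
#include <cerrno>
#include <cstring>
#include <signal.h>

// No-op handler used only to test whether installation is permitted.
static void noop_handler(int) {}

// Hypothetical helper: tries to install a handler for `signo`.
// Returns 0 if the handler could be installed, otherwise the errno value.
// The previous disposition is restored, so the probe has no lasting effect.
int probe_handler(int signo) {
    struct sigaction sa;
    struct sigaction old;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);

    if (sigaction(signo, &sa, &old) != 0) {
        return errno;  // POSIX specifies EINVAL for SIGSTOP and SIGKILL
    }
    sigaction(signo, &old, nullptr);  // restore the original disposition
    return 0;
}
```

Calling `probe_handler(SIGSEGV)` or `probe_handler(SIGABRT)` should return 0, while `probe_handler(SIGSTOP)` should return `EINVAL`, which is why a crash handler alone cannot tell where the SIGSTOP in the core file came from.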
[ Full content available at:
https://github.com/apache/incubator-mxnet/issues/12438 ]