Hi Da,

Reproduction instructions:

On the host:

Adjust the core pattern:

$ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern

Use the following patch:

===============
diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
--- a/3rdparty/mkldnn
+++ b/3rdparty/mkldnn
@@ -1 +1 @@
-Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
+Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
index 027e287..62649c9 100755
--- a/ci/docker/runtime_functions.sh
+++ b/ci/docker/runtime_functions.sh
@@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
     # https://github.com/apache/incubator-mxnet/issues/10026
     #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
     export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
-    nosetests-2.7 --verbose tests/python/unittest
-    nosetests-2.7 --verbose tests/python/train
-    nosetests-2.7 --verbose tests/python/quantization
+    export MXNET_TEST_SEED=11
+    export MXNET_MODULE_SEED=812478194
+    pwd
+    export MXNET_TEST_COUNT=10000
+    ulimit -c unlimited
+    ulimit -c
+    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
+    #nosetests-2.7 --verbose tests/python/train
+    #nosetests-2.7 --verbose tests/python/quantization
 }
 
 unittest_ubuntu_python3_cpu() {
==============

Make sure the repo is clean, then build and execute the test:

$ ci/docker/runtime_functions.sh clean_repo
$ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu

The loop stops once the test crashes. Then go into the container:

$ ci/build.py -p ubuntu_cpu --into-container --print-docker-run

A core file should be there. You might need to install gdb as root; to do
that, execute the previous command without uid so you can use apt-get.

Good luck.

On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dzz...@amazon.com> wrote:

> Thanks a lot for locating the error.
> Could you tell me how you reproduce the error?
>
> On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
>
>     Looks like a problem in mkl's same_shape
>
>     the pointer to mkldnn::memory::desc &desc looks invalid.
>
>     (More stack frames follow...)
>     (gdb) p desc
>     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
>     (gdb) p dtype
>     $2 = 0
>     (gdb) p shape
>     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0, data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data fields>}
>     (gdb)
>
>
>     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzz...@amazon.com> wrote:
>
> > There are a few problems with valgrind, which make it not an ideal tool
> > for mxnet with the Python interface.
> >
> > First, valgrind generates a huge number of irrelevant messages, most of
> > them from Python itself.
> >
> > Second, valgrind can't emulate all CPU instructions. I remember that when
> > I ran valgrind with mxnet, valgrind exited with a strange error. I later
> > found that it was caused by an unsupported CPU instruction.
> >
> > Third, valgrind doesn't support multithreading well. As far as I know,
> > valgrind runs everything in a single thread even if the program uses
> > multi-threading. An error like this, which is likely caused by a race
> > condition, can't be caught by valgrind.
> >
> > I used to use Address Sanitizer for memory errors. This tool is much
> > faster and can work with multiple threads. However, it doesn't work with
> > Python for some reason.
> >
> > One thing we potentially can do is to use a memory checker for the C++ unit
> > tests. Not sure it'll cover all memory errors we want.
> >
> > Best,
> > Da
> >
> > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> >
> >     It's very difficult to reproduce, non-deterministic. We were also
> >     running without signal handlers in CI so there are no stack traces
> >     unfortunately.
> >
> >     Care to elaborate why valgrind doesn't work with Python?
> >
> >
> >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1...@gmail.com> wrote:
> >
> > > Can we build it in CI? The segfault doesn't happen infrequently.
> > >
> > > On May 2, 2018, at 11:34 PM, "Chris Olivier" <cjolivie...@gmail.com> wrote:
> > >
> > > > you can try Intel Inspector, which is like an enhanced version of
> > > > valgrind with a GUI and whatnot.
> > > >
> > > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1...@gmail.com> wrote:
> > > >
> > > > > valgrind doesn't work with Python. also, valgrind doesn't support some
> > > > > CPU instructions used by MXNet (I think some instructions related to
> > > > > random generator).
> > > > >
> > > > >
> > > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavintha...@gmail.com>
> > > > > wrote:
> > > > > > Have you tried running with valgrind to get some clues on the
> > > > > > root-cause?
> > > > > >
> > > > > > Bhavin Thaker.
> > > > > >
> > > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1...@gmail.com> wrote:
> > > > > >
> > > > > >> It might also be possible that this isn't an MKLDNN bug.
> > > > > >> I just saw a similar memory error without MKLDNN build.
> > > > > >>
> > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
> > > > > >>
> > > > > >> Best,
> > > > > >> Da
> > > > > >>
> > > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzz...@amazon.com> wrote:
> > > > > >> > There might be a race condition that causes the memory error.
> > > > > >> > It might be caused by this PR:
> > > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
> > > > > >> > This PR removes MKLDNN memory from NDArray.
> > > > > >> > However, I don't know why this causes memory error. If someone is
> > > > > >> > using the memory, it should still hold the memory with shared pointer.
> > > > > >> > But I do see the memory error increase after this PR is merged.
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Da
> > > > > >> >
> > > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
> > > > > >> >
> > > > > >> >     I couldn't reproduce locally with:
> > > > > >> >
> > > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> > > > > >> >
> > > > > >> >
> > > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > > Hi
> > > > > >> > >
> > > > > >> > > Seems master is not running anymore, there's a segmentation
> > > > > >> > > fault using MKLDNN-CPU
> > > > > >> > >
> > > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
> > > > > >> > >
> > > > > >> > > I see my PRs failing with a similar error.
> > > > > >> > >
> > > > > >> > > Pedro