Hi Da

Reproduction instructions:

On the host:

Adjust the core pattern (as root):

$ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
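
For reference, core(5) defines the specifiers in that pattern as %h
(hostname), %e (executable name) and %t (time of the dump in seconds since
the epoch). Below is a rough sketch of the file name to look for afterwards.
expected_core_name is a hypothetical helper, not part of MXNet or the CI
scripts, and the values in the example call are made up:

```shell
# Hypothetical helper (not part of MXNet or the CI scripts): prints the file
# name that the pattern /tmp/core.%h.%e.%t would yield for the given values.
# %h = hostname, %e = executable name, %t = dump time (seconds since epoch).
expected_core_name() {
    host=$1
    exe=$2
    epoch=$3
    printf '/tmp/core.%s.%s.%s\n' "$host" "$exe" "$epoch"
}

# Example with made-up values: a process named python2.7 on host abc123.
expected_core_name abc123 python2.7 1525360000
```

You can read the active pattern back with
cat /proc/sys/kernel/core_pattern to confirm the change took effect.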


Use the following patch:

===============

diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
--- a/3rdparty/mkldnn
+++ b/3rdparty/mkldnn
@@ -1 +1 @@
-Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
+Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
index 027e287..62649c9 100755
--- a/ci/docker/runtime_functions.sh
+++ b/ci/docker/runtime_functions.sh
@@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
     # https://github.com/apache/incubator-mxnet/issues/10026
     #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
     export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
-    nosetests-2.7 --verbose tests/python/unittest
-    nosetests-2.7 --verbose tests/python/train
-    nosetests-2.7 --verbose tests/python/quantization
+    export MXNET_TEST_SEED=11
+    export MXNET_MODULE_SEED=812478194
+    pwd
+    export MXNET_TEST_COUNT=10000
+    ulimit -c unlimited
+    ulimit -c
+    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
+    #nosetests-2.7 --verbose tests/python/train
+    #nosetests-2.7 --verbose tests/python/quantization
 }

 unittest_ubuntu_python3_cpu() {



==============
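
The while loop in the patch reruns the test until it exits with a non-zero
status, which is what a segfault produces. The same idea can be sketched
with a round counter and an upper bound, in case you don't want it to spin
forever. run_until_failure is a hypothetical helper, not something in
runtime_functions.sh:

```shell
# Hypothetical helper: rerun a command until it fails, like the while loop
# in the patch, but with a round counter and an upper bound on the rounds.
run_until_failure() {
    max_rounds=$1
    shift
    round=0
    while [ "$round" -lt "$max_rounds" ]; do
        round=$((round + 1))
        # Any non-zero exit status (including death by signal) ends the loop.
        if ! "$@"; then
            echo "failed in round $round"
            return 1
        fi
    done
    echo "no failure in $max_rounds rounds"
    return 0
}
```

Usage would look like:

$ run_until_failure 1000 nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape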

Make sure the repo is clean, then build and execute the test:

$ ci/docker/runtime_functions.sh clean_repo

$ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
  ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu


The test loop stops as soon as the test crashes.

Then go into the container:


$ ci/build.py -p ubuntu_cpu --into-container --print-docker-run

A core file should be there.

You might need to install gdb as root by executing the previous command
without the uid, so that you can use apt-get.
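
A minimal sketch of picking up the newest core file, assuming it landed in
/tmp as configured by the core pattern above. newest_core is a hypothetical
helper. The commented gdb steps assume a Debian/Ubuntu image and that the
crashing executable was the python2.7 interpreter; adjust if yours differs:

```shell
# Hypothetical helper: print the most recently modified core file in a
# directory (default /tmp, matching the core_pattern set on the host).
newest_core() {
    dir=${1:-/tmp}
    # ls -t sorts by modification time, newest first; prints nothing if
    # no core.* file exists.
    ls -t "$dir"/core.* 2>/dev/null | head -n 1
}

# Inside the container, as root (assumptions: apt-based image, python2.7
# was the crashing executable):
#   apt-get update && apt-get install -y gdb
#   gdb "$(command -v python2.7)" "$(newest_core)"
#   (gdb) bt
```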




Good luck.







On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dzz...@amazon.com> wrote:

> Thanks a lot for locating the error.
> Could you tell me How you reproduce the error?
>
> On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:
>
>     Looks like a problem in mkl's same_shape
>
>     the pointer to mkldnn::memory::desc &desc  looks invalid.
>
>     (More stack frames follow...)
>     (gdb) p desc
>     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
>     (gdb) p dtype
>     $2 = 0
>     (gdb) p shape
>     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
>     {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_
> = 0,
>         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No
> data
>     fields>}
>     (gdb)
>
>
>     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzz...@amazon.com> wrote:
>
>     > There are a few problems with valgrind, which makes it not an ideal
> tool
>     > for mxnet with python interface.
>     >
>     > First, valgrind generates a huge number of irrelevant messages, most of
>     > them from Python itself.
>     >
>     > Second, valgrind can't emulate all CPU instructions. I remember that when
>     > I ran valgrind with mxnet, valgrind exited with a strange error. I later
>     > found that it was caused by an unsupported CPU instruction.
>     >
>     > Third, valgrind doesn't support multithreading well. As far as I
> know,
>     > valgrind runs everything in a single thread even if the program uses
>     > multi-threading. An error like this, which is likely caused by race
>     > condition, can't be caught by valgrind.
>     >
>     > I used to use Address Sanitizer for memory errors. This tool is much
>     > faster and can work with multi-threads. However, it doesn't work with
>     > Python for some reason.
>     >
>     > One thing we potentially can do is to use memory checker for C++ unit
>     > tests. Not sure it'll cover all memory errors we want.
>     >
>     > Best,
>     > Da
>     >
>     > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com>
> wrote:
>     >
>     >     It's very difficult to reproduce, non-deterministic. We were also
>     > running
>     >     without signal handlers in CI so there are no stack traces
>     > unfortunately.
>     >
>     >     Care to elaborate why valgrind doesn't work with Python?
>     >
>     >
>     >
>     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1...@gmail.com>
>     > wrote:
>     >
>     >     > Can we build it in CI? The segfault doesn't happen infrequently.
>     >     >
>     >     > On May 2, 2018 at 11:34 PM, "Chris Olivier" <cjolivie...@gmail.com> wrote:
>     >     >
>     >     > > you can try Intel Inspector, which is like an enhanced
> version of
>     >     > valgrind
>     >     > > with a GUI and whatnot.
>     >     > >
>     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <
> zhengda1...@gmail.com>
>     > wrote:
>     >     > >
>     >     > > > valgrind doesn't work with Python. also, valgrind doesn't
>     > support some
>     >     > > > CPU instructions used by MXNet (I think some instructions
>     > related to
>     >     > > > random generator).
>     >     > > >
>     >     > > >
>     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <
>     > bhavintha...@gmail.com>
>     >     > > > wrote:
>     >     > > > > Have you tried running with valgrind to get some clues
> on the
>     >     > > root-cause?
>     >     > > > >
>     >     > > > > Bhavin Thaker.
>     >     > > > >
>     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <
> zhengda1...@gmail.com
>     > >
>     >     > wrote:
>     >     > > > >
>     >     > > > >> It might also be possible that this isn't an MKLDNN bug.
>     >     > > > >> I just saw a similar memory error without MKLDNN build.
>     >     > > > >>
>     >     > > > >>
>     >     > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/
>     > organizations/jenkins/
>     >     > > incubator-mxnet/detail/PR-10783/1/pipeline
>     >     > > > >>
>     >     > > > >> Best,
>     >     > > > >> Da
>     >     > > > >>
>     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <
> dzz...@amazon.com>
>     >     > wrote:
>     >     > > > >> > There might be a race condition that causes the memory
>     > error.
>     >     > > > >> > It might be caused by this PR:
>     >     > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/
> files
>     >     > > > >> > This PR removes MKLDNN memory from NDArray.
>     >     > > > >> > However, I don't know why this causes memory error. If
>     > someone is
>     >     > > > using
>     >     > > > >> the memory, it should still hold the memory with shared
>     > pointer.
>     >     > > > >> > But I do see the memory error increase after this PR
> is
>     > merged.
>     >     > > > >> >
>     >     > > > >> > Best,
>     >     > > > >> > Da
>     >     > > > >> >
>     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <
>     >     > pedro.larroy.li...@gmail.com>
>     >     > > > >> wrote:
>     >     > > > >> >
>     >     > > > >> >     I couldn't reproduce locally with:
>     >     > > > >> >
>     >     > > > >> >     ci/build.py -p ubuntu_cpu
> /work/runtime_functions.sh
>     >     > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform
>     > ubuntu_cpu
>     >     > > > >> >     /work/runtime_functions.sh
> unittest_ubuntu_python2_cpu
>     >     > > > >> >
>     >     > > > >> >
>     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <
>     >     > > > >> pedro.larroy.li...@gmail.com>
>     >     > > > >> >     wrote:
>     >     > > > >> >
>     >     > > > >> >     > Hi
>     >     > > > >> >     >
>     >     > > > >> >     > Seems master is not running  anymore, there's a
>     > segmentation
>     >     > > > fault
>     >     > > > >> using
>     >     > > > >> >     > MKLDNN-CPU
>     >     > > > >> >     >
>     >     > > > >> >     >
>     >     > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/
>     > organizations/jenkins/
>     >     > > > >> >     > incubator-mxnet/detail/master/801/pipeline/662
>     >     > > > >> >     >
>     >     > > > >> >     >
>     >     > > > >> >     > I see my PRs failing with a similar error.
>     >     > > > >> >     >
>     >     > > > >> >     > Pedro
>     >     > > > >> >     >
>     >     > > > >> >
>     >     > > > >> >
>     >     > > > >>
>     >     > > >
>     >     > >
>     >     >
>     >
>     >
>     >
>
>
>
