Re: segmentation fault in master using mkdlnn

Zheng, Da Thu, 03 May 2018 07:36:30 -0700

There are a few problems with valgrind that make it a poor fit for MXNet 
with the Python interface.


First, valgrind generates a huge number of irrelevant messages, most of them 
from Python itself.
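
For reference, one commonly cited way to cut down the interpreter noise is to 
pass valgrind a suppression file and make Python allocate through plain 
malloc(). The sketch below only builds the command line and environment. It 
assumes Python 3.6 or later (for PYTHONMALLOC) and a local copy of CPython's 
Misc/valgrind-python.supp; the helper name and the script name are 
hypothetical, and this may well not remove every irrelevant message.

```python
import os
import sys


def valgrind_python_command(script, suppressions="valgrind-python.supp"):
    """Build argv and env for running a Python script under valgrind.

    `suppressions` is an assumed local path to a copy of CPython's
    Misc/valgrind-python.supp; adjust it for your own checkout.
    """
    argv = [
        "valgrind",
        "--suppressions=" + suppressions,  # hide known-benign interpreter reports
        sys.executable,
        script,
    ]
    env = dict(os.environ)
    # Python 3.6+: bypass pymalloc so valgrind sees ordinary malloc()/free() calls.
    env["PYTHONMALLOC"] = "malloc"
    return argv, env


if __name__ == "__main__":
    # "my_test.py" is a placeholder script name, not a file from this thread.
    argv, env = valgrind_python_command("my_test.py")
    print("PYTHONMALLOC=" + env["PYTHONMALLOC"], " ".join(argv))
```

The returned pair can be handed to subprocess.run(argv, env=env) on a machine 
where valgrind is installed.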

Second, valgrind can't emulate all CPU instructions. I remember that when I ran 
valgrind with MXNet, it exited with a strange error. I later found that it was 
caused by an unsupported CPU instruction.

Third, valgrind doesn't support multithreading well. As far as I know, valgrind 
runs only one thread at a time, even if the program is multi-threaded. An 
error like this one, which is likely caused by a race condition, can't be 
caught by valgrind.

I used to use Address Sanitizer for memory errors. This tool is much faster and 
works with multi-threaded programs. However, it doesn't work with Python for 
some reason.
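
A possible explanation, offered as an assumption rather than a diagnosis: the 
python binary itself is normally not built with Address Sanitizer, so the ASan 
runtime is not loaded before an instrumented extension library is. A common 
workaround is to preload the runtime with LD_PRELOAD. The sketch below assumes 
Linux, a GCC toolchain, and a library already built with -fsanitize=address; 
the helper names and the script name are hypothetical.

```python
import os
import subprocess
import sys


def find_libasan(compiler="gcc"):
    """Ask the compiler where its ASan runtime lives (assumes a GCC toolchain)."""
    result = subprocess.run(
        [compiler, "-print-file-name=libasan.so"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def asan_env(libasan_path, base_env=None):
    """Return an environment that preloads the ASan runtime into the process."""
    env = dict(os.environ if base_env is None else base_env)
    # The runtime must be loaded before any instrumented shared library.
    env["LD_PRELOAD"] = libasan_path
    # Turn off leak detection so interpreter-level leak reports do not
    # bury the memory errors we are actually looking for.
    env["ASAN_OPTIONS"] = "detect_leaks=0"
    return env


if __name__ == "__main__":
    # "my_test.py" is a placeholder script name, not a file from this thread.
    env = asan_env(find_libasan())
    subprocess.run([sys.executable, "my_test.py"], env=env)
```

Whether this is enough for a build of MXNet is untested here; it only shows 
the shape of the workaround.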

One thing we could potentially do is run a memory checker on the C++ unit 
tests. I'm not sure it would cover all the memory errors we want to catch.

Best,
Da

On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.li...@gmail.com> wrote:

    It's very difficult to reproduce, non-deterministic. We were also running
    without signal handlers in CI so there are no stack traces unfortunately.
    
    Care to elaborate why valgrind doesn't work with Python?
    
    
    
    On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1...@gmail.com> wrote:
    
    > can we build it in CI? segfault doesn't happen infrequently.
    >
    > On May 2, 2018, at 11:34 PM, "Chris Olivier" <cjolivie...@gmail.com> wrote:
    >
    > > you can try Intel Inspector, which is like an enhanced version of
    > > valgrind with a GUI and whatnot.
    > >
    > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1...@gmail.com> wrote:
    > >
    > > > valgrind doesn't work with Python. also, valgrind doesn't support some
    > > > CPU instructions used by MXNet (I think some instructions related to
    > > > random generator).
    > > >
    > > >
    > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavintha...@gmail.com>
    > > > wrote:
    > > > > Have you tried running with valgrind to get some clues on the
    > > > > root-cause?
    > > > >
    > > > > Bhavin Thaker.
    > > > >
    > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1...@gmail.com>
    > > > > wrote:
    > > > >
    > > > >> It might also be possible that this isn't an MKLDNN bug.
    > > > >> I just saw a similar memory error without MKLDNN build.
    > > > >>
    > > > >>
    > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
    > > > >>
    > > > >> Best,
    > > > >> Da
    > > > >>
    > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzz...@amazon.com>
    > > > >> wrote:
    > > > >> > There might be a race condition that causes the memory error.
    > > > >> > It might be caused by this PR:
    > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
    > > > >> > This PR removes MKLDNN memory from NDArray.
    > > > >> > However, I don't know why this causes memory error. If someone is
    > > > >> > using the memory, it should still hold the memory with shared pointer.
    > > > >> > But I do see the memory error increase after this PR is merged.
    > > > >> >
    > > > >> > Best,
    > > > >> > Da
    > > > >> >
    > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.li...@gmail.com>
    > > > >> > wrote:
    > > > >> >
    > > > >> >     I couldn't reproduce locally with:
    > > > >> >
    > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
    > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
    > > > >> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
    > > > >> >
    > > > >> >
    > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.li...@gmail.com>
    > > > >> >     wrote:
    > > > >> >
    > > > >> >     > Hi
    > > > >> >     >
    > > > >> >     > Seems master is not running anymore, there's a segmentation
    > > > >> >     > fault using MKLDNN-CPU
    > > > >> >     >
    > > > >> >     >
    > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
    > > > >> >     >
    > > > >> >     >
    > > > >> >     > I see my PRs failing with a similar error.
    > > > >> >     >
    > > > >> >     > Pedro
    > > > >> >     >
    > > > >> >
    > > > >> >
    > > > >>
    > > >
    > >
    >
