Bisect identifies 
https://github.com/apache/incubator-mxnet/commit/425319cb59904573bd3fe1b6fe0a7381eceb9bbd

Thus this is an issue with jemalloc + LLVM's OpenMP library.

The correct reproducer for the latest master branch is:

  git clone --recursive https://github.com/apache/incubator-mxnet/ mxnet
  cd mxnet
  # workaround: https://github.com/apache/incubator-mxnet/issues/17514
  git checkout a726c406964b9cd17efa826738a662e09d973972
  mkdir build; cd build
  cmake -DUSE_CPP_PACKAGE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja \
    -DUSE_CUDA=OFF -DUSE_JEMALLOC=ON ..
  ninja
  ./cpp-package/example/test_regress_label  # run 2-3 times to reproduce
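
Since the failure is intermittent, the repeated runs in the reproducer above can be scripted. The following is only a sketch: count_failures is a hypothetical helper (not part of MXNet), and it assumes the problem shows up as a non-zero exit status, which includes a process killed by a signal.

```shell
#!/bin/sh
# Hypothetical helper: run a command several times and print how many
# of those runs exited with a non-zero status.
# Usage: count_failures <runs> <command> [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's own output; only the exit status is used.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

From the build directory this could be called as

  count_failures 10 ./cpp-package/example/test_regress_label

which prints the number of failing runs out of 10.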

Let's move the discussion about fixing the jemalloc/OpenMP incompatibility
to https://github.com/apache/incubator-mxnet/issues/17043



@Chris, could you look into this issue as it only happens with LLVM OpenMP?
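
To confirm which OpenMP runtime a given build links against, the output of ldd can be filtered for the usual library names (libomp for LLVM, libgomp for GNU, libiomp5 for Intel). This is a minimal sketch assuming a Linux system; openmp_runtimes is a hypothetical helper, and the library path in the usage comment is an assumption about the build layout.

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and print each OpenMP
# runtime library name found, one per line, without duplicates.
#   libomp.so   = LLVM OpenMP
#   libgomp.so  = GNU OpenMP
#   libiomp5.so = Intel OpenMP
openmp_runtimes() {
    grep -oE 'lib(omp|gomp|iomp5)\.so' | sort -u
}

# Usage (the path to libmxnet.so is an assumption; adjust as needed):
#   ldd ./libmxnet.so | openmp_runtimes
```

If more than one name is printed, the binary pulls in several OpenMP runtimes at once, which is worth noting in the issue.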



@Przemek: For the 1.6.0 release notes, I suggest including a recommendation to
set USE_JEMALLOC=OFF when compiling from source.

This note should probably be added in any case, as building with USE_JEMALLOC=ON
is broken on Ubuntu 18.10 and higher, as well as Debian Stable.

Given these release notes, +1 for the release.


Best regards
Leonard

On Tue, 2020-02-04 at 22:26 +0000, Lausen, Leonard wrote:
> Actually, the reproducer below is wrong. The issue was apparently fixed on master
> recently. I'm running an automated bisect and will report the result later.
> 
> On Tue, 2020-02-04 at 21:44 +0000, Lausen, Leonard wrote:
> > Hi Chris,
> > 
> > you previously found and fixed an OMP race condition during fork at
> > https://github.com/apache/incubator-mxnet/pull/17039
> > 
> > This time no forks are involved. Could you run the following reproducer on
> > master branch:
> > 
> >   git clone --recursive https://github.com/apache/incubator-mxnet/ mxnet
> >   cd mxnet
> >   # workaround: https://github.com/apache/incubator-mxnet/issues/17514
> >   git checkout a726c406964b9cd17efa826738a662e09d973972
> >   mkdir build; cd build
> >   cmake -DUSE_CPP_PACKAGE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja \
> >     -DUSE_CUDA=OFF ..
> >   ninja
> >   ./cpp-package/example/test_regress_label  # run 2-3 times to reproduce
> > 
> > 
> > As you are an OpenMP expert, you may be able to identify the root cause with
> > relative ease.
> > 
> > Thank you,
> > 
> > Leonard
> > 
> > On Tue, 2020-02-04 at 11:06 -0800, Chris Olivier wrote:
> > > When "fixing", please "fix" through actual root-cause analysis (use gdb,
> > > for instance) and not simply by guesswork and cutting out things which
> > > probably aren't actually at fault (blaming an OMP library that's in
> > > > worldwide distribution in the billions should be treated with great
> > > skepticism).
> > > 
> > > On Tue, Feb 4, 2020 at 10:44 AM Lin Yuan <apefor...@gmail.com> wrote:
> > > 
> > > > Pedro,
> > > > 
> > > > > While I agree with you we need to fix this usability issue, I don't
> > > > > think this is a release blocker as Przemek mentioned above. Could we
> > > > > fix this in the next minor release?
> > > > 
> > > > Thanks,
> > > > 
> > > > Lin
> > > > 
> > > > > On Tue, Feb 4, 2020 at 10:38 AM Pedro Larroy <pedro.larroy.li...@gmail.com>
> > > > > wrote:
> > > > 
> > > > > > Right. Would it be possible to have the CMake build also use libgomp
> > > > > > for consistency with the releases until these issues are resolved?
> > > > > This can affect anyone compiling the distribution with CMake and also
> > > > > happens randomly in CI, worsening the contributor experience due to CI
> > > > > failures.
> > > > > 
> > > > > On Tue, Feb 4, 2020 at 9:33 AM Przemysław Trędak <ptre...@apache.org>
> > > > > wrote:
> > > > > 
> > > > > > Hi Pedro,
> > > > > > 
> > > > > > > From the issue that you linked it seems that you are using the LLVM
> > > > > > > OpenMP, whereas I believe the actual release uses libgomp (at least
> > > > > > > that's what seems to be the conclusion from this issue:
> > > > > > > https://github.com/apache/incubator-mxnet/issues/16891)?
> > > > > > 
> > > > > > Przemek
> > > > > > 
> > > > > > On 2020/02/04 03:42:30, Pedro Larroy <pedro.larroy.li...@gmail.com>
> > > > > > wrote:
> > > > > > > -1
> > > > > > > 
> > > > > > > Unit tests passed in CPU build.
> > > > > > > 
> > > > > > > I observe crashes related to openmp using cpp unit tests:
> > > > > > > 
> > > > > > > https://github.com/apache/incubator-mxnet/issues/17043
> > > > > > > 
> > > > > > > Pedro.
> > > > > > > 
> > > > > > > > On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat <chai.ba...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > +1
> > > > > > > > Successfully built MXNet 1.6.0rc2 on Linux
> > > > > > > > Tested for OpPerf utility
> > > > > > > > > For CPU -
> > > > > > > > > https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
> > > > > > > > Works well!
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On Mon, 3 Feb 2020 at 15:43, Lin Yuan <apefor...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > 
> > > > > > > > > +1
> > > > > > > > > 
> > > > > > > > > > Tested Horovod with mnist example. My compiler flags are below:
> > > > > > > > > >
> > > > > > > > > > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE,
> > > > > > > > > > ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A,
> > > > > > > > > > ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC,
> > > > > > > > > > ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK,
> > > > > > > > > > ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔ DIST_KVSTORE, ✖ CXX14,
> > > > > > > > > > ✖ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖ DEBUG, ✖ TVM_OP]
> > > > > > > > > 
> > > > > > > > > Lin
> > > > > > > > > 
> > > > > > > > > On Sat, Feb 1, 2020 at 9:55 PM Tao Lv <ta...@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > > +1
> > > > > > > > > > 
> > > > > > > > > > I tested below items:
> > > > > > > > > > 1. download artifacts from Apache dist repo;
> > > > > > > > > > 2. the signature looks good;
> > > > > > > > > > 3. build from source code with MKL-DNN and MKL on centos;
> > > > > > > > > > > 4. run fp32 and int8 inference of ResNet50 under
> > > > > > > > > > >    /example/quantization/.
> > > > > > > > > > thanks,
> > > > > > > > > > -tao
> > > > > > > > > > 
> > > > > > > > > > > On Sun, Feb 2, 2020 at 11:00 AM Tao Lv <ta...@apache.org> wrote:
> > > > > > > > > > > > I see. I was looking at this page:
> > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > > > > > > > > > >
> > > > > > > > > > > > On Sun, Feb 2, 2020 at 4:54 AM Przemysław Trędak <ptre...@apache.org>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > Hi Tao,
> > > > > > > > > > > > 
> > > > > > > > > > > > > Could you tell me where did you look for it and did not find it?
> > > > > > > > > > > > > I just checked and both
> > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> > > > > > > > > > > > > and draft of the release on GitHub have them.
> > > > > > > > > > > > 
> > > > > > > > > > > > Thank you
> > > > > > > > > > > > Przemek
> > > > > > > > > > > > 
> > > > > > > > > > > > On 2020/02/01 14:23:11, Tao Lv <ta...@apache.org> wrote:
> > > > > > > > > > > > > > It seems the src tar and signature are missing from the tag.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Jan 31, 2020 at 11:09 AM Przemysław Trędak <ptre...@apache.org>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Dear MXNet community,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This is the vote to release Apache MXNet (incubating) version 1.6.0.
> > > > > > > > > > > > > > > Voting starts today and will close on Monday 2/3/2020 23:59 PST.
> > > > > > > > > > > > > > > Link to release notes:
> > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
> > > > > > > > > > > > > > > Link to release candidate:
> > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > > > > > > > > > > > > > Link to source and signatures on apache dist server:
> > > > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> > > > > > > > > > > > > > > The differences comparing to previous release candidate 1.6.0.rc1:
> > > > > > > > > > > > > > >  * Fixes for license issues (#17361, #17375, #17370, #17460)
> > > > > > > > > > > > > > >  * Bugfix for saving LSTM layer parameter (#17288)
> > > > > > > > > > > > > > >  * Bugfix for downloading the model from model zoo from multiple
> > > > > > > > > > > > > > >    processes (#17372)
> > > > > > > > > > > > > > >  * Fixed a symbol.py in AMP for GluonNLP (#17408)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Please remember to TEST first before voting
> > > > > > > > > > > > > > accordingly:
> > > > > > > > > > > > > > +1 = approve
> > > > > > > > > > > > > > +0 = no opinion
> > > > > > > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > Przemyslaw Tredak
> > > > > > > > > > > > > > 
> > > > > > > > 
> > > > > > > > --
> > > > > > > > *Chaitanya Prakash Bapat*
> > > > > > > > *+1 (973) 953-6299*
> > > > > > > > 
> > > > > > > > 
