Nobody claimed that the original lockup was related to OMP; the point is that the fix introduced re-entrancy into OMP initialization, as explained below. So I agree with your statement that the bug the pthread_atfork handler was fixing is not related to OMP, but the fix itself is causing interactions with OMP, as described above.
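To make the re-entrancy concrete, here is a minimal sketch of the mechanism (illustration only, not the actual initialize.cc code; the handler body is hypothetical): a pthread_atfork child handler that touches the OpenMP runtime forces OMP to initialize again inside the forked process.

// Minimal illustration only -- not MXNet code. Build with e.g.
//   g++ -fopenmp atfork_omp.cc -o atfork_omp && ./atfork_omp
#include <omp.h>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

static void on_fork_child() {
  // Anything here that touches the OMP runtime (querying thread counts,
  // opening a parallel region, restarting an engine that uses OMP) makes
  // the runtime initialize again in the child -- the re-entrancy above.
  std::printf("child handler: omp_get_max_threads() = %d\n",
              omp_get_max_threads());
}

int main() {
  pthread_atfork(/*prepare=*/nullptr, /*parent=*/nullptr, on_fork_child);

  // Parent initializes the OMP runtime once.
  #pragma omp parallel
  {}

  pid_t pid = fork();  // the child handler above runs in the new process
  if (pid == 0) _exit(0);
  waitpid(pid, nullptr, 0);
  return 0;
}

Whether the second initialization merely asserts, adds overhead, or misbehaves depends on which OMP runtime is linked, which matches the libgomp-vs-LLVM/Intel discussion further down the thread.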
Pedro. On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier <[email protected]> wrote: > > The call stacks there are mostly associated with the execution engine > threads, which are not OMP threads. That lockup doesn't look to me to be > related to OMP -- the execution engine uses its own thread pool logic -- > I'm pretty familiar with that part of the code. Unless I am missing one -- > can you point to the one that looks OMP-related? > > > On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy <[email protected]> > wrote: > > > Thanks for digging that out Kellen. That's good info so maybe it would > > be good to rework the fix with the info you provided and remove the > > pthread_atfork handlers. > > Do you think setting the device would avoid the problem seen on the > > backtrace you provided? specifically here: > > > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24 > > > > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland > > <[email protected]> wrote: > > > > > > I remember at the time we also had a read through of this blog post, but > > to > > > us the code looked like it was following the advice: > > > > > https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/ > > > > > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland < > > > [email protected]> wrote: > > > > > > > I remember this hang as well, it was pretty hard to reproduce IIRC. I > > > > believe the stacks for the hang are here: > > > > > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 > > and > > > > the trick was we could only debug it up to the point that we hit: > > > > > > > > #0 0x00007fec6df1ba4f in futex_wait (private=0, expected=1, > > > > futex_word=0x7fec60843758) > > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61 > > > > #1 futex_wait_simple (private=0, expected=1, > > futex_word=0x7fec60843758) > > > > at ../sysdeps/nptl/futex-internal.h:135 > > > > #2 __pthread_once_slow (once_control=0x7fec60843758, > > > > init_routine=0x7fec605f38f0) > > > > at pthread_once.c:105 > > > > ... > > > > #6 0x00007fec6061c577 in cudaSetDevice () from > > > > /usr/local/cuda/lib64/libcudart.so.9.0 > > > > > > > > because the code in libcudart is obviously closed source we couldn't > > dig > > > > into what threading work was going on when we called cudaSetDevice. > > > > > > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy < > > [email protected]> > > > > wrote: > > > > > > > >> If you check initialize.cc we seem to be explicitly disabling that > > > >> behaviour in pthread_at_fork which seems to cause thread contention > > > >> during multiprocessing. Why do we need this major advantage for the > > > >> library if that's the case? > > > >> > > > >> Related PRs: > > > >> > > > >> https://github.com/apache/incubator-mxnet/pull/10820 > > > >> https://github.com/apache/incubator-mxnet/issues/14396 > > > >> > > > >> The original code was authored in this PR: > > > >> > > > >> https://github.com/apache/incubator-mxnet/pull/8677 > > > >> > > > >> I actually remember this fix, it was done during a release as the cuda > > > >> runtime was forking and the engine was being re-entered. If that > > > >> situation is not happening anymore it might not be needed any longer. > > > >> I don't think we know the cause why there was a fork inside cuda, so > > > >> the code has grown around a fix for an issue whose root cause was > > > >> not understood, and side effects which this fix caused afterwards.
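For reference, the per-thread device pattern recommended by the NVIDIA post Kellen links above (and which Pedro's "setting the device" question refers to) looks roughly like the following. This is a sketch only: device 0, the thread count and the allocation are placeholders, not MXNet code.

// Sketch of the "always set the current device in every host thread"
// pattern from the linked NVIDIA post; placeholders, not MXNet code.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

void worker(int device_id) {
  // The current device is per-host-thread state, so each thread that
  // issues CUDA calls should select its device explicitly first.
  cudaError_t err = cudaSetDevice(device_id);
  if (err != cudaSuccess) {
    std::printf("cudaSetDevice failed: %s\n", cudaGetErrorString(err));
    return;
  }
  void* buf = nullptr;
  if (cudaMalloc(&buf, 1 << 20) == cudaSuccess) {
    // ... launch kernels / copies on this device ...
    cudaFree(buf);
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) threads.emplace_back(worker, /*device_id=*/0);
  for (auto& t : threads) t.join();
  return 0;
}

Whether doing this consistently would have avoided the futex wait inside cudaSetDevice in the backtrace above is exactly the open question; since libcudart is closed source, that part remains speculative.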
> > > >> > > > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in > > > >> the link above, no libgomp. > > > >> > > > >> I didn't try the Make build. > > > >> > > > >> I would refactor the code linked above and stop using pthread_at_fork, > > > >> since OMP assumes it won't be initialized twice, but needs to be very > > > >> well tested to make sure it doesn't cause bugs or affect the fixes > > > >> done on the linked PRs above. > > > >> > > > >> Pedro. > > > >> > > > >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <[email protected]> > > > >> wrote: > > > >> > > > > >> > one major advantage of intel/llvm omp is that it spawns a new thread > > > >> pool > > > >> > after fork if a thread pool was already created. this is so that omp > > > >> can be > > > >> > used in the forked processes. libgomp doesn’t do this so it’ll just > > > >> lock up > > > >> > if you try to do omp in the forked process. > > > >> > > > > >> > is your build linking libgomp as well? > > > >> > > > > >> > standard mkl build (from Makefile) uses same omp library. are there > > > >> > problems with that build? > > > >> > > > > >> > what changes need to be made to make the assertion not fire? > > > >> > > > > >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy < > > > >> [email protected]> > > > >> > wrote: > > > >> > > > > >> > > There's an assertion which is easily reproducible, and also > > there's a > > > >> > > crash including core dump, the latter is not easy to reproduce > > for me > > > >> > > in different environments. I have also seen mxnet getting stuck > > > >> > > without progressing with this build configuration and using no > > CPU at > > > >> > > all when running unit tests. > > > >> > > > > > >> > > In my view, the root cause of the assertion is that we are > > re-entering > > > >> > > OMP initialization when spawning threads on the following code > > through > > > >> > > pthread_at_fork > > > >> > > > > > >> > > > > > >> > > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58 > > > >> > > > > > >> > > This causes double initialization of the OMP engine, including the > > > >> > > assertion which you are asking about, and I suspect some > > additional > > > >> > > overhead. That's the shady forking part you are asking for. > > > >> > > > > > >> > > A question for you: What is the cause of runtime differences > > between > > > >> > > OMP runtimes? Shouldn't the implementation overhead diminish as > > > >> > > threads run longer? > > > >> > > > > > >> > > Pedro. > > > >> > > > > > >> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier < > > [email protected]> > > > >> > > wrote: > > > >> > > > > > > >> > > > What’s the reason for the assertion failure? btw classifying an > > > >> assertion > > > >> > > > failure a “crash” is debatable. As I stated in the original > > issue a > > > >> long > > > >> > > > time ago, it’s possible something shady is being done with when > > > >> forking > > > >> > > > that should be fixed. The assertion should be root caused. > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy < > > > >> > > [email protected]> > > > >> > > > wrote: > > > >> > > > > > > >> > > > > Added a dockerfile, and reports of a crash in my local machine > > > >> when > > > >> > > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened > > as > > > >> well. > > > >> > > > > I couldn't reproduce the crash on my EC2 machine: > > > >> > > > > Added the backtrace of the crash as well. 
> > > >> > > > > > > > >> > > > > https://github.com/apache/incubator-mxnet/issues/10856 > > > >> > > > > > > > >> > > > > Dockerfile here: > > > >> > > > > > > > >> > > > > https://github.com/larroy/mxnet_omp > > > >> > > > > > > > >> > > > > Kind regards. > > > >> > > > > > > > >> > > > > Pedro. > > > >> > > > > > > > >> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu < > > > >> > > [email protected]> > > > >> > > > > wrote: > > > >> > > > > > > > > >> > > > > > As already proposed, I think the easiest way to get a common > > > >> > > > > understanding > > > >> > > > > > is if we start with a few docker containers. Pedro, would > > it be > > > >> > > possible > > > >> > > > > > for you to wrap your benchmarks into a few containers that > > will > > > >> > > produce > > > >> > > > > > your shown results? That way, we can avoid possible > > > >> > > misunderstandings and > > > >> > > > > > also pinpoint the exact parts where people disagree or > > > >> misunderstood > > > >> > > each > > > >> > > > > > other. > > > >> > > > > > > > > >> > > > > > -Marco > > > >> > > > > > > > > >> > > > > > Pedro Larroy <[email protected]> schrieb am Do., > > > >> 20. Juni > > > >> > > > > 2019, > > > >> > > > > > 21:47: > > > >> > > > > > > > > >> > > > > > > I can confirm that we are linking with two versions of > > omp, > > > >> I'm > > > >> > > > > > > gaining more clarity into this topic, but I have still > > > >> questions, > > > >> > > the > > > >> > > > > > > facts that I got so far are the folllowing: > > > >> > > > > > > > > > >> > > > > > > * #1: We are linking with two versions of omp, intel's omp > > > >> and llvm > > > >> > > > > > > openmp when building with MKL enabled. > > > >> > > > > > > * #2: We have 3 different possible OMP versions: Intel OMP > > > >> (comes > > > >> > > with > > > >> > > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with > > gcc) > > > >> (This > > > >> > > > > > > one is used on the PR proposed by Anton). > > > >> > > > > > > > > > >> > > > > > > Questions: > > > >> > > > > > > > > > >> > > > > > > * #1 Is it ok to have two versions of openmp linked at > > the > > > >> same > > > >> > > time? > > > >> > > > > > > * #2 Which implementation of OMP gives the best > > > >> performance? (See > > > >> > > > > > > total training time of my measurement for a partial > > answer) > > > >> > > > > > > * #3 Should we have a build flag so we can choose the OMP > > > >> version > > > >> > > at > > > >> > > > > > > runtime? > > > >> > > > > > > * #4 Which Compiler and build flags did Chris use to get > > 10x > > > >> > > slowdown? > > > >> > > > > > > * #5 @Stas: is there a script to replicate your > > benchmarks > > > >> > > easily? If > > > >> > > > > > > so could you provide a link? I think we would need to > > > >> reproduce > > > >> > > your > > > >> > > > > > > benchmarks and verify which versions are being linked. > > It's > > > >> > > possible > > > >> > > > > > > that while compiling with MKL intel's omp was pulled in > > > >> instead of > > > >> > > > > > > GNU OpenMP. > > > >> > > > > > > * #6 @Chris: how to maintain the copy of LLVM's Openmp? > > > >> Should we > > > >> > > > > > > update the subrepo regularly? > > > >> > > > > > > > > > >> > > > > > > My conclusion so far: > > > >> > > > > > > > > > >> > > > > > > * #1 We should avoid linking two versions of omp if > > possible > > > >> and > > > >> > > > > > > allow users to choose one in the build as we do for BLAS. 
> > > >> > > > > > > * #2 For performance reasons and more control vs > > different > > > >> > > compiler > > > >> > > > > > > versions seems it makes indeed sense to keep the LLVM > > OpenMP > > > >> > > version > > > >> > > > > > > in 3rdparty for now. So unless some more data is > > gathered, it > > > >> makes > > > >> > > > > > > sense not to remove it as of now. > > > >> > > > > > > * #3 We should provide build options to choose which > > openmp > > > >> > > library > > > >> > > > > > > is to be used from the three options available, including > > > >> libgomp. > > > >> > > > > > > * #4 Refining the build we could also enable OpenMP in > > mac > > > >> without > > > >> > > > > > > additional contortions (doesn't work as of today): > > > >> > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/ > > > >> > > > > > > * #5 We should add different omp versions to our > > benchmarks > > > >> and > > > >> > > track > > > >> > > > > > > the performance, so this data is available for prescribing > > > >> the best > > > >> > > > > > > build options and for binary releases. > > > >> > > > > > > > > > >> > > > > > > This is also an interesting related gh issue posted in the > > > >> mkl-dnn > > > >> > > > > > > repository: https://github.com/intel/mkl-dnn/issues/230 > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > I don't observe the order of magnitude divergence > > reported by > > > >> > > Chris in > > > >> > > > > > > vanilla Ubuntu 18.04 in samples / s but the full training > > > >> finishes > > > >> > > > > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs > > > >> libgomp. > > > >> > > > > > > > > > >> > > > > > > There's also differences in training time when using MKL > > and > > > >> the , > > > >> > > > > > > it's actually a bit slower, I don't know if it's related > > to > > > >> OMP. 
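On question #1 / conclusion #1 above (two OMP runtimes in one process): besides the ldd output quoted below, a small runtime probe can confirm what actually ends up loaded. A Linux-only sketch, essentially ldd at runtime, matching library names against the three runtimes discussed in this thread:

// Diagnostic sketch (Linux-specific): list every OpenMP runtime loaded
// into the current process, to catch the "two OMP libraries at once"
// situation discussed in this thread. Usage, with a hypothetical path:
//   ./omp_probe build/libmxnet.so        (link with -ldl)
#include <dlfcn.h>
#include <link.h>
#include <cstdio>
#include <cstring>

static int print_omp_libs(struct dl_phdr_info* info, size_t, void*) {
  const char* name = info->dlpi_name;
  if (name && (std::strstr(name, "libomp") || std::strstr(name, "libgomp") ||
               std::strstr(name, "libiomp"))) {
    std::printf("OpenMP runtime loaded: %s\n", name);
  }
  return 0;  // continue iterating over loaded objects
}

int main(int argc, char** argv) {
  if (argc > 1 && !dlopen(argv[1], RTLD_NOW | RTLD_GLOBAL)) {
    std::printf("dlopen failed: %s\n", dlerror());
    return 1;
  }
  dl_iterate_phdr(print_omp_libs, nullptr);
  return 0;
}

On a build like the MKL-ON one below it should report both libomp.so and libiomp5.so, and only libgomp.so.1 on Anton's branch, consistent with the ldd output.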
> > > >> > > > > > > > > > >> > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) > > > >> > > > > > > > > > >> > > > > > > Anton's branch: [email protected]:lebeg/incubator-mxnet.git > > > >> branch > > > >> > > > > 'omp' > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd > > > >> > > > > > > build/libmxnet.so |grep -i omp > > > >> > > > > > > libgomp.so.1 => > > /usr/lib/x86_64-linux-gnu/libgomp.so.1 > > > >> > > > > > > (0x00007fd99a51d000) > > > >> > > > > > > > > > >> > > > > > > time python train_mnist.py > > > >> > > > > > > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176 > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 > > > >> samples/sec > > > >> > > > > > > accuracy=1.000000 > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999531 > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 > > > >> samples/sec > > > >> > > > > > > accuracy=1.000000 > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999375 > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999531 > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717 > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.219 > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977 > > > >> > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU > > > >> (0avgtext+0avgdata > > > >> > > > > > > 1146052maxresident)k > > > >> > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps > > > >> > > > > > > > > > >> > > > > > > Master, MKL ON: > > > >> > > > > > > > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification > > > >> [master]> > > > >> > > ldd > > > >> > > > > > > ../../build/libmxnet.so | grep -i omp > > > >> > > > > > > libomp.so => > > > >> > > > > > > > > > >> > > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so > > > >> > > > > > > (0x00007f05ba38f000) > > > >> > > > > > > libiomp5.so => > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > >> > > /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so > > > >> > > > > > > (0x00007f05b09f4000) > > > >> > > > > > > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484 > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100] Speed: 36651.63 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999691 > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 45093.98 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 45146.84 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 45119.90 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > 
> > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 44998.96 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999531 > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 45072.25 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 44969.79 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 44962.78 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 44945.47 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999375 > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717 > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.367 > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783 > > > >> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU > > > >> (0avgtext+0avgdata > > > >> > > > > > > 1154348maxresident)k > > > >> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > MKL OFF: > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> > > grep -i > > > >> MKL > > > >> > > > > > > cmake_options.yml > > > >> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found > > > >> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL > > > >> found) IF > > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE) > > > >> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL > > found) > > > >> IF > > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE) > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd > > > >> > > > > > > build/libmxnet.so |grep -i omp > > > >> > > > > > > libomp.so => > > > >> > > > > > > > > > >> > > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so > > > >> > > > > > > (0x00007fb720c54000) > > > >> > > > > > > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479 > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100] Speed: 46784.02 > > > >> samples/sec > > > >> > > > > > > accuracy=1.000000 > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 48824.29 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 49190.31 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 51518.77 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 51551.62 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 49026.35 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 49002.46 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999375 > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 48980.55 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 47402.56 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767 > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.259 > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181 > > > >> > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU > > > >> (0avgtext+0avgdata > > > >> > > > 
> > > 1147008maxresident)k > > > >> > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps > > > >> > > > > > > > > > >> > > > > > > Let me know what you think. > > > >> > > > > > > > > > >> > > > > > > Link to the original PR: > > > >> > > > > > > https://github.com/apache/incubator-mxnet/pull/12160 > > > >> > > > > > > > > > >> > > > > > > Thanks. > > > >> > > > > > > > > > >> > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland > > > >> > > > > > > <[email protected]> wrote: > > > >> > > > > > > > > > > >> > > > > > > > "if you’re linking in two then you’re doing something > > > >> wrong." > > > >> > > > > Correct, > > > >> > > > > > > > that's one thing I believe we've got consensus on. So > > > >> let's call > > > >> > > > > that > > > >> > > > > > > out > > > >> > > > > > > > as a bug to be fixed. > > > >> > > > > > > > > > > >> > > > > > > > Let's move forward with some reproducible numbers and > > then > > > >> > > discuss > > > >> > > > > the > > > >> > > > > > > pros > > > >> > > > > > > > / cons of which particular OMP implementation we should > > use. > > > >> > > > > > > > > > > >> > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy < > > > >> > > > > > > [email protected]> > > > >> > > > > > > > wrote: > > > >> > > > > > > > > > > >> > > > > > > > > Hi Chris > > > >> > > > > > > > > > > > >> > > > > > > > > I would ask you to have a bit of patience and help us > > > >> with your > > > >> > > > > > > > > experience in this matter. Nobody is ignoring > > anything, I > > > >> > > think we > > > >> > > > > are > > > >> > > > > > > > > individually gathering feedbacks and trying to > > understand > > > >> the > > > >> > > > > multiple > > > >> > > > > > > > > contributions done to this topic including yours, > > then go > > > >> step > > > >> > > by > > > >> > > > > > > > > step, understand what is going on and run experiments > > and > > > >> > > report > > > >> > > > > back > > > >> > > > > > > > > to the list or the corresponding github item. It was > > > >> suggested > > > >> > > by > > > >> > > > > > > > > Kellen to prepare some containers, this takes effort. > > > >> > > > > > > > > > > > >> > > > > > > > > Regarding your final comment, most of us also have > > many > > > >> other > > > >> > > > > things > > > >> > > > > > > > > to do and responsibilities even if our daytime jobs > > might > > > >> > > involve > > > >> > > > > > > > > MXNet in some form or another. I think that's part of > > the > > > >> > > privilege > > > >> > > > > > > > > and responsibility of working close with an open > > source > > > >> > > project and > > > >> > > > > > > > > the magic of collaboration across organizations. Let's > > > >> all be > > > >> > > > > patient > > > >> > > > > > > > > and take some time to understand and reason about this > > > >> topic > > > >> > > which > > > >> > > > > is > > > >> > > > > > > > > not simple. Since we decided to step back and gather > > more > > > >> data > > > >> > > > > let's > > > >> > > > > > > > > take time and do it properly. > > > >> > > > > > > > > > > > >> > > > > > > > > Personally I hope to find time to look again into this > > > >> issue > > > >> > > before > > > >> > > > > > > > > the end of the week. > > > >> > > > > > > > > > > > >> > > > > > > > > Thanks. > > > >> > > > > > > > > > > > >> > > > > > > > > Pedro. 
> > > >> > > > > > > > > > > > >> > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier < > > > >> > > > > [email protected]> > > > >> > > > > > > > > wrote: > > > >> > > > > > > > > > > > > >> > > > > > > > > > if you’re linking in two then you’re doing something > > > >> wrong. > > > >> > > You > > > >> > > > > can > > > >> > > > > > > see > > > >> > > > > > > > > by > > > >> > > > > > > > > > my email yesterday that only one is linked in. This > > is > > > >> also > > > >> > > the > > > >> > > > > case > > > >> > > > > > > with > > > >> > > > > > > > > > the mkl version built by the Makefile — only the > > Intel > > > >> OMP > > > >> > > > > library is > > > >> > > > > > > > > used > > > >> > > > > > > > > > (no libgomp). > > > >> > > > > > > > > > > > > >> > > > > > > > > > That being said, Do you have clear evidence that > > using > > > >> Intel > > > >> > > OMP > > > >> > > > > is > > > >> > > > > > > both > > > >> > > > > > > > > > problematic and the situation isn’t fixable? The > > > >> burden of > > > >> > > > > proof is > > > >> > > > > > > on > > > >> > > > > > > > > the > > > >> > > > > > > > > > ones requesting the change — it is not my > > > >> responsibility to > > > >> > > > > justify > > > >> > > > > > > the > > > >> > > > > > > > > > current state. There must be something “terrible” > > and > > > >> > > unfixable > > > >> > > > > to > > > >> > > > > > > > > justify > > > >> > > > > > > > > > a change. I have seen no proof of this in all this > > > >> time. > > > >> > > > > > > > > > > > > >> > > > > > > > > > On a side note, I mentioned a couple of things in my > > > >> email > > > >> > > > > yesterday > > > >> > > > > > > that > > > >> > > > > > > > > > still are not being responded to (they were also > > > >> ignored in > > > >> > > the > > > >> > > > > last > > > >> > > > > > > > > > incarnation of this “discussion” — I have much > > > >> experience in > > > >> > > this > > > >> > > > > > > matter > > > >> > > > > > > > > to > > > >> > > > > > > > > > assume “discussion” is a waste of my time, seeing > > and I > > > >> am > > > >> > > not > > > >> > > > > paid > > > >> > > > > > > to > > > >> > > > > > > > > > “work on” mxnet like y’all are). > > > >> > > > > > > > > > > > > >> > > > > > > > > > -C > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland < > > > >> > > > > > > > > > [email protected]> wrote: > > > >> > > > > > > > > > > > > >> > > > > > > > > > > I've also quite often seen two versions of OpenMP > > > >> linked. > > > >> > > I > > > >> > > > > think > > > >> > > > > > > we > > > >> > > > > > > > > can > > > >> > > > > > > > > > > all agree we probably want to avoid linking in two > > > >> > > libraries > > > >> > > > > that > > > >> > > > > > > do > > > >> > > > > > > > > > > effectively the same thing. > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > The performance questions should be fairly > > straight > > > >> > > forward to > > > >> > > > > > > > > demonstrate > > > >> > > > > > > > > > > right? Could we just collaborate on a few minimal > > > >> > > Dockerfiles > > > >> > > > > that > > > >> > > > > > > > > show > > > >> > > > > > > > > > > (or don't show) Intel OpenMP performance speedups > > > >> with the > > > >> > > > > > > workloads > > > >> > > > > > > > > Chris > > > >> > > > > > > > > > > is referencing? 
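On the reproducibility point: as one ingredient for such containers, a minimal OMP-bound loop is enough to compare runtimes in isolation. A hypothetical sketch, not the benchmark behind any of the numbers quoted in this thread:

// Hypothetical micro-benchmark sketch for comparing OpenMP runtimes;
// build with -fopenmp and time the same binary against different
// runtimes. Sizes and iteration counts are arbitrary placeholders.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 24;
  std::vector<float> x(n, 1.0f), y(n, 2.0f);
  double best = 1e30;
  for (int iter = 0; iter < 20; ++iter) {
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) y[i] += 0.5f * x[i];
    double dt = omp_get_wtime() - t0;
    if (dt < best) best = dt;
  }
  std::printf("threads=%d best=%.3f ms\n", omp_get_max_threads(), best * 1e3);
  return 0;
}

Building it once with gcc -fopenmp and then swapping the runtime via LD_PRELOAD (the LLVM/Intel runtime ships a GOMP-compatible interface, so this usually works) would isolate the OMP library itself from compiler and MKL differences.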
> > > >> > > > > > > > > > > > > > >> > > > > > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, > > Stanislav < > > > >> > > > > > > > > > > [email protected]> wrote: > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > Hi, Chris! > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Stas here - I've gathered that performance data. > > > >> > > > > > > > > > > > Sure thing, I can be wrong, but please > > elaborate a > > > >> bit on > > > >> > > > > what > > > >> > > > > > > we are > > > >> > > > > > > > > > > > missing. > > > >> > > > > > > > > > > > Be assured, intentional misdirection was never a > > > >> case. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Thanks a lot for being constructive. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > Turning Intel OMP on and off (and MKL as well, > > > >> since it > > > >> > > > > tends > > > >> > > > > > > to > > > >> > > > > > > > > pull > > > >> > > > > > > > > > > in > > > >> > > > > > > > > > > > omp, depending which one is linked in). > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > We never ever considered turning MKL off. We > > are on > > > >> the > > > >> > > same > > > >> > > > > page > > > >> > > > > > > > > here - > > > >> > > > > > > > > > > > MKL is crucial for the performance. > > > >> > > > > > > > > > > > Why should we? There's a GOMP-linked version of > > MKL, > > > >> > > that we > > > >> > > > > can > > > >> > > > > > > use. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > What we did - we measured, if using compilers > > > >> default > > > >> > > OpenMP > > > >> > > > > > > > > > > > implementation instead of referenced source code > > > >> > > > > distribution of > > > >> > > > > > > > > OpenMP > > > >> > > > > > > > > > > > makes anything slower. > > > >> > > > > > > > > > > > We have found the impact to be hardly > > measurable. > > > >> > > > > > > > > > > > The difference between GOMP and iOMP is <5% on > > our > > > >> > > > > benchmarks, > > > >> > > > > > > most > > > >> > > > > > > > > of > > > >> > > > > > > > > > > the > > > >> > > > > > > > > > > > time less than that. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > We just suggest to simplify the build of mxnet, > > by > > > >> > > removing > > > >> > > > > the > > > >> > > > > > > > > > > > unnecessary dependency. 
> > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > During that we discovered for example the > > following > > > >> > > amazing > > > >> > > > > > > issue: > > > >> > > > > > > > > > > > > > > >> https://github.com/apache/incubator-mxnet/issues/14087 > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Best Regards > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Stas > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > On 18.06.19, 18:24, "Chris Olivier" < > > > >> > > [email protected]> > > > >> > > > > > > wrote: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > I am very reluctant to feed the trolls > > again, > > > >> and > > > >> > > this > > > >> > > > > will > > > >> > > > > > > be > > > >> > > > > > > > > teh > > > >> > > > > > > > > > > last > > > >> > > > > > > > > > > > time I address Pedro or Anton on the > > subject, > > > >> but > > > >> > > since I > > > >> > > > > > > think > > > >> > > > > > > > > the > > > >> > > > > > > > > > > > numbers > > > >> > > > > > > > > > > > being presented are incorrect (either by te > > > >> builders > > > >> > > not > > > >> > > > > > > really > > > >> > > > > > > > > > > > understanding what they are building, or > > > >> possibly > > > >> > > > > intentional > > > >> > > > > > > > > > > > misdirection): > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Turning Intel OMP on and off (and MKL as > > well, > > > >> since > > > >> > > it > > > >> > > > > > > tends to > > > >> > > > > > > > > pull > > > >> > > > > > > > > > > > in > > > >> > > > > > > > > > > > omp, depending which one is linked in). > > > >> > > > > > > > > > > > There is a HUGE difference. This is > > consistent > > > >> with > > > >> > > my > > > >> > > > > > > > > experience > > > >> > > > > > > > > > > > before > > > >> > > > > > > > > > > > when it was added. 
> > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > default mnist: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > python > > > >> ../example/image-classification/train_mnist.py > > > >> > > > > > > > > > > > INFO:root:start with arguments > > > >> > > Namespace(add_stn=False, > > > >> > > > > > > > > > > batch_size=64, > > > >> > > > > > > > > > > > disp_batches=100, dtype='float32', > > > >> gc_threshold=0.5, > > > >> > > > > > > > > gc_type='none', > > > >> > > > > > > > > > > > gpus=None, image_shape='1, 28, 28', > > > >> > > > > initializer='default', > > > >> > > > > > > > > > > > kv_store='device', load_epoch=None, loss='', > > > >> lr=0.05, > > > >> > > > > > > > > lr_factor=0.1, > > > >> > > > > > > > > > > > lr_step_epochs='10', macrobatch_size=0, > > > >> > > > > model_prefix=None, > > > >> > > > > > > > > mom=0.9, > > > >> > > > > > > > > > > > monitor=0, network='mlp', num_classes=10, > > > >> > > num_epochs=20, > > > >> > > > > > > > > > > > num_examples=60000, num_layers=None, > > > >> optimizer='sgd', > > > >> > > > > > > > > > > > profile_server_suffix='', > > > >> profile_worker_suffix='', > > > >> > > > > > > > > save_period=1, > > > >> > > > > > > > > > > > test_io=0, top_k=0, warmup_epochs=5, > > > >> > > > > > > warmup_strategy='linear', > > > >> > > > > > > > > > > > wd=0.0001) > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > INTEL OMP: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > ldd libmxnet.so | grep omp > > > >> > > > > > > > > > > > libomp.so => > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > >> /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so > > > >> > > > > > > > > > > > (0x00007f978fde7000) > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > :root:Epoch[0] Batch [0-100] Speed: > > > >> 31548.09 > > > >> > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.780012 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [100-200] > > Speed: > > > >> > > 16073.21 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.920469 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [200-300] > > Speed: > > > >> > > 19075.91 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.928281 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [300-400] > > Speed: > > > >> > > 23211.36 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.942813 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [400-500] > > Speed: > > > >> > > 22139.79 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.938750 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [500-600] > > Speed: > > > >> > > 23225.52 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.946562 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [600-700] > > Speed: > > > >> > > 19547.41 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.953281 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [700-800] > > Speed: > > > >> > > 24111.73 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.951562 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [800-900] > > Speed: > > > >> > > 13959.88 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.957500 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Train-accuracy=0.925423 > > > >> > > > 
> > > > > > > > INFO:root:Epoch[0] Time cost=3.806 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] > > Validation-accuracy=0.962580 > > > >> > > > > > > > > > > > INFO:root:Epoch[1] Batch [0-100] > > Speed: > > > >> > > 24560.21 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.968131 > > > >> > > > > > > > > > > > INFO:root:Epoch[1] Batch [100-200] > > Speed: > > > >> > > 23457.03 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.966250 > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > LIBGOMP: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > ldd libmxnet.so | grep omp > > > >> > > > > > > > > > > > libgomp.so.1 => > > > >> > > > > > > /usr/lib/x86_64-linux-gnu/libgomp.so.1 > > > >> > > > > > > > > > > > (0x00007f25c25dd000) > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [0-100] > > Speed: > > > >> > > 1731.01 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.782488 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [100-200] > > Speed: > > > >> > > 3551.32 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.907813 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [200-300] > > Speed: > > > >> > > 1991.00 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.927188 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [300-400] > > Speed: > > > >> > > 2175.45 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.937969 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [400-500] > > Speed: > > > >> > > 1644.95 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.942187 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [500-600] > > Speed: > > > >> > > 6444.58 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.950156 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [600-700] > > Speed: > > > >> > > 7842.16 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.947969 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [700-800] > > Speed: > > > >> > > 9412.07 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.953750 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [800-900] > > Speed: > > > >> > > 12707.58 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.953125 > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > That being said, there's other issued beyond > > > >> speed. > > > >> > > The > > > >> > > > > > > DEFAULT > > > >> > > > > > > > > > > build > > > >> > > > > > > > > > > > from > > > >> > > > > > > > > > > > makefile (not CMake) uses Intel OMP mkl (I > > > >> showed > > > >> > > > > before) and > > > >> > > > > > > > > > > > mysteriously > > > >> > > > > > > > > > > > it has no issues? This seems highly > > suspicious. > > > >> > > All I > > > >> > > > > see > > > >> > > > > > > is a > > > >> > > > > > > > > lot > > > >> > > > > > > > > > > of > > > >> > > > > > > > > > > > hand-waving and conjecture and pointing to > > > >> > > StackOverflow > > > >> > > > > > > posts > > > >> > > > > > > > > made > > > >> > > > > > > > > > > by > > > >> > > > > > > > > > > > people who may be of questionable pedigree > > to > > > >> begin > > > >> > > with. 
> > > >> > > > > > > This > > > >> > > > > > > > > > > smells > > > >> > > > > > > > > > > > of a > > > >> > > > > > > > > > > > Pedro-ego-fight rather than one of purely > > > >> technical > > > >> > > > > merit. > > > >> > > > > > > > > Also, if > > > >> > > > > > > > > > > > one > > > >> > > > > > > > > > > > knows how OMP works, they would be very > > > >> suspicious > > > >> > > of > > > >> > > > > the > > > >> > > > > > > > > > > > "intermittent > > > >> > > > > > > > > > > > hangs" claim -- that's probably just broken > > race > > > >> > > > > conditions > > > >> > > > > > > > > elsewhere > > > >> > > > > > > > > > > > until > > > >> > > > > > > > > > > > proven differently. It'd tend freeze on the > > > >> first > > > >> > > use if > > > >> > > > > > > > > something > > > >> > > > > > > > > > > is > > > >> > > > > > > > > > > > wrong (try using libgomp after a fork and > > see), > > > >> since > > > >> > > > > worker > > > >> > > > > > > > > threads" > > > >> > > > > > > > > > > > wouldn't be assigned/joined properly. > > IntelOMP > > > >> is > > > >> > > > > faster, > > > >> > > > > > > but > > > >> > > > > > > > > also > > > >> > > > > > > > > > > has > > > >> > > > > > > > > > > > other advantages, such as allowing OMP > > after a > > > >> fork. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > I actually addressed a lot of issues and > > ask for > > > >> > > > > > > clarification > > > >> > > > > > > > > in the > > > >> > > > > > > > > > > > original PR's way back when, but they're all > > > >> just > > > >> > > > > ignored. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > -Chris > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > >> > > > > > > > >
