Nobody claimed that the original lockup was related to OMP; the point is that the fix introduced re-entrancy into OMP initialization, as explained below. So I agree with your statement that the bug the pthread_atfork handler was fixing is not related to OMP, but the fix itself is causing interactions with OMP, as described above.
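To make the re-entrancy concrete, here is a minimal sketch of the mechanism (illustration only, not the actual initialize.cc code; the handler body is hypothetical): a pthread_atfork child handler that touches the OpenMP runtime forces OMP to initialize again inside the forked process.

// Minimal illustration only -- not MXNet code. Build with e.g.
//   g++ -fopenmp atfork_omp.cc -o atfork_omp && ./atfork_omp
#include <omp.h>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

static void on_fork_child() {
  // Anything here that touches the OMP runtime (querying thread counts,
  // opening a parallel region, restarting an engine that uses OMP) makes
  // the runtime initialize again in the child -- the re-entrancy above.
  std::printf("child handler: omp_get_max_threads() = %d\n",
              omp_get_max_threads());
}

int main() {
  pthread_atfork(/*prepare=*/nullptr, /*parent=*/nullptr, on_fork_child);

  // Parent initializes the OMP runtime once.
  #pragma omp parallel
  {}

  pid_t pid = fork();  // the child handler above runs in the new process
  if (pid == 0) _exit(0);
  waitpid(pid, nullptr, 0);
  return 0;
}

Whether the second initialization merely asserts, adds overhead, or misbehaves depends on which OMP runtime is linked, which matches the libgomp-vs-LLVM/Intel discussion further down the thread.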
Pedro. On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier <[email protected]> wrote: > > The call stacks there are mostly associated with the execution engine > threads, which are not OMP threads. That lockup doesn't look to me to be > related to OMP -- the execution engine uses its own thread pool logic -- > I'm pretty familiar with that part of the code. Unless I am missing one -- > can you point to the one that looks OMP-related? > > > On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy <[email protected]> > wrote: > > > Thanks for digging that out Kellen. That's good info so maybe it would > > be good to rework the fix with the info you provided and remove the > > pthread_atfork handlers. > > Do you think setting the device would avoid the problem seen on the > > backtrace you provided? specifically here: > > > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24 > > > > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland > > <[email protected]> wrote: > > > > > > I remember at the time we also had a read through of this blog post, but > > to > > > us the code looked like it was following the advice: > > > > > https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/ > > > > > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland < > > > [email protected]> wrote: > > > > > > > I remember this hang as well, it was pretty hard to reproduce IIRC. I > > > > believe the stacks for the hang are here: > > > > > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 > > and > > > > the trick was we could only debug it up to the point that we hit: > > > > > > > > #0 0x00007fec6df1ba4f in futex_wait (private=0, expected=1, > > > > futex_word=0x7fec60843758) > > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61 > > > > #1 futex_wait_simple (private=0, expected=1, > > futex_word=0x7fec60843758) > > > > at ../sysdeps/nptl/futex-internal.h:135 > > > > #2 __pthread_once_slow (once_control=0x7fec60843758, > > > > init_routine=0x7fec605f38f0) > > > > at pthread_once.c:105 > > > > ... > > > > #6 0x00007fec6061c577 in cudaSetDevice () from > > > > /usr/local/cuda/lib64/libcudart.so.9.0 > > > > > > > > because the code in libcudart is obviously closed source we couldn't > > dig > > > > into what threading work was going on when we called cudaSetDevice. > > > > > > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy < > > [email protected]> > > > > wrote: > > > > > > > >> If you check initialize.cc we seem to be explicitly disabling that > > > >> behaviour in pthread_at_fork which seems to cause thread contention > > > >> during multiprocessing. Why do we need this major advantage for the > > > >> library if that's the case? > > > >> > > > >> Related PRs: > > > >> > > > >> https://github.com/apache/incubator-mxnet/pull/10820 > > > >> https://github.com/apache/incubator-mxnet/issues/14396 > > > >> > > > >> The original code was authored in this PR: > > > >> > > > >> https://github.com/apache/incubator-mxnet/pull/8677 > > > >> > > > >> I actually remember this fix, it was done during a release as the cuda > > > >> runtime was forking and the engine was being re-entered. If that > > > >> situation is not happening anymore it might not be needed any longer. > > > >> I don't think we know the cause why there was a fork inside cuda, so > > > >> the code has grown around a fix for an issue whose root cause was > > > >> not understood, and side effects which this fix caused afterwards.
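For reference, the per-thread device pattern recommended by the NVIDIA post Kellen links above (and which Pedro's "setting the device" question refers to) looks roughly like the following. This is a sketch only: device 0, the thread count and the allocation are placeholders, not MXNet code.

// Sketch of the "always set the current device in every host thread"
// pattern from the linked NVIDIA post; placeholders, not MXNet code.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

void worker(int device_id) {
  // The current device is per-host-thread state, so each thread that
  // issues CUDA calls should select its device explicitly first.
  cudaError_t err = cudaSetDevice(device_id);
  if (err != cudaSuccess) {
    std::printf("cudaSetDevice failed: %s\n", cudaGetErrorString(err));
    return;
  }
  void* buf = nullptr;
  if (cudaMalloc(&buf, 1 << 20) == cudaSuccess) {
    // ... launch kernels / copies on this device ...
    cudaFree(buf);
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) threads.emplace_back(worker, /*device_id=*/0);
  for (auto& t : threads) t.join();
  return 0;
}

Whether doing this consistently would have avoided the futex wait inside cudaSetDevice in the backtrace above is exactly the open question; since libcudart is closed source, that part remains speculative.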
> > > >> > > > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in > > > >> the link above, no libgomp. > > > >> > > > >> I didn't try the Make build. > > > >> > > > >> I would refactor the code linked above and stop using pthread_at_fork, > > > >> since OMP assumes it won't be initialized twice, but needs to be very > > > >> well tested to make sure it doesn't cause bugs or affect the fixes > > > >> done on the linked PRs above. > > > >> > > > >> Pedro. > > > >> > > > >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <[email protected]> > > > >> wrote: > > > >> > > > > >> > one major advantage of intel/llvm omp is that it spawns a new thread > > > >> pool > > > >> > after fork if a thread pool was already created. this is so that omp > > > >> can be > > > >> > used in the forked processes. libgomp doesn’t do this so it’ll just > > > >> lock up > > > >> > if you try to do omp in the forked process. > > > >> > > > > >> > is your build linking libgomp as well? > > > >> > > > > >> > standard mkl build (from Makefile) uses same omp library. are there > > > >> > problems with that build? > > > >> > > > > >> > what changes need to be made to make the assertion not fire? > > > >> > > > > >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy < > > > >> [email protected]> > > > >> > wrote: > > > >> > > > > >> > > There's an assertion which is easily reproducible, and also > > there's a > > > >> > > crash including core dump, the latter is not easy to reproduce > > for me > > > >> > > in different environments. I have also seen mxnet getting stuck > > > >> > > without progressing with this build configuration and using no > > CPU at > > > >> > > all when running unit tests. > > > >> > > > > > >> > > In my view, the root cause of the assertion is that we are > > re-entering > > > >> > > OMP initialization when spawning threads on the following code > > through > > > >> > > pthread_at_fork > > > >> > > > > > >> > > > > > >> > > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58 > > > >> > > > > > >> > > This causes double initialization of the OMP engine, including the > > > >> > > assertion which you are asking about, and I suspect some > > additional > > > >> > > overhead. That's the shady forking part you are asking for. > > > >> > > > > > >> > > A question for you: What is the cause of runtime differences > > between > > > >> > > OMP runtimes? Shouldn't the implementation overhead diminish as > > > >> > > threads run longer? > > > >> > > > > > >> > > Pedro. > > > >> > > > > > >> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier < > > [email protected]> > > > >> > > wrote: > > > >> > > > > > > >> > > > What’s the reason for the assertion failure? btw classifying an > > > >> assertion > > > >> > > > failure a “crash” is debatable. As I stated in the original > > issue a > > > >> long > > > >> > > > time ago, it’s possible something shady is being done with when > > > >> forking > > > >> > > > that should be fixed. The assertion should be root caused. > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy < > > > >> > > [email protected]> > > > >> > > > wrote: > > > >> > > > > > > >> > > > > Added a dockerfile, and reports of a crash in my local machine > > > >> when > > > >> > > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened > > as > > > >> well. > > > >> > > > > I couldn't reproduce the crash on my EC2 machine: > > > >> > > > > Added the backtrace of the crash as well. 
> > > >> > > > > > > > >> > > > > https://github.com/apache/incubator-mxnet/issues/10856 > > > >> > > > > > > > >> > > > > Dockerfile here: > > > >> > > > > > > > >> > > > > https://github.com/larroy/mxnet_omp > > > >> > > > > > > > >> > > > > Kind regards. > > > >> > > > > > > > >> > > > > Pedro. > > > >> > > > > > > > >> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu < > > > >> > > [email protected]> > > > >> > > > > wrote: > > > >> > > > > > > > > >> > > > > > As already proposed, I think the easiest way to get a common > > > >> > > > > understanding > > > >> > > > > > is if we start with a few docker containers. Pedro, would > > it be > > > >> > > possible > > > >> > > > > > for you to wrap your benchmarks into a few containers that > > will > > > >> > > produce > > > >> > > > > > your shown results? That way, we can avoid possible > > > >> > > misunderstandings and > > > >> > > > > > also pinpoint the exact parts where people disagree or > > > >> misunderstood > > > >> > > each > > > >> > > > > > other. > > > >> > > > > > > > > >> > > > > > -Marco > > > >> > > > > > > > > >> > > > > > Pedro Larroy <[email protected]> schrieb am Do., > > > >> 20. Juni > > > >> > > > > 2019, > > > >> > > > > > 21:47: > > > >> > > > > > > > > >> > > > > > > I can confirm that we are linking with two versions of > > omp, > > > >> I'm > > > >> > > > > > > gaining more clarity into this topic, but I have still > > > >> questions, > > > >> > > the > > > >> > > > > > > facts that I got so far are the folllowing: > > > >> > > > > > > > > > >> > > > > > > * #1: We are linking with two versions of omp, intel's omp > > > >> and llvm > > > >> > > > > > > openmp when building with MKL enabled. > > > >> > > > > > > * #2: We have 3 different possible OMP versions: Intel OMP > > > >> (comes > > > >> > > with > > > >> > > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with > > gcc) > > > >> (This > > > >> > > > > > > one is used on the PR proposed by Anton). > > > >> > > > > > > > > > >> > > > > > > Questions: > > > >> > > > > > > > > > >> > > > > > > * #1 Is it ok to have two versions of openmp linked at > > the > > > >> same > > > >> > > time? > > > >> > > > > > > * #2 Which implementation of OMP gives the best > > > >> performance? (See > > > >> > > > > > > total training time of my measurement for a partial > > answer) > > > >> > > > > > > * #3 Should we have a build flag so we can choose the OMP > > > >> version > > > >> > > at > > > >> > > > > > > runtime? > > > >> > > > > > > * #4 Which Compiler and build flags did Chris use to get > > 10x > > > >> > > slowdown? > > > >> > > > > > > * #5 @Stas: is there a script to replicate your > > benchmarks > > > >> > > easily? If > > > >> > > > > > > so could you provide a link? I think we would need to > > > >> reproduce > > > >> > > your > > > >> > > > > > > benchmarks and verify which versions are being linked. > > It's > > > >> > > possible > > > >> > > > > > > that while compiling with MKL intel's omp was pulled in > > > >> instead of > > > >> > > > > > > GNU OpenMP. > > > >> > > > > > > * #6 @Chris: how to maintain the copy of LLVM's Openmp? > > > >> Should we > > > >> > > > > > > update the subrepo regularly? > > > >> > > > > > > > > > >> > > > > > > My conclusion so far: > > > >> > > > > > > > > > >> > > > > > > * #1 We should avoid linking two versions of omp if > > possible > > > >> and > > > >> > > > > > > allow users to choose one in the build as we do for BLAS. 
> > > >> > > > > > > * #2 For performance reasons and more control vs > > different > > > >> > > compiler > > > >> > > > > > > versions seems it makes indeed sense to keep the LLVM > > OpenMP > > > >> > > version > > > >> > > > > > > in 3rdparty for now. So unless some more data is > > gathered, it > > > >> makes > > > >> > > > > > > sense not to remove it as of now. > > > >> > > > > > > * #3 We should provide build options to choose which > > openmp > > > >> > > library > > > >> > > > > > > is to be used from the three options available, including > > > >> libgomp. > > > >> > > > > > > * #4 Refining the build we could also enable OpenMP in > > mac > > > >> without > > > >> > > > > > > additional contortions (doesn't work as of today): > > > >> > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/ > > > >> > > > > > > * #5 We should add different omp versions to our > > benchmarks > > > >> and > > > >> > > track > > > >> > > > > > > the performance, so this data is available for prescribing > > > >> the best > > > >> > > > > > > build options and for binary releases. > > > >> > > > > > > > > > >> > > > > > > This is also an interesting related gh issue posted in the > > > >> mkl-dnn > > > >> > > > > > > repository: https://github.com/intel/mkl-dnn/issues/230 > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > I don't observe the order of magnitude divergence > > reported by > > > >> > > Chris in > > > >> > > > > > > vanilla Ubuntu 18.04 in samples / s but the full training > > > >> finishes > > > >> > > > > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs > > > >> libgomp. > > > >> > > > > > > > > > >> > > > > > > There's also differences in training time when using MKL > > and > > > >> the , > > > >> > > > > > > it's actually a bit slower, I don't know if it's related > > to > > > >> OMP. 
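On question #1 / conclusion #1 above (two OMP runtimes in one process): besides the ldd output quoted below, a small runtime probe can confirm what actually ends up loaded. A Linux-only sketch, essentially ldd at runtime, matching library names against the three runtimes discussed in this thread:

// Diagnostic sketch (Linux-specific): list every OpenMP runtime loaded
// into the current process, to catch the "two OMP libraries at once"
// situation discussed in this thread. Usage, with a hypothetical path:
//   ./omp_probe build/libmxnet.so        (link with -ldl)
#include <dlfcn.h>
#include <link.h>
#include <cstdio>
#include <cstring>

static int print_omp_libs(struct dl_phdr_info* info, size_t, void*) {
  const char* name = info->dlpi_name;
  if (name && (std::strstr(name, "libomp") || std::strstr(name, "libgomp") ||
               std::strstr(name, "libiomp"))) {
    std::printf("OpenMP runtime loaded: %s\n", name);
  }
  return 0;  // continue iterating over loaded objects
}

int main(int argc, char** argv) {
  if (argc > 1 && !dlopen(argv[1], RTLD_NOW | RTLD_GLOBAL)) {
    std::printf("dlopen failed: %s\n", dlerror());
    return 1;
  }
  dl_iterate_phdr(print_omp_libs, nullptr);
  return 0;
}

On a build like the MKL-ON one below it should report both libomp.so and libiomp5.so, and only libgomp.so.1 on Anton's branch, consistent with the ldd output.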
> > > >> > > > > > > > > > >> > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) > > > >> > > > > > > > > > >> > > > > > > Anton's branch: [email protected]:lebeg/incubator-mxnet.git > > > >> branch > > > >> > > > > 'omp' > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd > > > >> > > > > > > build/libmxnet.so |grep -i omp > > > >> > > > > > > libgomp.so.1 => > > /usr/lib/x86_64-linux-gnu/libgomp.so.1 > > > >> > > > > > > (0x00007fd99a51d000) > > > >> > > > > > > > > > >> > > > > > > time python train_mnist.py > > > >> > > > > > > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176 > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 > > > >> samples/sec > > > >> > > > > > > accuracy=1.000000 > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999531 > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 > > > >> samples/sec > > > >> > > > > > > accuracy=1.000000 > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999375 > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999531 > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717 > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.219 > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977 > > > >> > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU > > > >> (0avgtext+0avgdata > > > >> > > > > > > 1146052maxresident)k > > > >> > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps > > > >> > > > > > > > > > >> > > > > > > Master, MKL ON: > > > >> > > > > > > > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification > > > >> [master]> > > > >> > > ldd > > > >> > > > > > > ../../build/libmxnet.so | grep -i omp > > > >> > > > > > > libomp.so => > > > >> > > > > > > > > > >> > > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so > > > >> > > > > > > (0x00007f05ba38f000) > > > >> > > > > > > libiomp5.so => > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > >> > > /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so > > > >> > > > > > > (0x00007f05b09f4000) > > > >> > > > > > > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484 > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100] Speed: 36651.63 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999691 > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 45093.98 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 45146.84 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 45119.90 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > 
> > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 44998.96 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999531 > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 45072.25 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 44969.79 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 44962.78 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 44945.47 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999375 > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717 > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.367 > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783 > > > >> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU > > > >> (0avgtext+0avgdata > > > >> > > > > > > 1154348maxresident)k > > > >> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > MKL OFF: > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> > > grep -i > > > >> MKL > > > >> > > > > > > cmake_options.yml > > > >> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found > > > >> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL > > > >> found) IF > > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE) > > > >> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL > > found) > > > >> IF > > > >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE) > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd > > > >> > > > > > > build/libmxnet.so |grep -i omp > > > >> > > > > > > libomp.so => > > > >> > > > > > > > > > >> > > > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so > > > >> > > > > > > (0x00007fb720c54000) > > > >> > > > > > > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479 > > > >> > > > > > > INFO:root:Epoch[19] Batch [0-100] Speed: 46784.02 > > > >> samples/sec > > > >> > > > > > > accuracy=1.000000 > > > >> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 48824.29 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 49190.31 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 51518.77 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 51551.62 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 49026.35 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 49002.46 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999375 > > > >> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 48980.55 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999687 > > > >> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 47402.56 > > > >> samples/sec > > > >> > > > > > > accuracy=0.999844 > > > >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767 > > > >> > > > > > > INFO:root:Epoch[19] Time cost=1.259 > > > >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181 > > > >> > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU > > > >> (0avgtext+0avgdata > > > >> > > > 
> > > 1147008maxresident)k > > > >> > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps > > > >> > > > > > > > > > >> > > > > > > Let me know what you think. > > > >> > > > > > > > > > >> > > > > > > Link to the original PR: > > > >> > > > > > > https://github.com/apache/incubator-mxnet/pull/12160 > > > >> > > > > > > > > > >> > > > > > > Thanks. > > > >> > > > > > > > > > >> > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland > > > >> > > > > > > <[email protected]> wrote: > > > >> > > > > > > > > > > >> > > > > > > > "if you’re linking in two then you’re doing something > > > >> wrong." > > > >> > > > > Correct, > > > >> > > > > > > > that's one thing I believe we've got consensus on. So > > > >> let's call > > > >> > > > > that > > > >> > > > > > > out > > > >> > > > > > > > as a bug to be fixed. > > > >> > > > > > > > > > > >> > > > > > > > Let's move forward with some reproducible numbers and > > then > > > >> > > discuss > > > >> > > > > the > > > >> > > > > > > pros > > > >> > > > > > > > / cons of which particular OMP implementation we should > > use. > > > >> > > > > > > > > > > >> > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy < > > > >> > > > > > > [email protected]> > > > >> > > > > > > > wrote: > > > >> > > > > > > > > > > >> > > > > > > > > Hi Chris > > > >> > > > > > > > > > > > >> > > > > > > > > I would ask you to have a bit of patience and help us > > > >> with your > > > >> > > > > > > > > experience in this matter. Nobody is ignoring > > anything, I > > > >> > > think we > > > >> > > > > are > > > >> > > > > > > > > individually gathering feedbacks and trying to > > understand > > > >> the > > > >> > > > > multiple > > > >> > > > > > > > > contributions done to this topic including yours, > > then go > > > >> step > > > >> > > by > > > >> > > > > > > > > step, understand what is going on and run experiments > > and > > > >> > > report > > > >> > > > > back > > > >> > > > > > > > > to the list or the corresponding github item. It was > > > >> suggested > > > >> > > by > > > >> > > > > > > > > Kellen to prepare some containers, this takes effort. > > > >> > > > > > > > > > > > >> > > > > > > > > Regarding your final comment, most of us also have > > many > > > >> other > > > >> > > > > things > > > >> > > > > > > > > to do and responsibilities even if our daytime jobs > > might > > > >> > > involve > > > >> > > > > > > > > MXNet in some form or another. I think that's part of > > the > > > >> > > privilege > > > >> > > > > > > > > and responsibility of working close with an open > > source > > > >> > > project and > > > >> > > > > > > > > the magic of collaboration across organizations. Let's > > > >> all be > > > >> > > > > patient > > > >> > > > > > > > > and take some time to understand and reason about this > > > >> topic > > > >> > > which > > > >> > > > > is > > > >> > > > > > > > > not simple. Since we decided to step back and gather > > more > > > >> data > > > >> > > > > let's > > > >> > > > > > > > > take time and do it properly. > > > >> > > > > > > > > > > > >> > > > > > > > > Personally I hope to find time to look again into this > > > >> issue > > > >> > > before > > > >> > > > > > > > > the end of the week. > > > >> > > > > > > > > > > > >> > > > > > > > > Thanks. > > > >> > > > > > > > > > > > >> > > > > > > > > Pedro. 
> > > >> > > > > > > > > > > > >> > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier < > > > >> > > > > [email protected]> > > > >> > > > > > > > > wrote: > > > >> > > > > > > > > > > > > >> > > > > > > > > > if you’re linking in two then you’re doing something > > > >> wrong. > > > >> > > You > > > >> > > > > can > > > >> > > > > > > see > > > >> > > > > > > > > by > > > >> > > > > > > > > > my email yesterday that only one is linked in. This > > is > > > >> also > > > >> > > the > > > >> > > > > case > > > >> > > > > > > with > > > >> > > > > > > > > > the mkl version built by the Makefile — only the > > Intel > > > >> OMP > > > >> > > > > library is > > > >> > > > > > > > > used > > > >> > > > > > > > > > (no libgomp). > > > >> > > > > > > > > > > > > >> > > > > > > > > > That being said, Do you have clear evidence that > > using > > > >> Intel > > > >> > > OMP > > > >> > > > > is > > > >> > > > > > > both > > > >> > > > > > > > > > problematic and the situation isn’t fixable? The > > > >> burden of > > > >> > > > > proof is > > > >> > > > > > > on > > > >> > > > > > > > > the > > > >> > > > > > > > > > ones requesting the change — it is not my > > > >> responsibility to > > > >> > > > > justify > > > >> > > > > > > the > > > >> > > > > > > > > > current state. There must be something “terrible” > > and > > > >> > > unfixable > > > >> > > > > to > > > >> > > > > > > > > justify > > > >> > > > > > > > > > a change. I have seen no proof of this in all this > > > >> time. > > > >> > > > > > > > > > > > > >> > > > > > > > > > On a side note, I mentioned a couple of things in my > > > >> email > > > >> > > > > yesterday > > > >> > > > > > > that > > > >> > > > > > > > > > still are not being responded to (they were also > > > >> ignored in > > > >> > > the > > > >> > > > > last > > > >> > > > > > > > > > incarnation of this “discussion” — I have much > > > >> experience in > > > >> > > this > > > >> > > > > > > matter > > > >> > > > > > > > > to > > > >> > > > > > > > > > assume “discussion” is a waste of my time, seeing > > and I > > > >> am > > > >> > > not > > > >> > > > > paid > > > >> > > > > > > to > > > >> > > > > > > > > > “work on” mxnet like y’all are). > > > >> > > > > > > > > > > > > >> > > > > > > > > > -C > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland < > > > >> > > > > > > > > > [email protected]> wrote: > > > >> > > > > > > > > > > > > >> > > > > > > > > > > I've also quite often seen two versions of OpenMP > > > >> linked. > > > >> > > I > > > >> > > > > think > > > >> > > > > > > we > > > >> > > > > > > > > can > > > >> > > > > > > > > > > all agree we probably want to avoid linking in two > > > >> > > libraries > > > >> > > > > that > > > >> > > > > > > do > > > >> > > > > > > > > > > effectively the same thing. > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > The performance questions should be fairly > > straight > > > >> > > forward to > > > >> > > > > > > > > demonstrate > > > >> > > > > > > > > > > right? Could we just collaborate on a few minimal > > > >> > > Dockerfiles > > > >> > > > > that > > > >> > > > > > > > > show > > > >> > > > > > > > > > > (or don't show) Intel OpenMP performance speedups > > > >> with the > > > >> > > > > > > workloads > > > >> > > > > > > > > Chris > > > >> > > > > > > > > > > is referencing? 
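On the reproducibility point: as one ingredient for such containers, a minimal OMP-bound loop is enough to compare runtimes in isolation. A hypothetical sketch, not the benchmark behind any of the numbers quoted in this thread:

// Hypothetical micro-benchmark sketch for comparing OpenMP runtimes;
// build with -fopenmp and time the same binary against different
// runtimes. Sizes and iteration counts are arbitrary placeholders.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 24;
  std::vector<float> x(n, 1.0f), y(n, 2.0f);
  double best = 1e30;
  for (int iter = 0; iter < 20; ++iter) {
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) y[i] += 0.5f * x[i];
    double dt = omp_get_wtime() - t0;
    if (dt < best) best = dt;
  }
  std::printf("threads=%d best=%.3f ms\n", omp_get_max_threads(), best * 1e3);
  return 0;
}

Building it once with gcc -fopenmp and then swapping the runtime via LD_PRELOAD (the LLVM/Intel runtime ships a GOMP-compatible interface, so this usually works) would isolate the OMP library itself from compiler and MKL differences.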
> > > >> > > > > > > > > > > > > > >> > > > > > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, > > Stanislav < > > > >> > > > > > > > > > > [email protected]> wrote: > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > Hi, Chris! > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Stas here - I've gathered that performance data. > > > >> > > > > > > > > > > > Sure thing, I can be wrong, but please > > elaborate a > > > >> bit on > > > >> > > > > what > > > >> > > > > > > we are > > > >> > > > > > > > > > > > missing. > > > >> > > > > > > > > > > > Be assured, intentional misdirection was never a > > > >> case. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Thanks a lot for being constructive. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > Turning Intel OMP on and off (and MKL as well, > > > >> since it > > > >> > > > > tends > > > >> > > > > > > to > > > >> > > > > > > > > pull > > > >> > > > > > > > > > > in > > > >> > > > > > > > > > > > omp, depending which one is linked in). > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > We never ever considered turning MKL off. We > > are on > > > >> the > > > >> > > same > > > >> > > > > page > > > >> > > > > > > > > here - > > > >> > > > > > > > > > > > MKL is crucial for the performance. > > > >> > > > > > > > > > > > Why should we? There's a GOMP-linked version of > > MKL, > > > >> > > that we > > > >> > > > > can > > > >> > > > > > > use. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > What we did - we measured, if using compilers > > > >> default > > > >> > > OpenMP > > > >> > > > > > > > > > > > implementation instead of referenced source code > > > >> > > > > distribution of > > > >> > > > > > > > > OpenMP > > > >> > > > > > > > > > > > makes anything slower. > > > >> > > > > > > > > > > > We have found the impact to be hardly > > measurable. > > > >> > > > > > > > > > > > The difference between GOMP and iOMP is <5% on > > our > > > >> > > > > benchmarks, > > > >> > > > > > > most > > > >> > > > > > > > > of > > > >> > > > > > > > > > > the > > > >> > > > > > > > > > > > time less than that. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > We just suggest to simplify the build of mxnet, > > by > > > >> > > removing > > > >> > > > > the > > > >> > > > > > > > > > > > unnecessary dependency. 
> > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > During that we discovered for example the > > following > > > >> > > amazing > > > >> > > > > > > issue: > > > >> > > > > > > > > > > > > > > >> https://github.com/apache/incubator-mxnet/issues/14087 > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Best Regards > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Stas > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > On 18.06.19, 18:24, "Chris Olivier" < > > > >> > > [email protected]> > > > >> > > > > > > wrote: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > I am very reluctant to feed the trolls > > again, > > > >> and > > > >> > > this > > > >> > > > > will > > > >> > > > > > > be > > > >> > > > > > > > > teh > > > >> > > > > > > > > > > last > > > >> > > > > > > > > > > > time I address Pedro or Anton on the > > subject, > > > >> but > > > >> > > since I > > > >> > > > > > > think > > > >> > > > > > > > > the > > > >> > > > > > > > > > > > numbers > > > >> > > > > > > > > > > > being presented are incorrect (either by te > > > >> builders > > > >> > > not > > > >> > > > > > > really > > > >> > > > > > > > > > > > understanding what they are building, or > > > >> possibly > > > >> > > > > intentional > > > >> > > > > > > > > > > > misdirection): > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Turning Intel OMP on and off (and MKL as > > well, > > > >> since > > > >> > > it > > > >> > > > > > > tends to > > > >> > > > > > > > > pull > > > >> > > > > > > > > > > > in > > > >> > > > > > > > > > > > omp, depending which one is linked in). > > > >> > > > > > > > > > > > There is a HUGE difference. This is > > consistent > > > >> with > > > >> > > my > > > >> > > > > > > > > experience > > > >> > > > > > > > > > > > before > > > >> > > > > > > > > > > > when it was added. 
> > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > default mnist: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > python > > > >> ../example/image-classification/train_mnist.py > > > >> > > > > > > > > > > > INFO:root:start with arguments > > > >> > > Namespace(add_stn=False, > > > >> > > > > > > > > > > batch_size=64, > > > >> > > > > > > > > > > > disp_batches=100, dtype='float32', > > > >> gc_threshold=0.5, > > > >> > > > > > > > > gc_type='none', > > > >> > > > > > > > > > > > gpus=None, image_shape='1, 28, 28', > > > >> > > > > initializer='default', > > > >> > > > > > > > > > > > kv_store='device', load_epoch=None, loss='', > > > >> lr=0.05, > > > >> > > > > > > > > lr_factor=0.1, > > > >> > > > > > > > > > > > lr_step_epochs='10', macrobatch_size=0, > > > >> > > > > model_prefix=None, > > > >> > > > > > > > > mom=0.9, > > > >> > > > > > > > > > > > monitor=0, network='mlp', num_classes=10, > > > >> > > num_epochs=20, > > > >> > > > > > > > > > > > num_examples=60000, num_layers=None, > > > >> optimizer='sgd', > > > >> > > > > > > > > > > > profile_server_suffix='', > > > >> profile_worker_suffix='', > > > >> > > > > > > > > save_period=1, > > > >> > > > > > > > > > > > test_io=0, top_k=0, warmup_epochs=5, > > > >> > > > > > > warmup_strategy='linear', > > > >> > > > > > > > > > > > wd=0.0001) > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > INTEL OMP: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > ldd libmxnet.so | grep omp > > > >> > > > > > > > > > > > libomp.so => > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > >> /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so > > > >> > > > > > > > > > > > (0x00007f978fde7000) > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > :root:Epoch[0] Batch [0-100] Speed: > > > >> 31548.09 > > > >> > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.780012 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [100-200] > > Speed: > > > >> > > 16073.21 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.920469 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [200-300] > > Speed: > > > >> > > 19075.91 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.928281 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [300-400] > > Speed: > > > >> > > 23211.36 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.942813 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [400-500] > > Speed: > > > >> > > 22139.79 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.938750 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [500-600] > > Speed: > > > >> > > 23225.52 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.946562 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [600-700] > > Speed: > > > >> > > 19547.41 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.953281 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [700-800] > > Speed: > > > >> > > 24111.73 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.951562 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [800-900] > > Speed: > > > >> > > 13959.88 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.957500 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Train-accuracy=0.925423 > > > >> > > > 
> > > > > > > > INFO:root:Epoch[0] Time cost=3.806 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] > > Validation-accuracy=0.962580 > > > >> > > > > > > > > > > > INFO:root:Epoch[1] Batch [0-100] > > Speed: > > > >> > > 24560.21 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.968131 > > > >> > > > > > > > > > > > INFO:root:Epoch[1] Batch [100-200] > > Speed: > > > >> > > 23457.03 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.966250 > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > LIBGOMP: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > ldd libmxnet.so | grep omp > > > >> > > > > > > > > > > > libgomp.so.1 => > > > >> > > > > > > /usr/lib/x86_64-linux-gnu/libgomp.so.1 > > > >> > > > > > > > > > > > (0x00007f25c25dd000) > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [0-100] > > Speed: > > > >> > > 1731.01 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.782488 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [100-200] > > Speed: > > > >> > > 3551.32 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.907813 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [200-300] > > Speed: > > > >> > > 1991.00 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.927188 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [300-400] > > Speed: > > > >> > > 2175.45 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.937969 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [400-500] > > Speed: > > > >> > > 1644.95 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.942187 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [500-600] > > Speed: > > > >> > > 6444.58 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.950156 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [600-700] > > Speed: > > > >> > > 7842.16 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.947969 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [700-800] > > Speed: > > > >> > > 9412.07 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.953750 > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [800-900] > > Speed: > > > >> > > 12707.58 > > > >> > > > > > > > > samples/sec > > > >> > > > > > > > > > > > accuracy=0.953125 > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > That being said, there's other issued beyond > > > >> speed. > > > >> > > The > > > >> > > > > > > DEFAULT > > > >> > > > > > > > > > > build > > > >> > > > > > > > > > > > from > > > >> > > > > > > > > > > > makefile (not CMake) uses Intel OMP mkl (I > > > >> showed > > > >> > > > > before) and > > > >> > > > > > > > > > > > mysteriously > > > >> > > > > > > > > > > > it has no issues? This seems highly > > suspicious. > > > >> > > All I > > > >> > > > > see > > > >> > > > > > > is a > > > >> > > > > > > > > lot > > > >> > > > > > > > > > > of > > > >> > > > > > > > > > > > hand-waving and conjecture and pointing to > > > >> > > StackOverflow > > > >> > > > > > > posts > > > >> > > > > > > > > made > > > >> > > > > > > > > > > by > > > >> > > > > > > > > > > > people who may be of questionable pedigree > > to > > > >> begin > > > >> > > with. 
> > > >> > > > > > > This > > > >> > > > > > > > > > > smells > > > >> > > > > > > > > > > > of a > > > >> > > > > > > > > > > > Pedro-ego-fight rather than one of purely > > > >> technical > > > >> > > > > merit. > > > >> > > > > > > > > Also, if > > > >> > > > > > > > > > > > one > > > >> > > > > > > > > > > > knows how OMP works, they would be very > > > >> suspicious > > > >> > > of > > > >> > > > > the > > > >> > > > > > > > > > > > "intermittent > > > >> > > > > > > > > > > > hangs" claim -- that's probably just broken > > race > > > >> > > > > conditions > > > >> > > > > > > > > elsewhere > > > >> > > > > > > > > > > > until > > > >> > > > > > > > > > > > proven differently. It'd tend freeze on the > > > >> first > > > >> > > use if > > > >> > > > > > > > > something > > > >> > > > > > > > > > > is > > > >> > > > > > > > > > > > wrong (try using libgomp after a fork and > > see), > > > >> since > > > >> > > > > worker > > > >> > > > > > > > > threads" > > > >> > > > > > > > > > > > wouldn't be assigned/joined properly. > > IntelOMP > > > >> is > > > >> > > > > faster, > > > >> > > > > > > but > > > >> > > > > > > > > also > > > >> > > > > > > > > > > has > > > >> > > > > > > > > > > > other advantages, such as allowing OMP > > after a > > > >> fork. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > I actually addressed a lot of issues and > > ask for > > > >> > > > > > > clarification > > > >> > > > > > > > > in the > > > >> > > > > > > > > > > > original PR's way back when, but they're all > > > >> just > > > >> > > > > ignored. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > -Chris > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > >> > > > > > > > >
