I remember at the time we also had a read-through of this blog post, but to us the code looked like it was following the advice: https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
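For reference, a minimal sketch of the pattern that pro-tip recommends (my own illustration, not MXNet code): every host thread that issues CUDA calls sets its device explicitly before any other runtime call, instead of assuming it inherited the right one.

#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical worker; the current device is per-thread state in the CUDA
// runtime, so each thread sets it before making any other CUDA call.
void worker(int device_id) {
  cudaSetDevice(device_id);
  void* buf = nullptr;
  cudaMalloc(&buf, 1 << 20);  // now guaranteed to land on device_id
  cudaFree(buf);
}

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  std::vector<std::thread> threads;
  for (int i = 0; i < n; ++i) threads.emplace_back(worker, i);
  for (auto& t : threads) t.join();
  return 0;
}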
On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland < kellen.sunderl...@gmail.com> wrote: > I remember this hang as well, it was pretty hard to reproduce IIRC. I > believe the stacks for the hang are here: > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and > the trick was we could only debug it up to the point that we hit: > > #0 0x00007fec6df1ba4f in futex_wait (private=0, expected=1, > futex_word=0x7fec60843758) > at ../sysdeps/unix/sysv/linux/futex-internal.h:61 > #1 futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758) > at ../sysdeps/nptl/futex-internal.h:135 > #2 __pthread_once_slow (once_control=0x7fec60843758, > init_routine=0x7fec605f38f0) > at pthread_once.c:105 > ... > #6 0x00007fec6061c577 in cudaSetDevice () from > /usr/local/cuda/lib64/libcudart.so.9.0 > > Because the code in libcudart is obviously closed source, we couldn't dig > into what threading work was going on when we called cudaSetDevice. > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <pedro.larroy.li...@gmail.com> > wrote: > >> If you check initialize.cc we seem to be explicitly disabling that >> behaviour in pthread_atfork, which seems to cause thread contention >> during multiprocessing. Why do we need this major advantage for the >> library if that's the case? >> >> Related PRs: >> >> https://github.com/apache/incubator-mxnet/pull/10820 >> https://github.com/apache/incubator-mxnet/issues/14396 >> >> The original code was authored in this PR: >> >> https://github.com/apache/incubator-mxnet/pull/8677 >> >> I actually remember this fix, it was done during a release as the cuda >> runtime was forking and the engine was being re-entered. If that >> situation is not happening anymore, it might not be needed any longer. >> I don't think we know why there was a fork inside cuda, so >> the code has grown around a fix for an issue whose root cause was >> not understood, and around the side effects this fix caused afterwards. >> >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in >> the link above, no libgomp. >> >> I didn't try the Make build. >> >> I would refactor the code linked above and stop using pthread_atfork, >> since OMP assumes it won't be initialized twice, but this needs to be very >> well tested to make sure it doesn't cause bugs or affect the fixes >> done on the linked PRs above. >> >> Pedro. >> >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier <cjolivie...@gmail.com> >> wrote: >> > >> > one major advantage of intel/llvm omp is that it spawns a new thread >> pool >> > after fork if a thread pool was already created. this is so that omp >> can be >> > used in the forked processes. libgomp doesn't do this so it'll just >> lock up >> > if you try to do omp in the forked process. >> > >> > is your build linking libgomp as well? >> > >> > standard mkl build (from Makefile) uses the same omp library. are there >> > problems with that build? >> > >> > what changes need to be made to make the assertion not fire? >> > >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy < >> pedro.larroy.li...@gmail.com> >> > wrote: >> > >> > > There's an assertion which is easily reproducible, and also there's a >> > > crash including a core dump; the latter is not easy for me to reproduce >> > > in different environments. I have also seen mxnet getting stuck >> > > without progressing with this build configuration and using no CPU at >> > > all when running unit tests.
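For context on the pthread_atfork discussion above, a minimal sketch of the mechanism (the handlers here are hypothetical stand-ins, not the actual initialize.cc code): whatever the child handler does after fork() is exactly where a re-created thread pool can initialize OMP a second time.

#include <pthread.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical stand-ins for the handlers registered in src/initialize.cc.
static void on_prepare() { std::printf("before fork: quiesce engine threads\n"); }
static void on_parent()  { std::printf("in parent: resume engine\n"); }
static void on_child() {
  // If this path (re)creates worker threads that enter OpenMP regions,
  // the OMP runtime gets initialized again in the child -- the suspected
  // double initialization and extra contention during multiprocessing.
  std::printf("in child: restart engine threads\n");
}

int main() {
  pthread_atfork(on_prepare, on_parent, on_child);
  if (fork() == 0) _exit(0);  // the child inherits only the forking thread
  return 0;
}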
>> > > >> > > In my view, the root cause of the assertion is that we are re-entering >> > > OMP initialization when spawning threads in the following code through >> > > pthread_atfork: >> > > >> > > >> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58 >> > > >> > > This causes double initialization of the OMP engine, including the >> > > assertion which you are asking about, and I suspect some additional >> > > overhead. That's the shady forking part you are asking about. >> > > >> > > A question for you: What is the cause of the runtime differences between >> > > OMP runtimes? Shouldn't the implementation overhead diminish as >> > > threads run longer? >> > > >> > > Pedro. >> > > >> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier <cjolivie...@gmail.com> >> > > wrote: >> > > > >> > > > What's the reason for the assertion failure? btw classifying an >> > > > assertion failure as a "crash" is debatable. As I stated in the original >> > > > issue a long time ago, it's possible something shady is being done when >> > > > forking that should be fixed. The assertion should be root caused. >> > > > >> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy < >> > > pedro.larroy.li...@gmail.com> >> > > > wrote: >> > > > >> > > > > Added a dockerfile and reports of a crash on my local machine when >> > > > > running MKL+OMP+DEBUG; with Anton's branch the crash happened as well. >> > > > > I couldn't reproduce the crash on my EC2 machine. >> > > > > Added the backtrace of the crash as well. >> > > > > >> > > > > https://github.com/apache/incubator-mxnet/issues/10856 >> > > > > >> > > > > Dockerfile here: >> > > > > >> > > > > https://github.com/larroy/mxnet_omp >> > > > > >> > > > > Kind regards. >> > > > > >> > > > > Pedro. >> > > > > >> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu < >> > > marco.g.ab...@gmail.com> >> > > > > wrote: >> > > > > > >> > > > > > As already proposed, I think the easiest way to get a common >> > > > > understanding >> > > > > > is if we start with a few docker containers. Pedro, would it be >> > > possible >> > > > > > for you to wrap your benchmarks into a few containers that will >> > > produce >> > > > > > your shown results? That way, we can avoid possible >> > > misunderstandings and >> > > > > > also pinpoint the exact parts where people disagree or misunderstood >> > > each >> > > > > > other. >> > > > > > >> > > > > > -Marco >> > > > > > >> > > > > > Pedro Larroy <pedro.larroy.li...@gmail.com> wrote on Thu., 20 Jun >> > > > > 2019, >> > > > > > 21:47: >> > > > > > >> > > > > > > I can confirm that we are linking with two versions of omp. I'm >> > > > > > > gaining more clarity on this topic, but I still have questions; the >> > > > > > > facts I have so far are the following: >> > > > > > > >> > > > > > > * #1: We are linking with two versions of omp, Intel's omp and LLVM >> > > > > > > OpenMP, when building with MKL enabled. >> > > > > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes >> > > > > > > with MKL), LLVM OpenMP (3rdparty/openmp), and libgomp (comes with >> > > > > > > gcc; this one is used in the PR proposed by Anton). >> > > > > > > >> > > > > > > Questions: >> > > > > > > >> > > > > > > * #1 Is it ok to have two versions of openmp linked at the same time? >> > > > > > > * #2 Which implementation of OMP gives the best performance?
(See >> > > > > > > total training time of my measurement for a partial answer) >> > > > > > > * #3 Should we have a build flag so we can choose the OMP version at >> > > > > > > build time? >> > > > > > > * #4 Which compiler and build flags did Chris use to get the 10x >> > > slowdown? >> > > > > > > * #5 @Stas: is there a script to replicate your benchmarks easily? If >> > > > > > > so, could you provide a link? I think we would need to reproduce your >> > > > > > > benchmarks and verify which versions are being linked. It's possible >> > > > > > > that while compiling with MKL, Intel's omp was pulled in instead of >> > > > > > > GNU OpenMP. >> > > > > > > * #6 @Chris: how do we maintain the copy of LLVM's OpenMP? Should we >> > > > > > > update the subrepo regularly? >> > > > > > > >> > > > > > > My conclusions so far: >> > > > > > > >> > > > > > > * #1 We should avoid linking two versions of omp if possible and >> > > > > > > allow users to choose one in the build as we do for BLAS. >> > > > > > > * #2 For performance reasons, and for more control across different >> > > > > > > compiler versions, it indeed makes sense to keep the LLVM OpenMP >> > > > > > > version in 3rdparty for now. So unless more data is gathered, it >> > > > > > > makes sense not to remove it as of now. >> > > > > > > * #3 We should provide build options to choose which openmp library >> > > > > > > is to be used from the three options available, including libgomp. >> > > > > > > * #4 By refining the build we could also enable OpenMP on Mac without >> > > > > > > additional contortions (it doesn't work as of today): >> > > > > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/ >> > > > > > > * #5 We should add the different omp versions to our benchmarks and >> > > track >> > > > > > > the performance, so this data is available for prescribing the best >> > > > > > > build options and for binary releases. >> > > > > > > >> > > > > > > This is also an interesting related gh issue posted in the mkl-dnn >> > > > > > > repository: https://github.com/intel/mkl-dnn/issues/230 >> > > > > > > >> > > > > > > I don't observe the order-of-magnitude divergence reported by Chris >> > > > > > > in vanilla Ubuntu 18.04 in samples/s, but the full training indeed >> > > > > > > finishes faster with the OMP from 3rdparty (LLVM OpenMP) vs libgomp. >> > > > > > > >> > > > > > > There are also differences in training time when using MKL; it's >> > > > > > > actually a bit slower, and I don't know if it's related to OMP.
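As a quick way to verify which runtime actually services a parallel region (a small probe of my own, nothing MXNet-specific), one can build the snippet below with -fopenmp and then point LD_LIBRARY_PATH or LD_PRELOAD at the candidate library, cross-checking with ldd as in the outputs below:

#include <omp.h>
#include <cstdio>

int main() {
  // Which library answers these calls depends on what the loader resolved:
  // libomp.so, libiomp5.so or libgomp.so.1 -- compare against `ldd`.
  std::printf("max threads: %d\n", omp_get_max_threads());
  #pragma omp parallel
  {
    #pragma omp single
    std::printf("threads in region: %d\n", omp_get_num_threads());
  }
  return 0;
}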
>> > > > > > > >> > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) >> > > > > > > >> > > > > > > Anton's branch: g...@github.com:lebeg/incubator-mxnet.git >> branch >> > > > > 'omp' >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd >> > > > > > > build/libmxnet.so |grep -i omp >> > > > > > > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 >> > > > > > > (0x00007fd99a51d000) >> > > > > > > >> > > > > > > time python train_mnist.py >> > > > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176 >> > > > > > > INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 >> samples/sec >> > > > > > > accuracy=1.000000 >> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 >> samples/sec >> > > > > > > accuracy=0.999531 >> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 >> samples/sec >> > > > > > > accuracy=0.999687 >> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 >> samples/sec >> > > > > > > accuracy=1.000000 >> > > > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 >> samples/sec >> > > > > > > accuracy=0.999687 >> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 >> samples/sec >> > > > > > > accuracy=0.999687 >> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 >> samples/sec >> > > > > > > accuracy=0.999375 >> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 >> samples/sec >> > > > > > > accuracy=0.999844 >> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 >> samples/sec >> > > > > > > accuracy=0.999531 >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717 >> > > > > > > INFO:root:Epoch[19] Time cost=1.219 >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983977 >> > > > > > > 1011.98user 26.78system 0:31.54elapsed 3292%CPU >> (0avgtext+0avgdata >> > > > > > > 1146052maxresident)k >> > > > > > > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps >> > > > > > > >> > > > > > > Master, MKL ON: >> > > > > > > >> > > > > > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification >> [master]> >> > > ldd >> > > > > > > ../../build/libmxnet.so | grep -i omp >> > > > > > > libomp.so => >> > > > > > > >> > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so >> > > > > > > (0x00007f05ba38f000) >> > > > > > > libiomp5.so => >> > > > > > > >> > > > > > > >> > > > > >> > > >> /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so >> > > > > > > (0x00007f05b09f4000) >> > > > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.982484 >> > > > > > > INFO:root:Epoch[19] Batch [0-100] Speed: 36651.63 >> samples/sec >> > > > > > > accuracy=0.999691 >> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 45093.98 >> samples/sec >> > > > > > > accuracy=0.999844 >> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 45146.84 >> samples/sec >> > > > > > > accuracy=0.999687 >> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 45119.90 >> samples/sec >> > > > > > > accuracy=0.999687 >> > > > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 44998.96 >> samples/sec >> > > > > > > accuracy=0.999531 >> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 45072.25 >> samples/sec >> > > > > > > accuracy=0.999844 >> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 44969.79 >> samples/sec >> > > > > > > accuracy=0.999844 >> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 44962.78 >> samples/sec >> > > > > > > accuracy=0.999844 >> > > > > > > INFO:root:Epoch[19] Batch 
[800-900] Speed: 44945.47 >> samples/sec >> > > > > > > accuracy=0.999375 >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999717 >> > > > > > > INFO:root:Epoch[19] Time cost=1.367 >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.982783 >> > > > > > > 854.97user 847.21system 0:41.44elapsed 4106%CPU >> (0avgtext+0avgdata >> > > > > > > 1154348maxresident)k >> > > > > > > 0inputs+0outputs (0major+3624361minor)pagefaults 0swaps >> > > > > > > >> > > > > > > >> > > > > > > MKL OFF: >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i >> MKL >> > > > > > > cmake_options.yml >> > > > > > > USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found >> > > > > > > USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL >> found) IF >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE) >> > > > > > > USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) >> IF >> > > > > > > USE_MKL_IF_AVAILABLE AND (NOT APPLE) >> > > > > > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd >> > > > > > > build/libmxnet.so |grep -i omp >> > > > > > > libomp.so => >> > > > > > > >> > > /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so >> > > > > > > (0x00007fb720c54000) >> > > > > > > >> > > > > > > INFO:root:Epoch[18] Validation-accuracy=0.983479 >> > > > > > > INFO:root:Epoch[19] Batch [0-100] Speed: 46784.02 >> samples/sec >> > > > > > > accuracy=1.000000 >> > > > > > > INFO:root:Epoch[19] Batch [100-200] Speed: 48824.29 >> samples/sec >> > > > > > > accuracy=0.999687 >> > > > > > > INFO:root:Epoch[19] Batch [200-300] Speed: 49190.31 >> samples/sec >> > > > > > > accuracy=0.999687 >> > > > > > > INFO:root:Epoch[19] Batch [300-400] Speed: 51518.77 >> samples/sec >> > > > > > > accuracy=0.999844 >> > > > > > > INFO:root:Epoch[19] Batch [400-500] Speed: 51551.62 >> samples/sec >> > > > > > > accuracy=0.999844 >> > > > > > > INFO:root:Epoch[19] Batch [500-600] Speed: 49026.35 >> samples/sec >> > > > > > > accuracy=0.999844 >> > > > > > > INFO:root:Epoch[19] Batch [600-700] Speed: 49002.46 >> samples/sec >> > > > > > > accuracy=0.999375 >> > > > > > > INFO:root:Epoch[19] Batch [700-800] Speed: 48980.55 >> samples/sec >> > > > > > > accuracy=0.999687 >> > > > > > > INFO:root:Epoch[19] Batch [800-900] Speed: 47402.56 >> samples/sec >> > > > > > > accuracy=0.999844 >> > > > > > > INFO:root:Epoch[19] Train-accuracy=0.999767 >> > > > > > > INFO:root:Epoch[19] Time cost=1.259 >> > > > > > > INFO:root:Epoch[19] Validation-accuracy=0.983181 >> > > > > > > 755.36user 754.94system 0:35.89elapsed 4207%CPU >> (0avgtext+0avgdata >> > > > > > > 1147008maxresident)k >> > > > > > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps >> > > > > > > >> > > > > > > Let me know what you think. >> > > > > > > >> > > > > > > Link to the original PR: >> > > > > > > https://github.com/apache/incubator-mxnet/pull/12160 >> > > > > > > >> > > > > > > Thanks. >> > > > > > > >> > > > > > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland >> > > > > > > <kellen.sunderl...@gmail.com> wrote: >> > > > > > > > >> > > > > > > > "if you’re linking in two then you’re doing something >> wrong." >> > > > > Correct, >> > > > > > > > that's one thing I believe we've got consensus on. So >> let's call >> > > > > that >> > > > > > > out >> > > > > > > > as a bug to be fixed. >> > > > > > > > >> > > > > > > > Let's move forward with some reproducible numbers and then >> > > discuss >> > > > > the >> > > > > > > pros >> > > > > > > > / cons of which particular OMP implementation we should use. 
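One possible starting point for such reproducible numbers (an illustrative sketch, not an agreed benchmark): time many small parallel regions with the same binary under each runtime, since frequent fork/join is exactly where OMP implementations differ most.

#include <omp.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  const int iters = 1000;
  std::vector<double> v(1 << 20, 1.0);
  double sink = 0.0;
  auto t0 = std::chrono::steady_clock::now();
  for (int it = 0; it < iters; ++it) {
    double s = 0.0;
    // Entering a fresh parallel region each iteration measures runtime
    // startup/teardown cost, not just raw loop throughput.
    #pragma omp parallel for reduction(+ : s)
    for (int i = 0; i < (int)v.size(); ++i) s += v[i];
    sink += s;
  }
  auto t1 = std::chrono::steady_clock::now();
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
  std::printf("checksum=%f time=%lld ms\n", sink, (long long)ms);
  return 0;
}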
>> > > > > > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy < >> > > > > > > pedro.larroy.li...@gmail.com> >> > > > > > > > wrote: >> > > > > > > > >> > > > > > > > > Hi Chris >> > > > > > > > > >> > > > > > > > > I would ask you to have a bit of patience and help us with your >> > > > > > > > > experience in this matter. Nobody is ignoring anything; I think we >> > > > > > > > > are individually gathering feedback and trying to understand the >> > > > > > > > > multiple contributions made to this topic, including yours, then >> > > > > > > > > going step by step, understanding what is going on, running >> > > > > > > > > experiments and reporting back to the list or the corresponding >> > > > > > > > > github item. It was suggested by Kellen to prepare some containers; >> > > > > > > > > this takes effort. >> > > > > > > > > >> > > > > > > > > Regarding your final comment, most of us also have many other things >> > > > > > > > > to do and responsibilities, even if our daytime jobs might involve >> > > > > > > > > MXNet in some form or another. I think that's part of the privilege >> > > > > > > > > and responsibility of working closely with an open source project and >> > > > > > > > > the magic of collaboration across organizations. Let's all be patient >> > > > > > > > > and take some time to understand and reason about this topic, which >> > > > > > > > > is not simple. Since we decided to step back and gather more data, >> > > > > > > > > let's take time and do it properly. >> > > > > > > > > >> > > > > > > > > Personally I hope to find time to look again into this issue before >> > > > > > > > > the end of the week. >> > > > > > > > > >> > > > > > > > > Thanks. >> > > > > > > > > >> > > > > > > > > Pedro. >> > > > > > > > > >> > > > > > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier < >> > > > > cjolivie...@apache.org> >> > > > > > > > > wrote: >> > > > > > > > > > >> > > > > > > > > > if you're linking in two then you're doing something wrong. You >> > > > > > > > > > can see by my email yesterday that only one is linked in. This is >> > > > > > > > > > also the case with the mkl version built by the Makefile — only >> > > > > > > > > > the Intel OMP library is used (no libgomp). >> > > > > > > > > > >> > > > > > > > > > That being said, do you have clear evidence that using Intel OMP >> > > > > > > > > > is both problematic and the situation isn't fixable? The burden of >> > > > > > > > > > proof is on the ones requesting the change — it is not my >> > > > > > > > > > responsibility to justify the current state. There must be >> > > > > > > > > > something "terrible" and unfixable to justify a change. I have >> > > > > > > > > > seen no proof of this in all this time.
>> > > > > > > > > > On a side note, I mentioned a couple of things in my email >> > > > > > > > > > yesterday that still are not being responded to (they were also >> > > > > > > > > > ignored in the last incarnation of this "discussion" — I have >> > > > > > > > > > enough experience in this matter to assume "discussion" is a waste >> > > > > > > > > > of my time, seeing as I am not paid to "work on" mxnet like y'all >> > > > > > > > > > are). >> > > > > > > > > > >> > > > > > > > > > -C >> > > > > > > > > > >> > > > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland < >> > > > > > > > > > kellen.sunderl...@gmail.com> wrote: >> > > > > > > > > > >> > > > > > > > > > > I've also quite often seen two versions of OpenMP linked. I >> > > > > > > > > > > think we can all agree we probably want to avoid linking in two >> > > > > > > > > > > libraries that do effectively the same thing. >> > > > > > > > > > > >> > > > > > > > > > > The performance questions should be fairly straightforward to >> > > > > > > > > > > demonstrate, right? Could we just collaborate on a few minimal >> > > > > > > > > > > Dockerfiles that show (or don't show) Intel OpenMP performance >> > > > > > > > > > > speedups with the workloads Chris is referencing? >> > > > > > > > > > > >> > > > > > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav < >> > > > > > > > > > > stanislav.tsuk...@gmail.com> wrote: >> > > > > > > > > > > >> > > > > > > > > > > > Hi, Chris! >> > > > > > > > > > > > >> > > > > > > > > > > > Stas here - I've gathered that performance data. >> > > > > > > > > > > > Sure thing, I can be wrong, but please elaborate a bit on what >> > > > > > > > > > > > we are missing. >> > > > > > > > > > > > Be assured, intentional misdirection was never the case. >> > > > > > > > > > > > >> > > > > > > > > > > > Thanks a lot for being constructive. >> > > > > > > > > > > > >> > > > > > > > > > > > > Turning Intel OMP on and off (and MKL as well, since it tends >> > > > > > > > > > > > > to pull in omp, depending which one is linked in). >> > > > > > > > > > > > >> > > > > > > > > > > > We never ever considered turning MKL off. We are on the same >> > > > > > > > > > > > page here - MKL is crucial for performance. >> > > > > > > > > > > > Why should we? There's a GOMP-linked version of MKL that we can >> > > > > > > > > > > > use. >> > > > > > > > > > > > >> > > > > > > > > > > > What we did was measure whether using the compiler's default >> > > > > > > > > > > > OpenMP implementation instead of the referenced source-code >> > > > > > > > > > > > distribution of OpenMP makes anything slower. >> > > > > > > > > > > > We have found the impact to be hardly measurable. >> > > > > > > > > > > > The difference between GOMP and iOMP is <5% on our benchmarks, >> > > > > > > > > > > > most of the time less than that.
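For completeness, a sketch of how the GOMP-linked MKL flavor mentioned above can be selected, assuming the single-dynamic-library interface (linking against mkl_rt); the same switch should also be available without code through the MKL_THREADING_LAYER environment variable:

#include <mkl.h>
#include <cstdio>

int main() {
  // With mkl_rt, the threading layer is chosen at runtime; MKL_THREADING_GNU
  // makes MKL drive libgomp instead of pulling in libiomp5.
  mkl_set_threading_layer(MKL_THREADING_GNU);
  std::printf("MKL max threads: %d\n", mkl_get_max_threads());
  return 0;
}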
>> > > > > > > > > > > > We just suggest simplifying the build of mxnet by removing the >> > > > > > > > > > > > unnecessary dependency. >> > > > > > > > > > > > >> > > > > > > > > > > > During that we discovered, for example, the following amazing >> > > > > > > > > > > > issue: https://github.com/apache/incubator-mxnet/issues/14087 >> > > > > > > > > > > > >> > > > > > > > > > > > Best Regards >> > > > > > > > > > > > >> > > > > > > > > > > > Stas >> > > > > > > > > > > > >> > > > > > > > > > > > On 18.06.19, 18:24, "Chris Olivier" <cjolivie...@gmail.com> >> > > > > wrote: >> > > > > > > > > > > > >> > > > > > > > > > > > I am very reluctant to feed the trolls again, and this will be >> > > > > > > > > > > > the last time I address Pedro or Anton on the subject, but since >> > > > > > > > > > > > I think the numbers being presented are incorrect (either by the >> > > > > > > > > > > > builders not really understanding what they are building, or by >> > > > > > > > > > > > possibly intentional misdirection): >> > > > > > > > > > > > >> > > > > > > > > > > > Turning Intel OMP on and off (and MKL as well, since it tends to >> > > > > > > > > > > > pull in omp, depending which one is linked in). >> > > > > > > > > > > > There is a HUGE difference. This is consistent with my experience >> > > > > > > > > > > > before, when it was added. >> > > > > > > > > > > > >> > > > > > > > > > > > default mnist: >> > > > > > > > > > > > >> > > > > > > > > > > > python ../example/image-classification/train_mnist.py >> > > > > > > > > > > > INFO:root:start with arguments Namespace(add_stn=False, >> > > > > > > > > > > batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, >> > > > > > > > > > > gc_type='none', gpus=None, image_shape='1, 28, 28', >> > > > > > > > > > > initializer='default', kv_store='device', load_epoch=None, loss='', >> > > > > > > > > > > lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, >> > > > > > > > > > > model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, >> > > > > > > > > > > num_epochs=20, num_examples=60000, num_layers=None, optimizer='sgd', >> > > > > > > > > > > profile_server_suffix='', profile_worker_suffix='', save_period=1, >> > > > > > > > > > > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', >> > > > > > > > > > > wd=0.0001) >> > > > > > > > > > > > >> > > > > > > > > > > > INTEL OMP: >> > > > > > > > > > > > >> > > > > > > > > > > > ldd libmxnet.so | grep omp >> > > > > > > > > > > > libomp.so => >> > > > > > > > > > > > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so >> > > > > > > > > > > > (0x00007f978fde7000) >> > > > > > > > > > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [0-100] Speed: 31548.09 samples/sec >> > > > > > > > > > > > accuracy=0.780012 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21
samples/sec >> > > > > > > > > > > > accuracy=0.920469 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec >> > > > > > > > > > > > accuracy=0.928281 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec >> > > > > > > > > > > > accuracy=0.942813 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec >> > > > > > > > > > > > accuracy=0.938750 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec >> > > > > > > > > > > > accuracy=0.946562 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec >> > > > > > > > > > > > accuracy=0.953281 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec >> > > > > > > > > > > > accuracy=0.951562 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [800-900] Speed: 13959.88 samples/sec >> > > > > > > > > > > > accuracy=0.957500 >> > > > > > > > > > > > INFO:root:Epoch[0] Train-accuracy=0.925423 >> > > > > > > > > > > > INFO:root:Epoch[0] Time cost=3.806 >> > > > > > > > > > > > INFO:root:Epoch[0] Validation-accuracy=0.962580 >> > > > > > > > > > > > INFO:root:Epoch[1] Batch [0-100] Speed: 24560.21 samples/sec >> > > > > > > > > > > > accuracy=0.968131 >> > > > > > > > > > > > INFO:root:Epoch[1] Batch [100-200] Speed: 23457.03 samples/sec >> > > > > > > > > > > > accuracy=0.966250 >> > > > > > > > > > > > >> > > > > > > > > > > > LIBGOMP: >> > > > > > > > > > > > >> > > > > > > > > > > > ldd libmxnet.so | grep omp >> > > > > > > > > > > > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 >> > > > > > > > > > > > (0x00007f25c25dd000) >> > > > > > > > > > > > >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [0-100] Speed: 1731.01 samples/sec >> > > > > > > > > > > > accuracy=0.782488 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [100-200] Speed: 3551.32 samples/sec >> > > > > > > > > > > > accuracy=0.907813 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [200-300] Speed: 1991.00 samples/sec >> > > > > > > > > > > > accuracy=0.927188 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [300-400] Speed: 2175.45 samples/sec >> > > > > > > > > > > > accuracy=0.937969 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [400-500] Speed: 1644.95 samples/sec >> > > > > > > > > > > > accuracy=0.942187 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [500-600] Speed: 6444.58 samples/sec >> > > > > > > > > > > > accuracy=0.950156 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [600-700] Speed: 7842.16 samples/sec >> > > > > > > > > > > > accuracy=0.947969 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [700-800] Speed: 9412.07 samples/sec >> > > > > > > > > > > > accuracy=0.953750 >> > > > > > > > > > > > INFO:root:Epoch[0] Batch [800-900] Speed: 12707.58 samples/sec >> > > > > > > > > > > > accuracy=0.953125 >> > > > > > > > > > > > >> > > > > > > > > > > > That being said, there are other issues beyond speed.
>> > > > > > > > > > > > The DEFAULT build from the Makefile (not CMake) uses the Intel >> > > > > > > > > > > > OMP mkl (I showed before) and mysteriously it has no issues? This >> > > > > > > > > > > > seems highly suspicious. All I see is a lot of hand-waving and >> > > > > > > > > > > > conjecture and pointing to StackOverflow posts made by people who >> > > > > > > > > > > > may be of questionable pedigree to begin with. This smells of a >> > > > > > > > > > > > Pedro-ego-fight rather than one of purely technical merit. Also, >> > > > > > > > > > > > if one knows how OMP works, they would be very suspicious of the >> > > > > > > > > > > > "intermittent hangs" claim -- that's probably just broken race >> > > > > > > > > > > > conditions elsewhere until proven differently. It'd tend to >> > > > > > > > > > > > freeze on the first use if something is wrong (try using libgomp >> > > > > > > > > > > > after a fork and see), since worker threads wouldn't be >> > > > > > > > > > > > assigned/joined properly. Intel OMP is faster, but also has other >> > > > > > > > > > > > advantages, such as allowing OMP after a fork. >> > > > > > > > > > > > >> > > > > > > > > > > > I actually addressed a lot of issues and asked for clarification >> > > > > > > > > > > > in the original PRs way back when, but they were all just >> > > > > > > > > > > > ignored. >> > > > > > > > > > > > >> > > > > > > > > > > > -Chris
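For anyone who wants to run the fork experiment suggested above, a minimal repro sketch: warm up the OMP thread pool, fork, then enter a parallel region in the child. Linked against libgomp the child tends to deadlock, since it inherits the pool bookkeeping but none of the worker threads; Intel/LLVM OMP re-creates its pool after fork.

#include <omp.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

static void parallel_work(const char* who) {
  int n = 0;
  #pragma omp parallel
  {
    #pragma omp atomic
    n++;
  }
  std::printf("%s ran with %d threads\n", who, n);
}

int main() {
  parallel_work("parent (pre-fork)");  // forces the runtime to create its pool
  pid_t pid = fork();
  if (pid == 0) {
    // With libgomp this is where the child typically hangs.
    parallel_work("child");
    _exit(0);
  }
  waitpid(pid, nullptr, 0);
  return 0;
}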