Hi Chris,

Following up on the issue: have all the points in the discussion been resolved?
If yes, I kindly ask you to reopen this PR and remove the 'requesting changes' status: https://github.com/apache/incubator-mxnet/pull/12160

Thank you.

Best
Anton

Tue, 27 Nov 2018 at 17:15, Anton Chernov <mecher...@gmail.com>:

> Another thing to take into consideration:
>
> All Python artefacts that are created (PyPI) are built with make and are not using the bundled OpenMP library.
>
> One step for the switch to CMake to happen is the approval and merging of the mentioned PR:
>
> https://github.com/apache/incubator-mxnet/pull/12160
>
> If there are no other objections, I kindly ask Chris Olivier to remove his 'requesting changes' veto on it to unblock the CMake overhaul work.
>
> Thank you.
>
> Best
> Anton
>
> Thu, 22 Nov 2018 at 17:11, Anton Chernov <mecher...@gmail.com>:
>
>> Thank you for your answer, Chris.
>>
>> > The whole "mixing omp libraries" is something that occurs in production every day and certainly in everything that uses mkl.
>>
>> I'm afraid this statement is wrong. Intel MKL-DNN explicitly requires that this mixing does not happen:
>>
>> "Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP runtime library to work. As different OpenMP runtimes may not be binary compatible it's important to ensure that only one OpenMP runtime is used throughout the application. Having more than one OpenMP runtime initialized may lead to undefined behavior resulting in incorrect results or crashes." [1]
>>
>> That is why two different MKLML libraries are provided:
>>
>> lib/libmklml_gnu.so   | Intel MKL small library for GNU* OpenMP runtime
>> lib/libmklml_intel.so | Intel MKL small library for Intel(R) OpenMP runtime
>>
>> > is the suggestion that libiomp be removed from mkl?
>>
>> That is certainly not my suggestion.
>>
>> > have you spoken with intel? have you consulted Intel at all?
>>
>> Yes, I have asked for comments on the issue.
>>
>> > "hard to debug random crash". you're seeing an assertion which is probably ...
>> I'm seeing the result of undefined behaviour. And I want to put emphasis on the following statement:
>>
>> Regardless of whether there is a particular reason for the assert, it is the result of behaviour that should not happen. There are valid ways to use LLVM OpenMP in MXNet, and the current way is not one of them.
>>
>> > The lack of root-causing the problem and knee-jerk solution here makes me uncomfortable.
>>
>> I hope that my efforts highlighting the problems reach you and mitigate your discomfort.
>>
>> > if you want to see performance differences there's an environment variable you can set in the mxnet omp tuning code that will print overhead and execution times for the current omp library.
>>
>> I don't want to see performance differences in the current OpenMP library. I want to remove the current OpenMP library and use the one provided by the compiler.
>>
>> Best
>> Anton
>>
>> [1] https://github.com/intel/mkl-dnn/blame/master/README.md#L261-L265
>>
>> Thu, 22 Nov 2018 at 16:50, Chris Olivier <cjolivie...@apache.org>:
>>
>>> Do you not work on CI mostly? My apologies for thinking that was some sort of team effort between you and a few others who are passionate about keeping the CI system running smoothly.
>>>
>>> You have the source code, and you have the line the assertion is on. If you can't describe what's going wrong that causes the assertion, then I don't really have anything more to add to this conversation beyond what's below:
>>>
>>> The whole "mixing omp libraries" is something that occurs in production every day and certainly in everything that uses mkl. It may occasionally cause problems for some edge cases when there are super-complex linking strategies and dynamic loading. But this is not one of those edge cases.
>>> Mostly, blaming it is a red herring for other thread-related problems: people switch the omp library, the timing of their code changes, and they stop seeing the problem. I've spent my entire career doing heavily multithreaded C++ development and I've seen that a million times. Is the suggestion that libiomp be removed from mkl? Have you spoken with Intel? Have you consulted Intel at all?
>>>
>>> And what you are seeing isn't some "hard to debug random crash". You're seeing an assertion which is probably related to omp trying to create a thread pool after a fork, and something was done in the mxnet code that makes that sketchy to do. I'd suggest filing an issue with the llvm openmp project, just like you'd file one for any other not-well-understood behavior in mxnet.
>>>
>>> The lack of root-causing the problem and the knee-jerk solution here make me uncomfortable.
>>>
>>> If you want to see performance differences, there's an environment variable you can set in the mxnet omp tuning code that will print overhead and execution times for the current omp library.
>>>
>>> On Thu, Nov 22, 2018 at 7:12 AM Anton Chernov <mecher...@gmail.com> wrote:
>>>
>>> > Hi Chris,
>>> >
>>> > Thank you for your answer. As you may have noticed, the initial email comes from me, Anton Chernov (@lebeg on GitHub), and thus the proposal is not from any 'CI' team that you've mentioned, but from me personally.
>>> >
>>> > You are writing:
>>> >
>>> > > someone is doing something unhealthy when they fork ...
>>> >
>>> > I'm missing the context to understand what you mean.
>>> >
>>> > > we get a lot of performance gain from OMP ...
>>> >
>>> > There is no data that would prove this statement, and therefore it is a random guess.
>>> >
>>> > > in many months, no investigation has occurred as to WHY the assertion is failing.
>>> > The investigation has concluded that this is happening due to undefined behaviour, which is, in my opinion, a sufficient answer that does not require going any deeper.
>>> >
>>> > > the pr is vetoed until such a time that the actual root cause of the problem is known.
>>> >
>>> > And considering the statements above, there is no valid reason to veto the PR.
>>> >
>>> > Best
>>> > Anton
>>> >
>>> > Thu, 22 Nov 2018 at 15:38, Chris Olivier <cjolivie...@gmail.com>:
>>> >
>>> > > 3x less overhead*
>>> > >
>>> > > On Thu, Nov 22, 2018 at 6:25 AM Chris Olivier <cjolivie...@gmail.com> wrote:
>>> > >
>>> > > > Someone is doing something unhealthy when they fork, which is causing an assertion in the openmp library: the same assertion that would fire in mkl, which is linked to libiomp5 (the exact same omp library). This is new behavior and most likely due to an error or suboptimal approach in the forking logic in mxnet.
>>> > > >
>>> > > > In order to circumvent the assert, the CI team is proposing to remove the library completely, which is equivalent to cutting off your leg to make the pain from stubbing your toe go away.
>>> > > >
>>> > > > We get a lot of performance gain from OMP: it has about 1/3 less overhead for entering omp regions and also supports omp regions after a fork, which libgomp does not.
>>> > > >
>>> > > > In many months, no investigation has occurred as to WHY the assertion is failing.
>>> > > >
>>> > > > The PR is vetoed until such a time that the actual root cause of the problem is known.
>>> > > >
>>> > > > Thanks,
>>> > > >
>>> > > > -Chris.
>>> > > > On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov <mecher...@gmail.com> wrote:
>>> > > >
>>> > > >> Dear MXNet community,
>>> > > >>
>>> > > >> I would like to draw attention to an important issue that is present in the MXNet CMake build: usage of the bundled llvm OpenMP library.
>>> > > >>
>>> > > >> I have opened a PR to remove it:
>>> > > >> https://github.com/apache/incubator-mxnet/pull/12160
>>> > > >>
>>> > > >> The issue was closed, but I am firm in my opinion that it's the right thing to do.
>>> > > >>
>>> > > >> *Background*
>>> > > >>
>>> > > >> If you want to use OpenMP pragmas in your code for parallelization, you supply a special flag to the compiler:
>>> > > >>
>>> > > >> - Clang / -fopenmp
>>> > > >>   https://openmp.llvm.org/
>>> > > >>
>>> > > >> - GCC / -fopenmp
>>> > > >>   https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html
>>> > > >>
>>> > > >> - Intel / [Q]openmp
>>> > > >>   https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>>> > > >>
>>> > > >> - Visual Studio / /openmp (Enable OpenMP 2.0 Support)
>>> > > >>   https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
>>> > > >>
>>> > > >> Each of these compilers enables the '#pragma omp' directive during C/C++ compilation and arranges automatic linking of the OpenMP runtime library supplied by that compiler.
>>> > > >>
>>> > > >> Thus, to use the advantages of a particular OpenMP implementation, one has to compile the code with the corresponding compiler.
>>> > > >>
>>> > > >> Currently, in the MXNet CMake build scripts a bundled version of llvm OpenMP is used ([1] and [2]) to replace the OpenMP library supplied by the compiler.
>>> > > >> I will quote here the README from MKL-DNN (Intel(R) Math Kernel Library for Deep Neural Networks):
>>> > > >>
>>> > > >> "Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP runtime library to work. As different OpenMP runtimes may not be binary compatible it's important to ensure that only one OpenMP runtime is used throughout the application. Having more than one OpenMP runtime initialized may lead to undefined behavior resulting in incorrect results or crashes." [3]
>>> > > >>
>>> > > >> And:
>>> > > >>
>>> > > >> "Using GNU compiler with -fopenmp and -liomp5 options will link the application with both Intel and GNU OpenMP runtime libraries. This will lead to undefined behavior of the application." [4]
>>> > > >>
>>> > > >> As can be seen from ldd, MXNet currently links two OpenMP runtimes:
>>> > > >>
>>> > > >> $ ldd build/tests/mxnet_unit_tests | grep omp
>>> > > >>     libomp.so => /.../mxnet/build/3rdparty/openmp/runtime/src/libomp.so (0x00007f697bc55000)
>>> > > >>     libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f69660cd000)
>>> > > >>
>>> > > >> *Performance*
>>> > > >>
>>> > > >> The only performance data related to OpenMP in MXNet that I was able to find is here:
>>> > > >>
>>> > > >> https://github.com/apache/incubator-mxnet/issues/9744#issuecomment-367711172
>>> > > >>
>>> > > >> which, in my understanding, is testing the impact of different environment variables on the same setup (using the same bundled OpenMP library).
>>> > > >>
>>> > > >> The libraries may differ in implementation, and the Thread Affinity Interface [5] may have a significant impact on performance.
>>> > > >> All compilers support it:
>>> > > >>
>>> > > >> - Clang / KMP_AFFINITY
>>> > > >>   https://github.com/clang-ykt/openmp/blob/master/runtime/src/kmp_affinity.cpp
>>> > > >>
>>> > > >> - GCC / GOMP_CPU_AFFINITY
>>> > > >>   https://gcc.gnu.org/onlinedocs/gcc-4.7.1/libgomp/GOMP_005fCPU_005fAFFINITY.html
>>> > > >>
>>> > > >> - Intel / KMP_AFFINITY
>>> > > >>   https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>>> > > >>
>>> > > >> - Visual Studio / SetThreadAffinityMask
>>> > > >>   https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-setthreadaffinitymask
>>> > > >>
>>> > > >> *Issues*
>>> > > >>
>>> > > >> Failed OpenMP assertion when loading MXNet compiled with DEBUG=1
>>> > > >> https://github.com/apache/incubator-mxnet/issues/10856
>>> > > >>
>>> > > >> libomp.so dependency (need REAL fix)
>>> > > >> https://github.com/apache/incubator-mxnet/issues/11417
>>> > > >>
>>> > > >> mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL
>>> > > >> https://github.com/apache/incubator-mxnet/issues/8532
>>> > > >>
>>> > > >> Performance regression when OMP_NUM_THREADS environment variable is not set
>>> > > >> https://github.com/apache/incubator-mxnet/issues/9744
>>> > > >>
>>> > > >> Poor concat CPU performance on CUDA builds
>>> > > >> https://github.com/apache/incubator-mxnet/issues/11905
>>> > > >>
>>> > > >> I would appreciate hearing your thoughts.
>>> > > >> Best
>>> > > >> Anton
>>> > > >>
>>> > > >> [1] https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L400-L405
>>> > > >> [2] https://github.com/apache/incubator-mxnet/tree/master/3rdparty
>>> > > >> [3] https://github.com/intel/mkl-dnn/blame/master/README.md#L261-L265
>>> > > >> [4] https://github.com/intel/mkl-dnn/blame/master/README.md#L278-L280
>>> > > >> [5] https://software.intel.com/en-us/node/522691