Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Chris Olivier
please answer the questions in my last email regarding the suspected issue
in mxnet as well as on that PR you opened.


On Sun, Dec 8, 2019 at 7:00 PM Lausen, Leonard 
wrote:

> The assertion failure in the MXNet DEBUG build goes away by updating LLVM
> OpenMP
> to the latest released version. All evidence I have points to the assertion
> failure being due to a bug in the 2 years old UNRELEASED version of LLVM
> OpenMP.
> that we are using currently in CMake builds.
>
> Thus I'm requesting 3 commiters to approve
> https://github.com/apache/incubator-mxnet/pull/17012 to update to a
> released
> version of LLVM OpenMP.
>
> As described in the PR, the assertion is still part of LLVM OpenMP 9.0
> codebase.
> In particular look at lines
>
> https://github.com/llvm-mirror/openmp/blob/release_90/runtime/src/kmp_runtime.cpp#L6616
> and
>
> https://github.com/llvm-mirror/openmp/blob/37c72127e90360a020f351f18d9cccfc30e5145a/runtime/src/kmp_runtime.cpp#L6481
> where the latter is the line that currently fails in MXNet DEBUG build and
> the
> former is the equivalent line that doesn't fail in MXNet DEBUG builds after
> updating LLVM OpenMP.
>
>
>
> There is also a crash with Intel OpenMP as well as both the old UNRELEASED
> and
> the new, released version LLVM OpenMP that happens after forking. That
> crash
> doesn't go away and needs to be root-caused
> https://github.com/apache/incubator-mxnet/issues/14979
>
>
> On Sun, 2019-12-08 at 16:27 -0800, Pedro Larroy wrote:
> > Hi Leonard.
> >
> > Are you saying that you have updated this library and the problems
> desribed
> > in the related tickets are no longer present?
> >
> > P.
> >
> > On Sunday, December 8, 2019, Lausen, Leonard 
> > wrote:
> > > Thanks Pedro and Chris for your responses.
> > >
> > > After further investigation I find:
> > >
> > > 1) I don't think
> https://github.com/apache/incubator-mxnet/issues/14979 is
> > > caused by any incompatibility between gomp and llvm / intel omp. Rather
> > it's
> > > simply a problem of llvm / intel omp. See my comment to the issue for
> the
> > > methodology to arrive at this claim.
> > >
> > > 2) Regarding the assertion failure when compiling with (llvm)
> > 3rdparty/openmp,
> > > it can be fixed by updating the by now 2 years old llvm openmp code to
> the
> > > newest released version. I went ahead and opened a PR
> > > https://github.com/apache/incubator-mxnet/pull/17012
> > >
> > > Based on the investigation described in 1), I think Chris is right that
> > the
> > > assertion failure is not due to some interaction between gomp and llvm
> > omp.
> > > However, I'm not sure about Chris's suggestion that the assertion fa
>
> > > ilure
> > is due
> > > to a bug in MXNet. In fact, the failure goes away when updating the
> llvm
> > openmp
> > > code. So I think it's just due to a bug in the 2 years old code.
> > >
> > > @Chris, I think updating 3rdparty/openmp to fix the assertion issue is
> not
> > > contentious. Thus let's do it via lazy consensus (72 hours) or just
> > approve the
> > > PR and merge it.
> > >
> > > Please also take a look at my comment at #14979 and let everyone know
> if
> > you see
> > > any option to fix the bug while keeping 3rdparty/openmp. As this bug
> > affects an
> > > important use-case, I beleive we need to remove 3rdparty/openmp from
> the
> > CMake
> > > build as long as we don't find a solution for making #14979 work with
> > > 3rdparty/openmp.
> > >
> > > In fact, removing 3rdparty/openmp will then match the current Makefile
> > setup
> > > that according to my understanding is used to build the nightly
> releases
> > used by
> > > the majority of developers. Ie. most users actually don't use the CMake
> > build
> > > with 3rdparty/openmp. You can consider rescinding your veto on removing
> > > 3rdparty/openmp after reading through the evidence in that issue. If
> you
> > don't
> > > provide any evidence for why the methodology/conclusion in #14979 is
> > flawed, I
> > > will assume your previous veto is void based on Apache Voting rule as
> it
> > lacks
> > > technical justification and in any case was motivated by the assertion
> > issue,
> > > which I agree with you, is likely not due to gomp / omp interaction.
> > >
> > > Thank you
> > > Leonard
> > >
> > >
> > > On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> > > > Stop disseminating false information:
> > > >
> > > > https://github.com/apache/incubator-mxnet/issues/14979
> > > >
> > > >
> > > > On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
> > wrote:
> > > > > -1
> > > > >
> > > > > mkldnn removed omp5 for licencing issues
> > > > > no bugs have actually been traced to the use of llvm openmp. only
> an
> > assert
> > > > > caused by an actual bug in mxnet code. there are suitable
> workarounds.
> > > > >
> > > > > over time llvm omp has simply been used as a “catch all” for random
> > > > > problems that aren’t related at all (such as getenv race condition
> in
> > an
> > > > > atfork call that isn’t even part of an omp 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Lausen, Leonard
Thanks Chris for the elaboration.

> What the assert in question is trying to say is that mxnet code is calling
> into omp library after a fork, but before the omp library’s atfork()
> handler is called, so the omp library has not yet initialized a new team if
> threads.  This looks to be the case in one of the call stacks on that
> issue. This is problematic for any openmp library which supports omp after
> a fork, and may not be deterministic from build to build, since the order
> of static init calls for a given module is undefined (i think mxnet is
> initializing omp during static init, but this may not matter).

I disagree. The assert fails without any forking. You can trigger it by running
`python3 -c 'import mxnet'` a few times.

As described in https://github.com/apache/incubator-mxnet/pull/17012 the 
previously failing assertion is still part of LLVM OpenMP 9.0 codebase. In 
particular you can compare line 
https://github.com/llvm-mirror/openmp/blob/release_90/runtime/src/kmp_runtime.cpp#L6616
to 
https://github.com/llvm-mirror/openmp/blob/37c72127e90360a020f351f18d9cccfc30e5145a/runtime/src/kmp_runtime.cpp#L6481
where the latter is the line that currently fails in MXNet.

I would like to reiterate, we are currently using a random and UNRELEASED LLVM
OpenMP version from 2017. There is no evidence that the assertion failure is due
to a bug in MXNet, but rather there is strong evidence that it is due to a bug
in the previously used LLVM version. (See the latter part of this email
regarding your suggestion about bug in MXNet fork handling).

> It stands to reason that linking this or that library may affect the assert
> occurring because it’s not known at what time one of the dependent
> libraries initializes omp (thus causing it to hook its atfork handler), so
> it is not surprising that mucking with dependencies may cause the assert to
> occur or not occur.

With respect to the Assert failure, there is no difference in "linking this or
that library". We are only updating the version of LLVM OpenMP codebase. See 
https://github.com/apache/incubator-mxnet/pull/17012

Thus your veto on updating the shipped LLVM OpenMP lacks technical justification
and is void. I'm requesting 3 commiters to approve the PR to go ahead with the
update.

> So if mxnet is doing that, it is a bug and remains a problem regardless of
> the omp library and probably should be fixed.  llvm omp happens to be nice
> enough to tell you you’re doing something wrong, at least when built in
> debug mode.

Yes, we need to investigate if there is a bug. Let's do this based on 
https://github.com/apache/incubator-mxnet/issues/14979 which is a "real" crash
that happens both with the old unreleased and the new 9.0 LLVM OpenMP versions.

> Once this issue is resolved, we can discuss the library inclusion itself.
> My objection is “fixing” what appears to be a bug by effectively
> “commenting out the assert” which is what i stated in the very beginning.

This is a reasonable approach, as long as you are willing to help fix it. As
made evident by your comments, you have more experience with advanced OpenMP
libraries. Thus let's work together on fixing the root cause, assuming it's
indeed a bug in MXNet.

If we run into any roadblock with finding a root cause in MXNet, the only
alternative I see is to remove the 3rdparty/openmp library, as we can not rule
out the possibility of an LLVM OpenMP bug either.

> Is there another explanation for the call stack with the assert?  Can this
> bug be ruled out?

Yes. The other explanation is that it's due to a bug in the old LLVM OpenMP
version (see above).
The stack trace you refer to is
https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-505162890
relies on mkl and is hard to reproduce.

With respect to establishing and fixing the bug in MXNet, let's focus on the
crash described in https://github.com/apache/incubator-mxnet/issues/14979 which
is happening with Intel OpenMP, old unreleased LLVM OpenMP and the current LLVM
OpenMP 9.0 release. It involves forking and thus seems quite related to the
first paragraph of your email compared to the assertion failure which happens
without any forking.

Best regards
Leonard

On Sun, 2019-12-08 at 07:58 -0800, Chris Olivier wrote:
> btw the call stack I am referring to below is the one where I explained
> this problem before and after I got a hostile response, I locked the issue.
> 
> On Sun, Dec 8, 2019 at 7:24 AM Chris Olivier  wrote:
> 
> > Again, here is what I suspect the bug is in mxnet:
> > 
> > The way that advanced openmp libraries handle a fork is that they hook an
> > atfork() callback in which, in the new process, it creates a new “team” of
> > threads to use for its thread pool (since all of the thread handles in its
> > data structure belong to the previous process). atfork() callback order is
> > the order at which the callbacks are registered, which will tend to be the
> > first call to the openmp library.  For this reason, the 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
Great investigation thank you. I have to agree with your analysis and for
helping resolving this long standing issue.

This will not repair the damage made to the community of losing 3-4
valuable contributors. Introducing a library that causes bugs then blocking
changes and locking gh issues which attempt to remove or workaround the
issues in addition to making rude comments and worse things that are better
left out is still not acceptable and begs for an apology from Chris.

P.




On Sunday, December 8, 2019, Lausen, Leonard 
wrote:
> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that
the
> assertion failure is not due to some interaction between gomp and llvm
omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
affects an
> important use-case, I beleive we need to remove 3rdparty/openmp from the
CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
setup
> that according to my understanding is used to build the nightly releases
used by
> the majority of developers. Ie. most users actually don't use the CMake
build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
don't
> provide any evidence for why the methodology/conclusion in #14979 is
flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
lacks
> technical justification and in any case was motivated by the assertion
issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> Stop disseminating false information:
>>
>> https://github.com/apache/incubator-mxnet/issues/14979
>>
>>
>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
wrote:
>>
>> > -1
>> >
>> > mkldnn removed omp5 for licencing issues
>> > no bugs have actually been traced to the use of llvm openmp. only an
assert
>> > caused by an actual bug in mxnet code. there are suitable workarounds.
>> >
>> > over time llvm omp has simply been used as a “catch all” for random
>> > problems that aren’t related at all (such as getenv race condition in
an
>> > atfork call that isn’t even part of an omp parallel region).
>> >
>> > proposal is now and has always been roughly equivalent to the idea of
>> > “comment out an assert rather than fix the bug it’s reporting”.
>> >
>> > Up until very recently, Makefile version of mxnet used libomp5 for
YEARS
>> > and not libgomp, with no issue reported (omp not built in debug mode),
so
>> > the equivalent configuration from CMake mysteriously causing myriads if
>> > problems has questionable merit and smells more like a hubris
situation.
>> >
>> > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> >
>> > if the assert problem is really a problem, the bug being reported
would be
>> > prioritized and fixed. it should be fixed regardless. all the time
spent by
>> > some CI people trying to remove this could have simply fixed the
actual bug
>> > in a small fraction of the time.
>> >
>> >
>> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard

>> > wrote:
>> >
>> > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > "authorative"
>> > > source about the issue of compilation with OpenMP and the OpenMP
runtime
>> > > library
>> > > related issues. Thus I suggest we follow the recommendation of Intel
>> > > MKLDNN team
>> > > within the MXNet project.
>> > >
>> > > Looking through the Intel MKLDNN documentation, I find [1]:
>> > >
>> > > > DNNL uses OpenMP runtime library provided by the compiler.
>> > >
>> > > as well as
>> > >
>> > > > it's important to ensure that only one OpenMP runtime is used
>> > throughout
>> > > the
>> > > > 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
Hi Leonard.

Are you saying that you have updated this library and the problems desribed
in the related tickets are no longer present?

P.

On Sunday, December 8, 2019, Lausen, Leonard 
wrote:
> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that
the
> assertion failure is not due to some interaction between gomp and llvm
omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
affects an
> important use-case, I beleive we need to remove 3rdparty/openmp from the
CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
setup
> that according to my understanding is used to build the nightly releases
used by
> the majority of developers. Ie. most users actually don't use the CMake
build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
don't
> provide any evidence for why the methodology/conclusion in #14979 is
flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
lacks
> technical justification and in any case was motivated by the assertion
issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> Stop disseminating false information:
>>
>> https://github.com/apache/incubator-mxnet/issues/14979
>>
>>
>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
wrote:
>>
>> > -1
>> >
>> > mkldnn removed omp5 for licencing issues
>> > no bugs have actually been traced to the use of llvm openmp. only an
assert
>> > caused by an actual bug in mxnet code. there are suitable workarounds.
>> >
>> > over time llvm omp has simply been used as a “catch all” for random
>> > problems that aren’t related at all (such as getenv race condition in
an
>> > atfork call that isn’t even part of an omp parallel region).
>> >
>> > proposal is now and has always been roughly equivalent to the idea of
>> > “comment out an assert rather than fix the bug it’s reporting”.
>> >
>> > Up until very recently, Makefile version of mxnet used libomp5 for
YEARS
>> > and not libgomp, with no issue reported (omp not built in debug mode),
so
>> > the equivalent configuration from CMake mysteriously causing myriads if
>> > problems has questionable merit and smells more like a hubris
situation.
>> >
>> > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> >
>> > if the assert problem is really a problem, the bug being reported
would be
>> > prioritized and fixed. it should be fixed regardless. all the time
spent by
>> > some CI people trying to remove this could have simply fixed the
actual bug
>> > in a small fraction of the time.
>> >
>> >
>> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard

>> > wrote:
>> >
>> > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > "authorative"
>> > > source about the issue of compilation with OpenMP and the OpenMP
runtime
>> > > library
>> > > related issues. Thus I suggest we follow the recommendation of Intel
>> > > MKLDNN team
>> > > within the MXNet project.
>> > >
>> > > Looking through the Intel MKLDNN documentation, I find [1]:
>> > >
>> > > > DNNL uses OpenMP runtime library provided by the compiler.
>> > >
>> > > as well as
>> > >
>> > > > it's important to ensure that only one OpenMP runtime is used
>> > throughout
>> > > the
>> > > > application. Having more than one OpenMP runtime linked to an
>> > executable
>> > > may
>> > > > lead to undefined behavior including incorrect results or crashes.
>> > >
>> > > To keep our project maintainable and error free, I thus suggest we
follow
>> > > DNNL
>> > > and use the OpenMP runtime library provided by the compiler.
>> > > We have 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
This is actually useful information, thanks.

Still I don't see a justificqtion for vetoing being able to choose the
library at compile time. Fixing the issue you reasonably describe and being
able to choose are two orthogonal topics.


Thanks for the constructive information.

On Sunday, December 8, 2019, Chris Olivier  wrote:
> Again, here is what I suspect the bug is in mxnet:
>
> The way that advanced openmp libraries handle a fork is that they hook an
> atfork() callback in which, in the new process, it creates a new “team” of
> threads to use for its thread pool (since all of the thread handles in its
> data structure belong to the previous process). atfork() callback order is
> the order at which the callbacks are registered, which will tend to be the
> first call to the openmp library.  For this reason, the fork order will
> vary depending upon what other libraries might be linked in and whether
> they make omp calls before mxnet starts its static init.
>
> What the assert in question is trying to say is that mxnet code is calling
> into omp library after a fork, but before the omp library’s atfork()
> handler is called, so the omp library has not yet initialized a new team
if
> threads.  This looks to be the case in one of the call stacks on that
> issue. This is problematic for any openmp library which supports omp after
> a fork, and may not be deterministic from build to build, since the order
> of static init calls for a given module is undefined (i think mxnet is
> initializing omp during static init, but this may not matter).
>
> So if mxnet is doing that, it is a bug and remains a problem regardless of
> the omp library and probably should be fixed.  llvm omp happens to be nice
> enough to tell you you’re doing something wrong, at least when built in
> debug mode.
>
> Once this issue is resolved, we can discuss the library inclusion itself.
> My objection is “fixing” what appears to be a bug by effectively
> “commenting out the assert” which is what i stated in the very beginning.
>
> It stands to reason that linking this or that library may affect the
assert
> occurring because it’s not known at what time one of the dependent
> libraries initializes omp (thus causing it to hook its atfork handler), so
> it is not surprising that mucking with dependencies may cause the assert
to
> occur or not occur.
>
> Is there another explanation for the call stack with the assert?  Can this
> bug be ruled out?
>
>
> Here is an example of the atfork team concept with libgomp as well.
> Probably you can check the current libgomp code itself but this explains
> the code:
> https://patchwork.ozlabs.org/patch/319827/
>
>
>
>
>
>
>
>
> On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard 
> wrote:
>
>> Thanks Pedro and Chris for your responses.
>>
>> After further investigation I find:
>>
>> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979
is
>> caused by any incompatibility between gomp and llvm / intel omp. Rather
>> it's
>> simply a problem of llvm / intel omp. See my comment to the issue for the
>> methodology to arrive at this claim.
>>
>> 2) Regarding the assertion failure when compiling with (llvm)
>> 3rdparty/openmp,
>> it can be fixed by updating the by now 2 years old llvm openmp code to
the
>> newest released version. I went ahead and opened a PR
>> https://github.com/apache/incubator-mxnet/pull/17012
>>
>> Based on the investigation described in 1), I think Chris is right that
the
>> assertion failure is not due to some interaction between gomp and llvm
omp.
>> However, I'm not sure about Chris's suggestion that the assertion failure
>> is due
>> to a bug in MXNet. In fact, the failure goes away when updating the llvm
>> openmp
>> code. So I think it's just due to a bug in the 2 years old code.
>>
>> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is
not
>> contentious. Thus let's do it via lazy consensus (72 hours) or just
>> approve the
>> PR and merge it.
>>
>> Please also take a look at my comment at #14979 and let everyone know if
>> you see
>> any option to fix the bug while keeping 3rdparty/openmp. As this bug
>> affects an
>> important use-case, I beleive we need to remove 3rdparty/openmp from the
>> CMake
>> build as long as we don't find a solution for making #14979 work with
>> 3rdparty/openmp.
>>
>> In fact, removing 3rdparty/openmp will then match the current Makefile
>> setup
>> that according to my understanding is used to build the nightly releases
>> used by
>> the majority of developers. Ie. most users actually don't use the CMake
>> build
>> with 3rdparty/openmp. You can consider rescinding your veto on removing
>> 3rdparty/openmp after reading through the evidence in that issue. If you
>> don't
>> provide any evidence for why the methodology/conclusion in #14979 is
>> flawed, I
>> will assume your previous veto is void based on Apache Voting rule as it
>> lacks
>> technical justification and in any case was motivated by the assertion
>> 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Chris Olivier
btw the call stack I am referring to below is the one where I explained
this problem before and after I got a hostile response, I locked the issue.

On Sun, Dec 8, 2019 at 7:24 AM Chris Olivier  wrote:

> Again, here is what I suspect the bug is in mxnet:
>
> The way that advanced openmp libraries handle a fork is that they hook an
> atfork() callback in which, in the new process, it creates a new “team” of
> threads to use for its thread pool (since all of the thread handles in its
> data structure belong to the previous process). atfork() callback order is
> the order at which the callbacks are registered, which will tend to be the
> first call to the openmp library.  For this reason, the fork order will
> vary depending upon what other libraries might be linked in and whether
> they make omp calls before mxnet starts its static init.
>
> What the assert in question is trying to say is that mxnet code is calling
> into omp library after a fork, but before the omp library’s atfork()
> handler is called, so the omp library has not yet initialized a new team if
> threads.  This looks to be the case in one of the call stacks on that
> issue. This is problematic for any openmp library which supports omp after
> a fork, and may not be deterministic from build to build, since the order
> of static init calls for a given module is undefined (i think mxnet is
> initializing omp during static init, but this may not matter).
>
> So if mxnet is doing that, it is a bug and remains a problem regardless of
> the omp library and probably should be fixed.  llvm omp happens to be nice
> enough to tell you you’re doing something wrong, at least when built in
> debug mode.
>
> Once this issue is resolved, we can discuss the library inclusion itself.
> My objection is “fixing” what appears to be a bug by effectively
> “commenting out the assert” which is what i stated in the very beginning.
>
> It stands to reason that linking this or that library may affect the
> assert occurring because it’s not known at what time one of the dependent
> libraries initializes omp (thus causing it to hook its atfork handler), so
> it is not surprising that mucking with dependencies may cause the assert to
> occur or not occur.
>
> Is there another explanation for the call stack with the assert?  Can this
> bug be ruled out?
>
>
> Here is an example of the atfork team concept with libgomp as well.
> Probably you can check the current libgomp code itself but this explains
> the code:
> https://patchwork.ozlabs.org/patch/319827/
>
>
>
>
>
>
>
>
> On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard 
> wrote:
>
>> Thanks Pedro and Chris for your responses.
>>
>> After further investigation I find:
>>
>> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979
>> is
>> caused by any incompatibility between gomp and llvm / intel omp. Rather
>> it's
>> simply a problem of llvm / intel omp. See my comment to the issue for the
>> methodology to arrive at this claim.
>>
>> 2) Regarding the assertion failure when compiling with (llvm)
>> 3rdparty/openmp,
>> it can be fixed by updating the by now 2 years old llvm openmp code to the
>> newest released version. I went ahead and opened a PR
>> https://github.com/apache/incubator-mxnet/pull/17012
>>
>> Based on the investigation described in 1), I think Chris is right that
>> the
>> assertion failure is not due to some interaction between gomp and llvm
>> omp.
>> However, I'm not sure about Chris's suggestion that the assertion failure
>> is due
>> to a bug in MXNet. In fact, the failure goes away when updating the llvm
>> openmp
>> code. So I think it's just due to a bug in the 2 years old code.
>>
>> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
>> contentious. Thus let's do it via lazy consensus (72 hours) or just
>> approve the
>> PR and merge it.
>>
>> Please also take a look at my comment at #14979 and let everyone know if
>> you see
>> any option to fix the bug while keeping 3rdparty/openmp. As this bug
>> affects an
>> important use-case, I beleive we need to remove 3rdparty/openmp from the
>> CMake
>> build as long as we don't find a solution for making #14979 work with
>> 3rdparty/openmp.
>>
>> In fact, removing 3rdparty/openmp will then match the current Makefile
>> setup
>> that according to my understanding is used to build the nightly releases
>> used by
>> the majority of developers. Ie. most users actually don't use the CMake
>> build
>> with 3rdparty/openmp. You can consider rescinding your veto on removing
>> 3rdparty/openmp after reading through the evidence in that issue. If you
>> don't
>> provide any evidence for why the methodology/conclusion in #14979 is
>> flawed, I
>> will assume your previous veto is void based on Apache Voting rule as it
>> lacks
>> technical justification and in any case was motivated by the assertion
>> issue,
>> which I agree with you, is likely not due to gomp / omp interaction.
>>
>> Thank you
>> Leonard
>>
>>
>> On 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Chris Olivier
Again, here is what I suspect the bug is in mxnet:

The way that advanced openmp libraries handle a fork is that they hook an
atfork() callback in which, in the new process, it creates a new “team” of
threads to use for its thread pool (since all of the thread handles in its
data structure belong to the previous process). atfork() callback order is
the order at which the callbacks are registered, which will tend to be the
first call to the openmp library.  For this reason, the fork order will
vary depending upon what other libraries might be linked in and whether
they make omp calls before mxnet starts its static init.

What the assert in question is trying to say is that mxnet code is calling
into omp library after a fork, but before the omp library’s atfork()
handler is called, so the omp library has not yet initialized a new team if
threads.  This looks to be the case in one of the call stacks on that
issue. This is problematic for any openmp library which supports omp after
a fork, and may not be deterministic from build to build, since the order
of static init calls for a given module is undefined (i think mxnet is
initializing omp during static init, but this may not matter).

So if mxnet is doing that, it is a bug and remains a problem regardless of
the omp library and probably should be fixed.  llvm omp happens to be nice
enough to tell you you’re doing something wrong, at least when built in
debug mode.

Once this issue is resolved, we can discuss the library inclusion itself.
My objection is “fixing” what appears to be a bug by effectively
“commenting out the assert” which is what i stated in the very beginning.

It stands to reason that linking this or that library may affect the assert
occurring because it’s not known at what time one of the dependent
libraries initializes omp (thus causing it to hook its atfork handler), so
it is not surprising that mucking with dependencies may cause the assert to
occur or not occur.

Is there another explanation for the call stack with the assert?  Can this
bug be ruled out?


Here is an example of the atfork team concept with libgomp as well.
Probably you can check the current libgomp code itself but this explains
the code:
https://patchwork.ozlabs.org/patch/319827/








On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard 
wrote:

> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
> it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
> 3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that the
> assertion failure is not due to some interaction between gomp and llvm omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
> is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
> openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
> approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
> you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
> affects an
> important use-case, I beleive we need to remove 3rdparty/openmp from the
> CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
> setup
> that according to my understanding is used to build the nightly releases
> used by
> the majority of developers. Ie. most users actually don't use the CMake
> build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
> don't
> provide any evidence for why the methodology/conclusion in #14979 is
> flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
> lacks
> technical justification and in any case was motivated by the assertion
> issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> > Stop disseminating false information:
> >
> > https://github.com/apache/incubator-mxnet/issues/14979
> >
> >
> > On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
> wrote:
> >
> > > -1
> > >
> > > mkldnn removed omp5 for licencing issues
> > > no bugs have actually been traced to the use of llvm openmp. only an
> 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Lausen, Leonard
Thanks Pedro and Chris for your responses.

After further investigation I find:

1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
caused by any incompatibility between gomp and llvm / intel omp. Rather it's
simply a problem of llvm / intel omp. See my comment to the issue for the
methodology to arrive at this claim.

2) Regarding the assertion failure when compiling with (llvm) 3rdparty/openmp,
it can be fixed by updating the by now 2 years old llvm openmp code to the
newest released version. I went ahead and opened a PR 
https://github.com/apache/incubator-mxnet/pull/17012

Based on the investigation described in 1), I think Chris is right that the
assertion failure is not due to some interaction between gomp and llvm omp.
However, I'm not sure about Chris's suggestion that the assertion failure is due
to a bug in MXNet. In fact, the failure goes away when updating the llvm openmp
code. So I think it's just due to a bug in the 2 years old code.

@Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
contentious. Thus let's do it via lazy consensus (72 hours) or just approve the
PR and merge it.

Please also take a look at my comment at #14979 and let everyone know if you see
any option to fix the bug while keeping 3rdparty/openmp. As this bug affects an
important use-case, I beleive we need to remove 3rdparty/openmp from the CMake
build as long as we don't find a solution for making #14979 work with
3rdparty/openmp.

In fact, removing 3rdparty/openmp will then match the current Makefile setup
that according to my understanding is used to build the nightly releases used by
the majority of developers. Ie. most users actually don't use the CMake build
with 3rdparty/openmp. You can consider rescinding your veto on removing
3rdparty/openmp after reading through the evidence in that issue. If you don't
provide any evidence for why the methodology/conclusion in #14979 is flawed, I
will assume your previous veto is void based on Apache Voting rule as it lacks
technical justification and in any case was motivated by the assertion issue,
which I agree with you, is likely not due to gomp / omp interaction.

Thank you
Leonard


On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> Stop disseminating false information:
> 
> https://github.com/apache/incubator-mxnet/issues/14979
> 
> 
> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier  wrote:
> 
> > -1
> > 
> > mkldnn removed omp5 for licencing issues
> > no bugs have actually been traced to the use of llvm openmp. only an assert
> > caused by an actual bug in mxnet code. there are suitable workarounds.
> > 
> > over time llvm omp has simply been used as a “catch all” for random
> > problems that aren’t related at all (such as getenv race condition in an
> > atfork call that isn’t even part of an omp parallel region).
> > 
> > proposal is now and has always been roughly equivalent to the idea of
> > “comment out an assert rather than fix the bug it’s reporting”.
> > 
> > Up until very recently, Makefile version of mxnet used libomp5 for YEARS
> > and not libgomp, with no issue reported (omp not built in debug mode), so
> > the equivalent configuration from CMake mysteriously causing myriads if
> > problems has questionable merit and smells more like a hubris situation.
> > 
> > I use tensorflow as well and it links to libomp5 rather than libgomp.
> > 
> > if the assert problem is really a problem, the bug being reported would be
> > prioritized and fixed. it should be fixed regardless. all the time spent by
> > some CI people trying to remove this could have simply fixed the actual bug
> > in a small fraction of the time.
> > 
> > 
> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard 
> > wrote:
> > 
> > > I think it's reasonable to assume that the Intel MKLDNN team is an
> > > "authorative"
> > > source about the issue of compilation with OpenMP and the OpenMP runtime
> > > library
> > > related issues. Thus I suggest we follow the recommendation of Intel
> > > MKLDNN team
> > > within the MXNet project.
> > > 
> > > Looking through the Intel MKLDNN documentation, I find [1]:
> > > 
> > > > DNNL uses OpenMP runtime library provided by the compiler.
> > > 
> > > as well as
> > > 
> > > > it's important to ensure that only one OpenMP runtime is used
> > throughout
> > > the
> > > > application. Having more than one OpenMP runtime linked to an
> > executable
> > > may
> > > > lead to undefined behavior including incorrect results or crashes.
> > > 
> > > To keep our project maintainable and error free, I thus suggest we follow
> > > DNNL
> > > and use the OpenMP runtime library provided by the compiler.
> > > We have limited ressources and finding the root cause for any bugs
> > > resulting
> > > from linking multiple OpenMP libraries as currently done is, in my
> > > opinion. not
> > > a good use of time. We know it's due to undefined behavior and we know
> > > it's best
> > > practice to use OpenMP runtime library provided by 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-07 Thread Pedro Larroy
Stop disseminating false information:

https://github.com/apache/incubator-mxnet/issues/14979


On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier  wrote:

> -1
>
> mkldnn removed omp5 for licencing issues
> no bugs have actually been traced to the use of llvm openmp. only an assert
> caused by an actual bug in mxnet code. there are suitable workarounds.
>
> over time llvm omp has simply been used as a “catch all” for random
> problems that aren’t related at all (such as getenv race condition in an
> atfork call that isn’t even part of an omp parallel region).
>
> proposal is now and has always been roughly equivalent to the idea of
> “comment out an assert rather than fix the bug it’s reporting”.
>
> Up until very recently, Makefile version of mxnet used libomp5 for YEARS
> and not libgomp, with no issue reported (omp not built in debug mode), so
> the equivalent configuration from CMake mysteriously causing myriads if
> problems has questionable merit and smells more like a hubris situation.
>
> I use tensorflow as well and it links to libomp5 rather than libgomp.
>
> if the assert problem is really a problem, the bug being reported would be
> prioritized and fixed. it should be fixed regardless. all the time spent by
> some CI people trying to remove this could have simply fixed the actual bug
> in a small fraction of the time.
>
>
> On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard 
> wrote:
>
> > I think it's reasonable to assume that the Intel MKLDNN team is an
> > "authorative"
> > source about the issue of compilation with OpenMP and the OpenMP runtime
> > library
> > related issues. Thus I suggest we follow the recommendation of Intel
> > MKLDNN team
> > within the MXNet project.
> >
> > Looking through the Intel MKLDNN documentation, I find [1]:
> >
> > > DNNL uses OpenMP runtime library provided by the compiler.
> >
> > as well as
> >
> > > it's important to ensure that only one OpenMP runtime is used
> throughout
> > the
> > > application. Having more than one OpenMP runtime linked to an
> executable
> > may
> > > lead to undefined behavior including incorrect results or crashes.
> >
> > To keep our project maintainable and error free, I thus suggest we follow
> > DNNL
> > and use the OpenMP runtime library provided by the compiler.
> > We have limited ressources and finding the root cause for any bugs
> > resulting
> > from linking multiple OpenMP libraries as currently done is, in my
> > opinion. not
> > a good use of time. We know it's due to undefined behavior and we know
> > it's best
> > practice to use OpenMP runtime library provided by the compiler. So let's
> > just
> > do that.
> >
> > I think given that MKL-DNN has also adopted the "OpenMP runtime library
> > provided
> > by the compiler" approach, this issue is not contentious anymore and
> > qualifies
> > for lazy consensus.
> >
> > Thus if there is no objection within 72 hours (lazy consensus), let's
> drop
> > bundled LLVM OpenMP from master [2]. If we find any issues due to
> > droppeing the
> > bundled LLVM OpenMP, we can always add it back prior to the next release.
> >
> > Best regards
> > Leonard
> >
> > [1]:
> >
> >
> https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > (This is the updated reference from Anton's previous comment, based on
> the
> > changes in MKLDNN done in the meantime
> >
> https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> > )
> > [2]: Alike https://github.com/apache/incubator-mxnet/pull/12160
> >
> >
> > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > I will try to stay on the sidelines for now since previous
> conversations
> > > about OMP have not been productive here and I have spent way too much
> > time
> > > on this already, I'm not the first one giving up on trying to help with
> > > this topic.
> > >
> > > I would be glad if you guys can work together and find a solution. I
> will
> > > just put my understanding of the big picture hoping that it helps move
> it
> > > forward.
> > >
> > >
> > > Recently the intel omp library which seemed to have the best
> performance
> > of
> > > the 3 was removed from MKL.
> > >
> > > - There's 3 libraries in play, GNU Omp which is shipped with gcc
> (gomp),
> > > LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL, which is
> > > recently removed (iomp)
> > >
> > > - IOMP seems to have the best performance, there's stability issues
> > > producing crashes sometimes but the impact seems relatively small for
> > users
> > > and developers. In general seems linking with a different OMP version
> > that
> > > the one shipped with the compiler is known to cause stability issues
> but
> > > it's done anyway.
> > >
> > > - LLVM-OMP used when building with CMake, not used in the PIP releases
> or
> > > when building with Make. Has stability issues, hangs when running in
> > debug
> > > mode during test execution and produces tons of assertions in debug
> mode.
> > 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-07 Thread Lausen, Leonard
Chris, if you can fix this in a small fraction of a time, please go ahead and do
so. Could you clarify why you think Intel's statement is nonsense or not
applicable? "Because different OpenMP runtimes may not be binary-compatible,
it's important to ensure that only one OpenMP runtime is used throughout the
application."

Why do we need to build LLVM OpenMP with GCC? If we want LLVM OpenMP, why not
just use LLVM compiler? The LLVM OpenMP in MXNet repo has currently not been
updated since 2 years. If we rely on the compiler version, we don't have to
maintain OpenMP in our repo.

Thank you
Leonard

On Sat, 2019-12-07 at 07:03 -0800, Chris Olivier wrote:
> -1
> 
> mkldnn removed omp5 for licencing issues
> no bugs have actually been traced to the use of llvm openmp. only an assert
> caused by an actual bug in mxnet code. there are suitable workarounds.
> 
> over time llvm omp has simply been used as a “catch all” for random
> problems that aren’t related at all (such as getenv race condition in an
> atfork call that isn’t even part of an omp parallel region).
> 
> proposal is now and has always been roughly equivalent to the idea of
> “comment out an assert rather than fix the bug it’s reporting”.
> 
> Up until very recently, Makefile version of mxnet used libomp5 for YEARS
> and not libgomp, with no issue reported (omp not built in debug mode), so
> the equivalent configuration from CMake mysteriously causing myriads if
> problems has questionable merit and smells more like a hubris situation.
> 
> I use tensorflow as well and it links to libomp5 rather than libgomp.
> 
> if the assert problem is really a problem, the bug being reported would be
> prioritized and fixed. it should be fixed regardless. all the time spent by
> some CI people trying to remove this could have simply fixed the actual bug
> in a small fraction of the time.
> 
> 
> On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard 
> wrote:
> 
> > I think it's reasonable to assume that the Intel MKLDNN team is an
> > "authorative"
> > source about the issue of compilation with OpenMP and the OpenMP runtime
> > library
> > related issues. Thus I suggest we follow the recommendation of Intel
> > MKLDNN team
> > within the MXNet project.
> > 
> > Looking through the Intel MKLDNN documentation, I find [1]:
> > 
> > > DNNL uses OpenMP runtime library provided by the compiler.
> > 
> > as well as
> > 
> > > it's important to ensure that only one OpenMP runtime is used throughout
> > the
> > > application. Having more than one OpenMP runtime linked to an executable
> > may
> > > lead to undefined behavior including incorrect results or crashes.
> > 
> > To keep our project maintainable and error free, I thus suggest we follow
> > DNNL
> > and use the OpenMP runtime library provided by the compiler.
> > We have limited ressources and finding the root cause for any bugs
> > resulting
> > from linking multiple OpenMP libraries as currently done is, in my
> > opinion. not
> > a good use of time. We know it's due to undefined behavior and we know
> > it's best
> > practice to use OpenMP runtime library provided by the compiler. So let's
> > just
> > do that.
> > 
> > I think given that MKL-DNN has also adopted the "OpenMP runtime library
> > provided
> > by the compiler" approach, this issue is not contentious anymore and
> > qualifies
> > for lazy consensus.
> > 
> > Thus if there is no objection within 72 hours (lazy consensus), let's drop
> > bundled LLVM OpenMP from master [2]. If we find any issues due to
> > droppeing the
> > bundled LLVM OpenMP, we can always add it back prior to the next release.
> > 
> > Best regards
> > Leonard
> > 
> > [1]:
> > 
> > https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > (This is the updated reference from Anton's previous comment, based on the
> > changes in MKLDNN done in the meantime
> > https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> > )
> > [2]: Alike https://github.com/apache/incubator-mxnet/pull/12160
> > 
> > 
> > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > I will try to stay on the sidelines for now since previous conversations
> > > about OMP have not been productive here and I have spent way too much
> > time
> > > on this already, I'm not the first one giving up on trying to help with
> > > this topic.
> > > 
> > > I would be glad if you guys can work together and find a solution. I will
> > > just put my understanding of the big picture hoping that it helps move it
> > > forward.
> > > 
> > > 
> > > Recently the intel omp library which seemed to have the best performance
> > of
> > > the 3 was removed from MKL.
> > > 
> > > - There's 3 libraries in play, GNU Omp which is shipped with gcc (gomp),
> > > LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL, which is
> > > recently removed (iomp)
> > > 
> > > - IOMP seems to have the best performance, there's stability issues
> > > producing 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-07 Thread Chris Olivier
-1

mkldnn removed omp5 for licencing issues
no bugs have actually been traced to the use of llvm openmp. only an assert
caused by an actual bug in mxnet code. there are suitable workarounds.

over time llvm omp has simply been used as a “catch all” for random
problems that aren’t related at all (such as getenv race condition in an
atfork call that isn’t even part of an omp parallel region).

proposal is now and has always been roughly equivalent to the idea of
“comment out an assert rather than fix the bug it’s reporting”.

Up until very recently, Makefile version of mxnet used libomp5 for YEARS
and not libgomp, with no issue reported (omp not built in debug mode), so
the equivalent configuration from CMake mysteriously causing myriads if
problems has questionable merit and smells more like a hubris situation.

I use tensorflow as well and it links to libomp5 rather than libgomp.

if the assert problem is really a problem, the bug being reported would be
prioritized and fixed. it should be fixed regardless. all the time spent by
some CI people trying to remove this could have simply fixed the actual bug
in a small fraction of the time.


On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard 
wrote:

> I think it's reasonable to assume that the Intel MKLDNN team is an
> "authorative"
> source about the issue of compilation with OpenMP and the OpenMP runtime
> library
> related issues. Thus I suggest we follow the recommendation of Intel
> MKLDNN team
> within the MXNet project.
>
> Looking through the Intel MKLDNN documentation, I find [1]:
>
> > DNNL uses OpenMP runtime library provided by the compiler.
>
> as well as
>
> > it's important to ensure that only one OpenMP runtime is used throughout
> the
> > application. Having more than one OpenMP runtime linked to an executable
> may
> > lead to undefined behavior including incorrect results or crashes.
>
> To keep our project maintainable and error free, I thus suggest we follow
> DNNL
> and use the OpenMP runtime library provided by the compiler.
> We have limited ressources and finding the root cause for any bugs
> resulting
> from linking multiple OpenMP libraries as currently done is, in my
> opinion. not
> a good use of time. We know it's due to undefined behavior and we know
> it's best
> practice to use OpenMP runtime library provided by the compiler. So let's
> just
> do that.
>
> I think given that MKL-DNN has also adopted the "OpenMP runtime library
> provided
> by the compiler" approach, this issue is not contentious anymore and
> qualifies
> for lazy consensus.
>
> Thus if there is no objection within 72 hours (lazy consensus), let's drop
> bundled LLVM OpenMP from master [2]. If we find any issues due to
> droppeing the
> bundled LLVM OpenMP, we can always add it back prior to the next release.
>
> Best regards
> Leonard
>
> [1]:
>
> https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> (This is the updated reference from Anton's previous comment, based on the
> changes in MKLDNN done in the meantime
> https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> )
> [2]: Alike https://github.com/apache/incubator-mxnet/pull/12160
>
>
> On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > I will try to stay on the sidelines for now since previous conversations
> > about OMP have not been productive here and I have spent way too much
> time
> > on this already, I'm not the first one giving up on trying to help with
> > this topic.
> >
> > I would be glad if you guys can work together and find a solution. I will
> > just put my understanding of the big picture hoping that it helps move it
> > forward.
> >
> >
> > Recently the intel omp library which seemed to have the best performance
> of
> > the 3 was removed from MKL.
> >
> > - There's 3 libraries in play, GNU Omp which is shipped with gcc (gomp),
> > LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL, which is
> > recently removed (iomp)
> >
> > - IOMP seems to have the best performance, there's stability issues
> > producing crashes sometimes but the impact seems relatively small for
> users
> > and developers. In general seems linking with a different OMP version
> that
> > the one shipped with the compiler is known to cause stability issues but
> > it's done anyway.
> >
> > - LLVM-OMP used when building with CMake, not used in the PIP releases or
> > when building with Make. Has stability issues, hangs when running in
> debug
> > mode during test execution and produces tons of assertions in debug mode.
> > Might have some small performance gains but there is no clear cut data
> that
> > showcases significant performance gains.
> >
> > - GOMP is the version shipped with GCC and the PIP wheels without MKL,
> has
> > no stability problems.
> >
> > As a ballpark, IOMP might give 10% performance improvement in some cases.
> >
> > We need to document well how users should tune and configure MXNet when
> > 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-06 Thread Lausen, Leonard
I think it's reasonable to assume that the Intel MKLDNN team is an "authorative"
source about the issue of compilation with OpenMP and the OpenMP runtime library
related issues. Thus I suggest we follow the recommendation of Intel MKLDNN team
within the MXNet project.

Looking through the Intel MKLDNN documentation, I find [1]:

> DNNL uses OpenMP runtime library provided by the compiler.

as well as

> it's important to ensure that only one OpenMP runtime is used throughout the
> application. Having more than one OpenMP runtime linked to an executable may
> lead to undefined behavior including incorrect results or crashes.

To keep our project maintainable and error free, I thus suggest we follow DNNL
and use the OpenMP runtime library provided by the compiler.
We have limited ressources and finding the root cause for any bugs resulting
from linking multiple OpenMP libraries as currently done is, in my opinion. not
a good use of time. We know it's due to undefined behavior and we know it's best
practice to use OpenMP runtime library provided by the compiler. So let's just
do that.

I think given that MKL-DNN has also adopted the "OpenMP runtime library provided
by the compiler" approach, this issue is not contentious anymore and qualifies
for lazy consensus.

Thus if there is no objection within 72 hours (lazy consensus), let's drop
bundled LLVM OpenMP from master [2]. If we find any issues due to droppeing the
bundled LLVM OpenMP, we can always add it back prior to the next release.

Best regards
Leonard

[1]: 
https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
(This is the updated reference from Anton's previous comment, based on the
changes in MKLDNN done in the meantime 
https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066)
[2]: Alike https://github.com/apache/incubator-mxnet/pull/12160


On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> I will try to stay on the sidelines for now since previous conversations
> about OMP have not been productive here and I have spent way too much time
> on this already, I'm not the first one giving up on trying to help with
> this topic.
> 
> I would be glad if you guys can work together and find a solution. I will
> just put my understanding of the big picture hoping that it helps move it
> forward.
> 
> 
> Recently the intel omp library which seemed to have the best performance of
> the 3 was removed from MKL.
> 
> - There's 3 libraries in play, GNU Omp which is shipped with gcc (gomp),
> LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL, which is
> recently removed (iomp)
> 
> - IOMP seems to have the best performance, there's stability issues
> producing crashes sometimes but the impact seems relatively small for users
> and developers. In general seems linking with a different OMP version that
> the one shipped with the compiler is known to cause stability issues but
> it's done anyway.
> 
> - LLVM-OMP used when building with CMake, not used in the PIP releases or
> when building with Make. Has stability issues, hangs when running in debug
> mode during test execution and produces tons of assertions in debug mode.
> Might have some small performance gains but there is no clear cut data that
> showcases significant performance gains.
> 
> - GOMP is the version shipped with GCC and the PIP wheels without MKL, has
> no stability problems.
> 
> As a ballpark, IOMP might give 10% performance improvement in some cases.
> 
> We need to document well how users should tune and configure MXNet when
> using OMP.
> 
> As a developer, the safest bet is to use GOMP to be able to debug and
> develop without issues. As a user of CPU inference / training you want to
> run MKL so depends on how the Intel guys want to do things. My preference
> as an engineer is always stability > speed.
> 
> Related tickets:
> 
> https://github.com/apache/incubator-mxnet/issues/16891
> 
> https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
> 
> 
> https://github.com/apache/incubator-mxnet/issues/11417
> 
> https://github.com/apache/incubator-mxnet/issues/15690
> 
> 
> 
> On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard 
> wrote:
> 
> > Is this related to https://github.com/apache/incubator-mxnet/issues/10856?
> > 
> > I unlocked that Github issue based on the Apache Code of Conduct
> > https://www.apache.org/foundation/policies/conduct#specific-guidelines
> > 
> > 
> > On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd
> > > build/libmxnet.so| grep -i openmp
> > > libomp.so =>
> > > /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > > (0x7fde0991d000)
> > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python
> > > ~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec
> > > --batch-size 256 --dtype float32 --num-data-workers 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-06 Thread Pedro Larroy
I will try to stay on the sidelines for now since previous conversations
about OMP have not been productive here and I have spent way too much time
on this already, I'm not the first one giving up on trying to help with
this topic.

I would be glad if you guys can work together and find a solution. I will
just put my understanding of the big picture hoping that it helps move it
forward.


Recently the intel omp library which seemed to have the best performance of
the 3 was removed from MKL.

- There's 3 libraries in play, GNU Omp which is shipped with gcc (gomp),
LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL, which is
recently removed (iomp)

- IOMP seems to have the best performance, there's stability issues
producing crashes sometimes but the impact seems relatively small for users
and developers. In general seems linking with a different OMP version that
the one shipped with the compiler is known to cause stability issues but
it's done anyway.

- LLVM-OMP used when building with CMake, not used in the PIP releases or
when building with Make. Has stability issues, hangs when running in debug
mode during test execution and produces tons of assertions in debug mode.
Might have some small performance gains but there is no clear cut data that
showcases significant performance gains.

- GOMP is the version shipped with GCC and the PIP wheels without MKL, has
no stability problems.

As a ballpark, IOMP might give 10% performance improvement in some cases.

We need to document well how users should tune and configure MXNet when
using OMP.

As a developer, the safest bet is to use GOMP to be able to debug and
develop without issues. As a user of CPU inference / training you want to
run MKL so depends on how the Intel guys want to do things. My preference
as an engineer is always stability > speed.

Related tickets:

https://github.com/apache/incubator-mxnet/issues/16891

https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931


https://github.com/apache/incubator-mxnet/issues/11417

https://github.com/apache/incubator-mxnet/issues/15690



On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard 
wrote:

> Is this related to https://github.com/apache/incubator-mxnet/issues/10856?
>
> I unlocked that Github issue based on the Apache Code of Conduct
> https://www.apache.org/foundation/policies/conduct#specific-guidelines
>
>
> On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd
> > build/libmxnet.so| grep -i openmp
> > libomp.so =>
> > /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > (0x7fde0991d000)
> > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python
> > ~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec
> > --batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid
> > --model resnet50_v2 --use-pretrained --kvstore local --log-interval 1
> > --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > ~/data/val-passthrough.idx
> > INFO:root:Namespace(batch_norm=False, batch_size=256,
> > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32',
> > kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs',
> > lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
> > lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
> > num_epochs=3, num_gpus=0, num_workers=40,
> > rec_val='/home/piotr/data/val-passthrough.rec',
> > rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
> > save_frequency=0, top_k=0, use_pretrained=True, use_rec=True,
> use_se=False,
> > warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > [10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
> > /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
> > INFO:root:Batch [0]
> > INFO:root:Top 1 accuracy: 0
> > INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
> > INFO:root:Batch [1]
> > INFO:root:Top 1 accuracy: 0
> > INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
> > INFO:root:Batch [2]
> > INFO:root:Top 1 accuracy: 0
> > INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363
> >
> >
> >
> >
> >
> >
> >
> > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp
> (upstream_master)+$
> > git st
> > On branch upstream_master
> > Your branch is up to date with 'origin/upstream_master'.
> >
> > Changes not staged for commit:
> >   (use "git add/rm ..." to update what will be committed)
> >   (use "git checkout -- ..." to discard changes in working
> directory)
> >
> > deleted:3rdparty/openmp
> >
> > no changes added to commit (use "git add" and/or "git commit -a")
> > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp
> (upstream_master)+$
> > ldd build/libmxnet.so | grep -i omp
> > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > (0x7f941241c000)
> >
> > (py3_venv) 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-06 Thread Lausen, Leonard
Is this related to https://github.com/apache/incubator-mxnet/issues/10856?

I unlocked that Github issue based on the Apache Code of Conduct 
https://www.apache.org/foundation/policies/conduct#specific-guidelines


On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd
> build/libmxnet.so| grep -i openmp
> libomp.so =>
> /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> (0x7fde0991d000)
> (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python
> ~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec
> --batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid
> --model resnet50_v2 --use-pretrained --kvstore local --log-interval 1
> --rec-val ~/data/val-passthrough.rec --rec-val-idx
> ~/data/val-passthrough.idx
> INFO:root:Namespace(batch_norm=False, batch_size=256,
> data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32',
> kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs',
> lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
> lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
> num_epochs=3, num_gpus=0, num_workers=40,
> rec_val='/home/piotr/data/val-passthrough.rec',
> rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
> save_frequency=0, top_k=0, use_pretrained=True, use_rec=True, use_se=False,
> warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> [10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
> /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
> INFO:root:Batch [0]
> INFO:root:Top 1 accuracy: 0
> INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
> INFO:root:Batch [1]
> INFO:root:Top 1 accuracy: 0
> INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
> INFO:root:Batch [2]
> INFO:root:Top 1 accuracy: 0
> INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363
> 
> 
> 
> 
> 
> 
> 
> (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp (upstream_master)+$
> git st
> On branch upstream_master
> Your branch is up to date with 'origin/upstream_master'.
> 
> Changes not staged for commit:
>   (use "git add/rm ..." to update what will be committed)
>   (use "git checkout -- ..." to discard changes in working directory)
> 
> deleted:3rdparty/openmp
> 
> no changes added to commit (use "git add" and/or "git commit -a")
> (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp (upstream_master)+$
> ldd build/libmxnet.so | grep -i omp
> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> (0x7f941241c000)
> 
> (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp (upstream_master)+$
> python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> --use-rec --batch-size 256 --dtype float32 --num-data-workers 40 --mode
> hybrid --model resnet50_v2 --use-pretrained --kvstore local --log-interval
> 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
> ~/data/val-passthrough.idx
> INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
> INFO:root:Batch [16]
> INFO:root:Top 1 accuracy: 0
> INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
> INFO:root:Batch [17]
> INFO:root:Top 1 accuracy: 0
> INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033


Please remove conflicting Open MP version from CMake builds

2019-11-30 Thread Pedro Larroy
(py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd
build/libmxnet.so| grep -i openmp
libomp.so =>
/home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
(0x7fde0991d000)
(py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python
~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec
--batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid
--model resnet50_v2 --use-pretrained --kvstore local --log-interval 1
--rec-val ~/data/val-passthrough.rec --rec-val-idx
~/data/val-passthrough.idx
INFO:root:Namespace(batch_norm=False, batch_size=256,
data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32',
kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs',
lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
num_epochs=3, num_gpus=0, num_workers=40,
rec_val='/home/piotr/data/val-passthrough.rec',
rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
save_frequency=0, top_k=0, use_pretrained=True, use_rec=True, use_se=False,
warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
[10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
/home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
INFO:root:Batch [0]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
INFO:root:Batch [1]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
INFO:root:Batch [2]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363







(py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp (upstream_master)+$
git st
On branch upstream_master
Your branch is up to date with 'origin/upstream_master'.

Changes not staged for commit:
  (use "git add/rm ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

deleted:3rdparty/openmp

no changes added to commit (use "git add" and/or "git commit -a")
(py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp (upstream_master)+$
ldd build/libmxnet.so | grep -i omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x7f941241c000)

(py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp (upstream_master)+$
python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
--use-rec --batch-size 256 --dtype float32 --num-data-workers 40 --mode
hybrid --model resnet50_v2 --use-pretrained --kvstore local --log-interval
1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
~/data/val-passthrough.idx
INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
INFO:root:Batch [16]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
INFO:root:Batch [17]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033