Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Lai Wei
Hi Anirudh,

Thanks for jumping on this quickly; I have followed up on the issue.

It was meant for the sockeye developers/maintainers, to help set up nightly tests
and raise issues early.

Thanks!


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Haibin Lin
In GluonNLP we test against the MXNet nightly build for each PR, and the CI
did catch some MXNet-related issues.
I recommend that other toolkits also add integration tests against the MXNet
nightly build. It helps identify issues early.

Best,
Haibin



RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Zhao, Patric
Thanks for raising the issue; we will take a look ASAP.

The downstream cases are not in the MXNet CI, so it's hard for MXNet developers
to catch potential bugs or performance degradation.

In the future, I suggest adding the major downstream test cases, such as those
from sockeye, GluonNLP, GluonCV, DGL, and Gluon-TS, to the nightly tests.
If that's still too heavy, maybe we could run them weekly or monthly :)

Thanks,

--Patric



Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Anirudh Subramanian
Hi Lai,

I have opened an issue:
https://github.com/apache/incubator-mxnet/issues/15297
I only learned about this issue today, and I have not been monitoring
sockeye.
I jumped on this issue to make sure it wasn't caused by the dlpack
changes.
Also, I don't think the sockeye CI checks against master; it is using 1.4.1.

Anirudh




Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Lai Wei
Hi,

Could you share which test failed and what the crash is? How can it be reproduced?

I was able to install sockeye, and all tests passed when running
python setup.py test

I have tested both the nightly pip package and 1.5.0.rc1.

It would be great to create an issue with reproducible steps and move the
discussion there.

Also, I see the sockeye nightly build [1] has been failing for some time. If it's
due to an MXNet change, please raise this early so we can track and solve it
in time, rather than blocking the release during the vote.

[1] https://travis-ci.org/awslabs/sockeye


-- 
Best Regards

Lai


Re: OMP

2019-06-20 Thread Marco de Abreu
As already proposed, I think the easiest way to get a common understanding
is to start with a few docker containers. Pedro, would it be possible
for you to wrap your benchmarks into a few containers that reproduce
the results you showed? That way, we can avoid possible misunderstandings and
also pinpoint the exact parts where people disagree or misunderstand each
other.

-Marco


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Anirudh Subramanian
I was able to reproduce a crash with the commit
09202f7f261954383aa387144524d38f83f18d06 but not with the commit
a862270beb2d796c1ba311183f7f4a766a18ad6c.

Anirudh



Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Lai Wei
Hi Przemyslaw,

Is there an issue with more details to track the problem?


-- 
Best Regards

Lai


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Przemysław Trędak
-1

There is a crash in the sockeye unit tests (python setup.py test), observed starting
with the nightly 1.5 build from 6/13 and still occurring in 1.5rc1. I don't yet have
the exact commit that is responsible for it, but it is either
a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).



Re: OMP

2019-06-20 Thread Pedro Larroy
I can confirm that we are linking with two versions of OMP. I'm
gaining more clarity on this topic, but I still have questions. The
facts that I have so far are the following:

* #1: We are linking with two versions of OMP, Intel's OMP and LLVM
OpenMP, when building with MKL enabled.
* #2: We have 3 different possible OMP versions: Intel OMP (comes with
MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (this
one is used in the PR proposed by Anton).

Questions:

 * #1 Is it ok to have two versions of openmp linked at the same time?
 * #2 Which implementation of OMP gives the best performance?  (See
total training time of my measurement for a partial answer)
 * #3 Should we have a build flag so we can choose the OMP version at runtime?
 * #4 Which compiler and build flags did Chris use to get the 10x slowdown?
 * #5 @Stas: is there a script to replicate your benchmarks easily? If
so, could you provide a link?  I think we would need to reproduce your
benchmarks and verify which versions are being linked. It's possible
that while compiling with MKL, Intel's OMP was pulled in instead of
GNU OpenMP.
 * #6 @Chris: how should we maintain the copy of LLVM's OpenMP? Should we
update the subrepo regularly?

My conclusions so far:

 * #1 We should avoid linking two versions of OMP if possible and
allow users to choose one in the build, as we do for BLAS.
 * #2 For performance reasons, and for more control across different compiler
versions, it does seem to make sense to keep the LLVM OpenMP version
in 3rdparty for now. Unless more data is gathered, it makes
sense not to remove it for now.
 * #3 We should provide build options to choose which OpenMP library
is to be used from the three options available, including libgomp.
 * #4 By refining the build, we could also enable OpenMP on Mac without
additional contortions (doesn't work as of today):
https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
 * #5 We should add the different OMP versions to our benchmarks and track
the performance, so this data is available for prescribing the best
build options and for binary releases.

This is also an interesting related gh issue posted in the mkl-dnn
repository:  https://github.com/intel/mkl-dnn/issues/230


On vanilla Ubuntu 18.04 I don't observe the order-of-magnitude divergence in
samples/s reported by Chris, but the full training does finish
faster with the OMP from 3rdparty (LLVM OpenMP) than with libgomp.

There are also differences in training time when using MKL and the ,
it's actually a bit slower; I don't know if it's related to OMP.

gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

Anton's branch:  g...@github.com:lebeg/incubator-mxnet.git   branch 'omp'
(py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd build/libmxnet.so |grep -i omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x7fd99a51d000)

time python train_mnist.py

INFO:root:Epoch[18] Validation-accuracy=0.984176
INFO:root:Epoch[19] Batch [0-100]   Speed: 41617.00 samples/sec accuracy=1.00
INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec accuracy=0.999531
INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec accuracy=1.00
INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 samples/sec accuracy=0.999375
INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 samples/sec accuracy=0.999844
INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 samples/sec accuracy=0.999531
INFO:root:Epoch[19] Train-accuracy=0.999717
INFO:root:Epoch[19] Time cost=1.219
INFO:root:Epoch[19] Validation-accuracy=0.983977
1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata 1146052maxresident)k
0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
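
To compare runs like the one above without eyeballing individual batches, the per-batch throughput can be averaged from the log. This is only a minimal sketch: the function name `mean_speed` and the regular expression are illustrative assumptions, not part of MXNet or of the benchmark script, and the sample text is just three lines copied from the libgomp run.

```python
import re
from statistics import mean

# Matches the throughput field of log lines such as
# "INFO:root:Epoch[19] Batch [0-100]   Speed: 41617.00 samples/sec"
SPEED_RE = re.compile(r"Speed:\s*([0-9]+(?:\.[0-9]+)?)\s*samples/sec")


def mean_speed(log_text: str) -> float:
    """Return the mean of all 'Speed: N samples/sec' values in log_text."""
    speeds = [float(m.group(1)) for m in SPEED_RE.finditer(log_text)]
    if not speeds:
        raise ValueError("no 'Speed: ... samples/sec' entries found")
    return mean(speeds)


# Three lines copied from the libgomp run (hypothetical excerpt for illustration).
sample_log = """\
INFO:root:Epoch[19] Batch [0-100]   Speed: 41617.00 samples/sec accuracy=1.00
INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec accuracy=0.999531
INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec accuracy=0.999687
"""

if __name__ == "__main__":
    print(f"mean throughput: {mean_speed(sample_log):.2f} samples/sec")
```

Per-batch samples/sec is noisy, so the elapsed wall-clock time reported by `time` remains the more reliable figure for the end-to-end comparison.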

Master, MKL ON:

(py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd ../../build/libmxnet.so | grep -i omp
libomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x7f05ba38f000)
libiomp5.so => /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so (0x7f05b09f4000)
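
The check done by hand with `ldd ... | grep -i omp` could be automated, for example in CI, to flag builds that pull in more than one OpenMP runtime. The following is a rough sketch under assumptions: the helper name `openmp_runtimes` and the name table are made up for illustration, and it only recognises the three runtimes mentioned in this thread by their library file names.

```python
import re

# Library base names of the three runtimes named in this thread (assumed mapping):
# libgomp (ships with gcc), libomp (LLVM, 3rdparty/openmp), libiomp5 (Intel, ships with MKL).
OMP_RUNTIMES = {
    "libgomp": "GNU libgomp",
    "libomp": "LLVM OpenMP",
    "libiomp5": "Intel OpenMP",
}

# Captures the base name before ".so" at the start of an ldd output line.
LIB_RE = re.compile(r"^\s*(lib[A-Za-z0-9_+-]+)\.so")


def openmp_runtimes(ldd_output: str) -> list:
    """Return the sorted names of the OpenMP runtimes listed in ldd output."""
    found = set()
    for line in ldd_output.splitlines():
        match = LIB_RE.match(line)
        if match and match.group(1) in OMP_RUNTIMES:
            found.add(OMP_RUNTIMES[match.group(1)])
    return sorted(found)


# The two listings from this message, copied as plain text.
omp_branch_ldd = "libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x7fd99a51d000)\n"
master_ldd = """\
libomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x7f05ba38f000)
libiomp5.so => /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so (0x7f05b09f4000)
"""

if __name__ == "__main__":
    for name, text in (("omp branch", omp_branch_ldd), ("master, MKL ON", master_ldd)):
        runtimes = openmp_runtimes(text)
        status = "OK" if len(runtimes) <= 1 else "multiple OpenMP runtimes linked"
        print(f"{name}: {runtimes} ({status})")
```

In practice the text would come from running `ldd` on the built `libmxnet.so`; the strings are hard-coded here so the sketch runs anywhere.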

INFO:root:Epoch[18] Validation-accuracy=0.982484
INFO:root:Epoch[19] Batch [0-100]   Speed: 36651.63 samples/sec accuracy=0.999691
INFO:root:Epoch[19] Batch [100-200] Speed: 45093.98 samples/sec accuracy=0.999844
INFO:root:Epoch[19] Batch [200-300] Speed: 45146.84 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400] Speed: 45119.90 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [400-500] Speed: 44998.96 samples/sec accuracy=0.999531
INFO:root:Epoch[19] Batch [500-600] Speed: 45072.25 samples/sec accuracy=0.999844
INFO:root:Epoch[19] Batch [600-700] Speed: 44969.79 samples/sec
 

[VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-20 Thread Lai Wei
Dear MXNet community,

This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0.
Voting on dev@ will start June 19, 23:59:59 (PST) and close on June 22,
23:59:59.

1) Link to release notes:
https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes


2) Link to release candidate:

https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1


3) Link to source and signatures on apache dist server:

https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/


Please remember to TEST first before voting accordingly:

+1 = approve
+0 = no opinion
-1 = disapprove (provide reason)
-- 
Best Regards

Lai