Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-08 Thread Marco de Abreu
Sorry for the vague phrasing; it is back to normal. This can be verified at
[1]. I agree with Kellen; we will actively be working with the maintainers
of dockcross to ensure their repository is brought back to a stable state
that also provides proper tagging.

+1 from my side now.

[1]:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-10850/runs/1/nodes/67/steps/329/log/?start=0

On Tue, May 8, 2018 at 4:42 PM, kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Thanks Marco for the work-arounds and for getting this fixed in CI. I
> personally don't see this as a release blocker, as it targets a still
> experimental feature (Jetson pip wheels). I also have a pretty high level
> of confidence that we can fix this by working with the dockcross org. This
> would mean that this release cut would still work in the future for users
> who are interested in building the 1.2 release for their Jetson devices.

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-08 Thread Steffen Rochel
Should it be back, or is it back to normal? Would you please verify and
update your vote on dev@ accordingly?
Currently you are on record as -1. Just trying to help Anirudh to get a
proper vote count.

Thanks
Steffen (MXNet contributor hat on)

On Tue, May 8, 2018 at 6:37 AM Marco de Abreu 
wrote:

> Yes, sorry for the inconvenience! We fixed the root cause and everything
> should be back to normal.
>
> -Marco

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-08 Thread Marco de Abreu
Yes, sorry for the inconvenience! We fixed the root cause and everything
should be back to normal.

-Marco

Steffen Rochel  wrote on Tue., May 8, 2018, 14:59:

> Marco - thanks for your efforts. Does this unblock the Apache MXNet v1.2
> release and change your vote?

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-08 Thread Steffen Rochel
Marco - thanks for your efforts. Does this unblock the Apache MXNet v1.2
release and change your vote?

On Tue, May 8, 2018 at 3:00 AM Marco de Abreu 
wrote:

> Small update regarding the ARM64 builds. I have created two pull requests
> [1][2] which changed the repository to a mirror I created. This mirror was
> created using a cached version of the working Docker image, effectively
> reverting the state back to a working one. At the same time, this pins the
> container to prevent any further problems.
>
> I would prefer to use the public repository instead of our own mirror, but
> for now, this is inevitable. If anybody would like to be added to the
> Docker Hub organization "mxnetci", feel free to let me know! To prevent
> problems like these in the future, I created a feature request at [3] to
> ensure future releases of that Docker image are properly tagged. Additionally, the
> creator of the failing PR is aware and actively involved in creating a
> permanent solution [4].
>
> Best regards,
> Marco
>
> [1]: https://github.com/apache/incubator-mxnet/pull/10850
> [2]: https://github.com/apache/incubator-mxnet/pull/10849
> [3]: https://github.com/dockcross/dockcross/issues/223
> [4]: https://github.com/dockcross/dockcross/pull/221

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-07 Thread Lai Wei
Hi Anirudh,

Update: I did an install on a fresh instance with USE_MKLDNN=1, and it works
fine now. Pip install with --pre is also working fine.
The problem was the mkl-dnn I installed on the old instance.
Closing the issue <https://github.com/awslabs/keras-apache-mxnet/issues/75>.

Thanks!

Best Regards

Lai Wei

https://www.linkedin.com/pub/lai-wei/2b/731/52b

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-07 Thread Lai Wei
Hi Anirudh,

Yes, I also tried that; it didn't resolve the issue. Looking into the root
cause and will update.

Best Regards

Lai Wei

https://www.linkedin.com/pub/lai-wei/2b/731/52b

On Mon, May 7, 2018 at 2:15 PM, Anirudh  wrote:

> Hi Lai,
>
> I see that you used USE_MKL2017_EXPERIMENTAL=1; I am not sure if this is
> the right flag. Did you try USE_MKLDNN=1?
>
> Anirudh

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-07 Thread Anirudh
Hi Lai,

I see that you used USE_MKL2017_EXPERIMENTAL=1; I am not sure if this is
the right flag. Did you try USE_MKLDNN=1?

Anirudh
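
(For reference, a minimal sketch of a build line with the suggested flag; the
other flags here are assumptions carried over from the repro commands quoted
later on this page:)

make -j $(nproc) USE_BLAS=openblas USE_MKLDNN=1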


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-07 Thread Anirudh
Hi Marco,

Thanks for raising this! Can you please elaborate on where the ARM
cross-compilation for Jetson is documented and what the current user impact
is? Can we document this workaround of using the Dockerfile from before the
changes in the ARM cross-compilation documentation? Did you happen to verify
that the release branch is also impacted by the dockcross change?

Anirudh

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-07 Thread Lai Wei
Hi,

I would like to raise an issue with mxnet-mkl. The keras-mxnet package was
working fine with mxnet-mkl 1.1.0 for training on CPU. However, weights are
not updated when I use mxnet-mkl 1.2.0b20180507. I tried both 'pip install
mxnet-mkl --pre' and building from source from the release branch (v1.2.0)
with the MKL flag.

Please refer to this issue for more details:
https://github.com/awslabs/keras-apache-mxnet/issues/75

There is no code change on the keras-mxnet side, so I guess some API broke
when using the latest mxnet-mkl. I am still working on finding the root cause.
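
(For anyone reproducing, a sketch of the two installation paths; the clone
step is an assumption, and the build flags mirror the repro commands quoted
later on this page:)

# nightly wheel:
pip install --pre mxnet-mkl
# or build the release branch from source with MKL-DNN enabled:
git clone --recursive --branch v1.2.0 https://github.com/apache/incubator-mxnet
cd incubator-mxnet
make -j $(nproc) USE_BLAS=openblas USE_MKLDNN=1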

Thanks


Best Regards

Lai Wei

https://www.linkedin.com/pub/lai-wei/2b/731/52b

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-07 Thread Marco de Abreu
Sorry everybody, but it seems like our ARM64/Jetson build was just broken
by the creators of our base cross-compile Dockerfile, called 'dockcross'.
This is one of our base images, used to cross-compile for ARM64 (Jetson
specifically). The owners merged a PR two days ago at [1] which broke our
build pipeline for Jetson devices (the OpenBLAS dependency, to be specific).
Releasing MXNet in its current state would mean that we release it
non-buildable for Jetson devices.

The reason our CI has not discovered this yet is that this base image is
cached on all of our slaves. We do this on purpose to ensure a consistent
environment, so that our entire CI does not suddenly crash because of
third-party updates like this one. I have just discovered this problem on
our test environment, which works without caches. To track this case, I have
created an issue at [2]. Unfortunately, this was unavoidable since the
project does not maintain any tagging or versioning scheme for their
Dockerfiles [3] - instead, they automatically push to production.

-1 from my side until this has been resolved.

-Marco

[1]: https://github.com/dockcross/dockcross/pull/221
[2]: https://github.com/apache/incubator-mxnet/issues/10837
[3]: https://microbadger.com/images/dockcross/linux-arm64
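
(For readers unfamiliar with image pinning, a minimal sketch of the
difference; the digest below is a placeholder, and our actual fix is the
mirror under the "mxnetci" Docker Hub organization mentioned above:)

# Floating tag: upstream pushes silently change the base image underneath us.
docker pull dockcross/linux-arm64
# Pinned digest: the image is immutable until we bump the digest deliberately.
docker pull dockcross/linux-arm64@sha256:<digest>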


On Mon, May 7, 2018 at 7:38 PM, Haibin Lin  wrote:

> +1 binding. Built from source with CUDA, ran the linear classification
> example, and it works fine.
>
> Best.
> Haibin

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-06 Thread Steffen Rochel
+1 (non-binding). Tested with selected notebooks from The Straight Dope.
There are so many important enhancements that everybody contributed and that
our users are waiting for. I hope we will see more votes.
Steffen

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-06 Thread Anirudh
Hi all,

Since we don't have enough binding votes yet, I am extending the vote till
tomorrow (Monday May 7th), 12:50 PM PDT.

Anirudh

On Sun, May 6, 2018 at 4:05 PM, Anirudh  wrote:

> Hi Pedro,
>
> Thanks for the clarification. I was able to reproduce the issue with
> USE_OPENMP=OFF. I wasn't able to reproduce the issue with Make. Since the
> issue is not reproducible with make and the number of customers using USE_OPENMP=OFF
> with cmake should be small, I agree with you that this should not be a
> blocker. I have added the issue to known issues in release notes:
> https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
>
> Anirudh
>
> On Sun, May 6, 2018 at 9:03 AM, Pedro Larroy  > wrote:
>
>> Agreed, I was not aware that the problems were not present in the release
>> branch.
>>
>> On Fri, May 4, 2018 at 8:32 PM, Haibin Lin 
>> wrote:
>>
>> > I agree with Anirudh that the focus of the discussion should be limited
>> to
>> > the release branch, not the master branch. Anything that breaks on
>> master
>> > but works on release branch should not block the release itself.
>> >
>> >
>> > Best,
>> >
>> > Haibin
>> >
>> > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
>> > pedro.larroy.li...@gmail.com>
>> > wrote:
>> >
>> > > I see your point.
>> > >
>> > > I checked the failures on the v1.2.0 branch and I don't see segfaults,
>> > > just minor failures due to flaky tests.
>> > >
>> > > I will trigger it repeatedly a few times until Sunday and change my
>> > > vote accordingly.
>> > >
>> > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
>> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
>> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/
>> > >
>> > >
>> > > Pedro.
>> > >
>> > > On Fri, May 4, 2018 at 7:16 PM, Anirudh 
>> wrote:
>> > >
>> > > > Hi Pedro,
>> > > >
>> > > > Thank you for the suggestions. I will try to reproduce this without
>> > > > fixed seeds and also run it for a longer time duration.
>> > > > Having said that, running unit tests over and over for a couple of
>> > > > days will likely cause problems, because there are around 42 open
>> > > > issues for flaky tests:
>> > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
>> > > > Also, the release branch has diverged from master around 3 weeks back
>> > > > and it doesn't have many of the changes merged to master.
>> > > > So, my question essentially is, what will be your benchmark to accept
>> > > > the release?
>> > > > Is it that we run the test which you provided on 1.2 without fixed
>> > > > seeds and for a longer duration without failures?
>> > > > Or is it that all unit tests should pass over a period of 2 days
>> > > > without issues? This may require fixing all of the flaky tests, which
>> > > > would delay the release by a considerable amount of time.
>> > > > Or is it something else?
>> > > >
>> > > > Anirudh
>> > > >
>> > > >
>> > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
>> > > pedro.larroy.li...@gmail.com
>> > > > >
>> > > > wrote:
>> > > >
>> > > > > Could you remove the fixed seeds and run it for a couple of hours
>> > with
>> > > an
>> > > > > additional loop?  Also I would suggest running the unit tests over
>> > and
>> > > > over
>> > > > > for a couple of days if possible.
>> > > > >
>> > > > >
>> > > > > Pedro.
>> > > > >
>> > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh 
>> > wrote:
>> > > > >
>> > > > > > Hi Pedro and Naveen,
>> > > > > >
>> > > > > > I am able to reproduce this issue with MKLDNN on the master,
>> > > > > > but not on the 1.2.RC2 branch.
>> > > > > >
>> > > > > > Did the following on 1.2.RC2 branch:
>> > > > > >
>> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
>> USE_DIST_KVSTORE=0
>> > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
>> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
>> > > > > > export MXNET_TEST_SEED=11
>> > > > > > export MXNET_MODULE_SEED=812478194
>> > > > > > export MXNET_TEST_COUNT=1
>> > > > > > nosetests-2.7 -v tests/python/unittest/test_
>> > > > > module.py:test_forward_reshape
>> > > > > >
>> > > > > > Was able to do the 10k runs successfully.
>> > > > > >
>> > > > > > Anirudh
>> > > > > >
>> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
>> > > wrote:
>> > > > > >
>> > > > > > > Hi Pedro and Naveen,
>> > > > > > >
>> > > > > > > Is this issue reproducible when MXNet is built with
>> USE_MKLDNN=0?
>> > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into
>> the
>> > > > release
>> > > > > > > branch. Is this issue reproducible on the release branch ?
>> > > > > > > In my opinion, since we have marked MKLDNN as an experimental
>> > > > > > > feature for the release, if it is confirmed to be a MKLDNN issue
>> > > > > > > we don't need to block the release on it.

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-06 Thread Anirudh
Hi Pedro,

Thanks for the clarification. I was able to reproduce the issue with
USE_OPENMP=OFF. I wasn't able to reproduce the issue with make. Since the
issue is not reproducible with make, and the number of customers using
USE_OPENMP=OFF with cmake should be small, I agree with you that this should
not be a blocker. I have added the issue to the known issues in the release
notes:
https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2

Anirudh

On Sun, May 6, 2018 at 9:03 AM, Pedro Larroy 
wrote:

> Agreed, I was not aware that the problems were not present in the release
> branch.
>
> On Fri, May 4, 2018 at 8:32 PM, Haibin Lin 
> wrote:
>
> > I agree with Anirudh that the focus of the discussion should be limited
> to
> > the release branch, not the master branch. Anything that breaks on master
> > but works on release branch should not block the release itself.
> >
> >
> > Best,
> >
> > Haibin
> >
> > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > I see your point.
> > >
> > > I checked the failures on the v1.2.0 branch and I don't see segfaults,
> > just
> > > minor failures due to flaky tests.
> > >
> > > I will trigger it repeatedly a few times until Sunday and change my
> > > vote accordingly.
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > incubator-mxnet/detail/v1.2.0/17/pipeline
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > incubator-mxnet/detail/v1.2.0/15/pipeline/
> > >
> > >
> > > Pedro.
> > >
> > > On Fri, May 4, 2018 at 7:16 PM, Anirudh  wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > Thank you for the suggestions. I will try to reproduce this without
> > fixed
> > > > seeds and also run it for a longer time duration.
> > > > Having said that, running the unit tests over and over for a couple
> > > > of days will likely cause problems, because there are around 42 open
> > > > issues for flaky tests:
> > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > > Also, the release branch diverged from master around 3 weeks ago and
> > > > doesn't include many of the changes merged to master since.
> > > > So, my question essentially is: what will be your benchmark to accept
> > > > the release?
> > > > Is it that we run the test which you provided on 1.2 without fixed
> > > > seeds and for a longer duration without failures?
> > > > Or is it that all unit tests should pass over a period of 2 days
> > > > without issues? This may require fixing all of the flaky tests, which
> > > > would delay the release by a considerable amount of time.
> > > > Or is it something else?
> > > >
> > > > Anirudh
> > > >
> > > >
> > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> > > pedro.larroy.li...@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Could you remove the fixed seeds and run it for a couple of hours
> > with
> > > an
> > > > > additional loop?  Also I would suggest running the unit tests over
> > and
> > > > over
> > > > > for a couple of days if possible.
> > > > >
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh 
> > wrote:
> > > > >
> > > > > > Hi Pedro and Naveen,
> > > > > >
> > > > > > I am able to reproduce this issue with MKLDNN on the master, but
> > > > > > not on the 1.2.RC2 branch.
> > > > > >
> > > > > > Did the following on 1.2.RC2 branch:
> > > > > >
> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> USE_DIST_KVSTORE=0
> > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > export MXNET_TEST_SEED=11
> > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > export MXNET_TEST_COUNT=1
> > > > > > nosetests-2.7 -v tests/python/unittest/test_
> > > > > module.py:test_forward_reshape
> > > > > >
> > > > > > Was able to do the 10k runs successfully.
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> > > wrote:
> > > > > >
> > > > > > > Hi Pedro and Naveen,
> > > > > > >
> > > > > > > Is this issue reproducible when MXNet is built with
> USE_MKLDNN=0?
> > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > > release
> > > > > > > branch. Is this issue reproducible on the release branch ?
> > > > > > > In my opinion, since we have marked MKLDNN as experimental
> > feature
> > > > for
> > > > > > the
> > > > > > > release, if it is confirmed to be a MKLDNN issue
> > > > > > > we don't need to block the release on it.
> > > > > > >
> > > > > > > Anirudh
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <
> mnnav...@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-05 Thread Anirudh
Hi Pedro,

Thank you for raising this issue! I am not able to reproduce this on Ubuntu
16.04 with cmake 3.5.1.
Can you please provide the reproduction steps for the issue?

Anirudh
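
A configuration along these lines should exercise the reported combination.
This is a sketch rather than the reporter's exact steps; USE_OPENMP and
USE_CUDA are options the MXNet CMake build accepts, while the generator and
paths are illustrative.

# Out-of-source cmake build with OpenMP disabled, CPU only
mkdir -p build && cd build
cmake -GNinja -DUSE_OPENMP=OFF -DUSE_CUDA=OFF ..
ninja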

On Sat, May 5, 2018 at 3:12 AM, Pedro Larroy 
wrote:

> Actually I have a linking problem on my Ubuntu desktop that is fixed in
> master:
>
> dmlc::ThreadedIter<std::vector<...> >::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
> `std::thread::thread<...>(dmlc::ThreadedIter<std::vector<...> >::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
> `std::thread::thread<...>(dmlc::ThreadedIter<RowBlockContainer<int> >::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
> `std::thread::thread<...>(dmlc::ThreadedIter<RowBlockContainer<long> >::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(io.cc.o): In function
> `std::thread::thread<...>(dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> collect2: error: ld returned 1 exit status
> ninja: build stopped: subcommand failed.
>
>
> Can we update dmlc-core on the release branch? This was recently fixed:
> https://github.com/dmlc/dmlc-core/commit/b744643f386660ddc39467a04e3a98853a7419b9
>
> On Sat, May 5, 2018 at 11:59 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > Hi
> >
> > Looks like only the gluon lambda test is failing intermittently, and it
> > looks like a minor numerical issue.
> >
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/
> > jenkins/incubator-mxnet/detail/v1.2.0/20/pipeline
> >
> > I triggered a few builds yesterday and they all passed. I think Anirudh
> is
> > right.
> >
> > Changing my vote to +1 (non binding).
> >
> >
> > Pedro.
> >
> >
> >
> > On Sat, May 5, 2018 at 12:10 AM, Jun Wu  wrote:
> >
> >> +1
> >> I built from source and ran all the model quantization examples
> >> successfully.
> >>
> >> On Fri, May 4, 2018 at 3:05 PM, Anirudh  wrote:
> >>
> >> > Hi Pedro, Haibin, Indhu,
> >> >
> >> > Thank you for your inputs on the release. I ran the test
> >> > `test_module.py:test_forward_reshape` 250k times with different seeds.
> >> > I was unable to reproduce the issue on the release branch.
> >> > If everything goes well with CI tests by Pedro running till Sunday, I
> >> think
> >> > we should move forward with the release (given that we have enough
> +1s).
> >> > Is it possible to trigger the CI on the 1.2 branch repeatedly or at a
> >> fixed
> >> > schedule till Sunday?
> >> >
> >> > Anirudh
> >> >
> >> > On Fri, May 4, 2018 at 11:56 AM, Indhu 
> wrote:
> >> >
> >> > > +1
> >> > >
> >> > > I've been using the CUDA build from this branch (built from source)
> >> > > on Ubuntu for a couple of days now and I haven't seen any issues.
> >> > >
> >> > > The flaky tests need to be fixed but this release need not be
> blocked
> >> for
> >> > > that.
> >> > >
> >> > >
> >> > > On Fri, May 4, 2018 at 11:32 AM, Haibin Lin <
> haibin.lin@gmail.com
> >> >
> >> > > wrote:
> >> > >
> >> > > > I agree with Anirudh that the focus of the discussion should be
> >> limited
> >> > > to
> >> > > > the release branch, not the master branch. Anything that breaks on
> >> > master
> >> > > > but works on release branch should not block the release itself.
> >> > > >
> >> > > >
> >> > > > Best,
> >> > > >
> >> > > > Haibin
> >> > > >
> 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-05 Thread Marco de Abreu
We had 4 out of 20 runs fail:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/26
- already tracked at https://github.com/apache/incubator-mxnet/issues/10280
since 03/27
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/28
- already tracked at https://github.com/apache/incubator-mxnet/issues/9853
since 02/21
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/31
- S3 timeout
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/32
- already tracked at https://github.com/apache/incubator-mxnet/issues/10376
since 04/03

Best regards,
Marco


On Sat, May 5, 2018 at 12:12 PM, Pedro Larroy 
wrote:

> Actually I have a linking problem on my Ubuntu desktop that is fixed in
> master:
>
> dmlc::ThreadedIter<std::vector<...> >::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
> `std::thread::thread<...>(dmlc::ThreadedIter<std::vector<...> >::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
> `std::thread::thread<...>(dmlc::ThreadedIter<RowBlockContainer<int> >::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
> `std::thread::thread<...>(dmlc::ThreadedIter<RowBlockContainer<long> >::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(io.cc.o): In function
> `std::thread::thread<...>(dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(...)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> collect2: error: ld returned 1 exit status
> ninja: build stopped: subcommand failed.
>
>
> Can we update dmlc-core on the release branch? This was recently fixed:
> https://github.com/dmlc/dmlc-core/commit/b744643f386660ddc39467a04e3a98853a7419b9
>
> On Sat, May 5, 2018 at 11:59 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > Hi
> >
> > Looks like only the gluon lambda test is failing intermittently, and it
> > looks like a minor numerical issue.
> >
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/
> > jenkins/incubator-mxnet/detail/v1.2.0/20/pipeline
> >
> > I triggered a few builds yesterday and they all passed. I think Anirudh
> is
> > right.
> >
> > Changing my vote to +1 (non binding).
> >
> >
> > Pedro.
> >
> >
> >
> > On Sat, May 5, 2018 at 12:10 AM, Jun Wu  wrote:
> >
> >> +1
> >> I built from source and ran all the model quantization examples
> >> successfully.
> >>
> >> On Fri, May 4, 2018 at 3:05 PM, Anirudh  wrote:
> >>
> >> > Hi Pedro, Haibin, Indhu,
> >> >
> >> > Thank you for your inputs on the release. I ran the test
> >> > `test_module.py:test_forward_reshape` 250k times with different seeds.
> >> > I was unable to reproduce the issue on the release branch.
> >> > If everything goes well with CI tests by Pedro running till Sunday, I
> >> think
> >> > we should move forward with the release (given that we have enough
> +1s).
> >> > Is it possible to trigger the CI on the 1.2 branch repeatedly or at a
> >> fixed
> >> > schedule till Sunday?
> >> >
> >> > Anirudh
> >> >
> >> > On Fri, May 4, 2018 at 11:56 AM, Indhu 
> wrote:
> >> >
> >> > > +1
> >> > >
> >> > > I've been using the CUDA build from this branch (built from source)
> >> > > on Ubuntu for a couple of days now and I haven't seen any issues.
> >> > >
> >> > > The flaky tests need to be fixed but this release need not be blocked
> >> > > for that.

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-05 Thread Pedro Larroy
Actually I have a linking problem on my Ubuntu desktop that is fixed in
master:

dmlc::ThreadedIter<std::vector<...> >::Init(...)::{lambda()#1}&)':
/usr/include/c++/5/thread:137: undefined reference to `pthread_create'
3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
`std::thread::thread<...>(dmlc::ThreadedIter<std::vector<...> >::Init(...)::{lambda()#1}&)':
/usr/include/c++/5/thread:137: undefined reference to `pthread_create'
3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
`std::thread::thread<...>(dmlc::ThreadedIter<RowBlockContainer<int> >::Init(...)::{lambda()#1}&)':
/usr/include/c++/5/thread:137: undefined reference to `pthread_create'
3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
`std::thread::thread<...>(dmlc::ThreadedIter<RowBlockContainer<long> >::Init(...)::{lambda()#1}&)':
/usr/include/c++/5/thread:137: undefined reference to `pthread_create'
3rdparty/dmlc-core/libdmlc.a(io.cc.o): In function
`std::thread::thread<...>(dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(...)::{lambda()#1}&)':
/usr/include/c++/5/thread:137: undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.


Can we update dmlc-core on the release branch? This was recently fixed:
https://github.com/dmlc/dmlc-core/commit/b744643f386660ddc39467a04e3a98853a7419b9
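
Updating the pin would be a one-commit change on the release branch. A
sketch, assuming a recursive clone and that the fix is the dmlc/dmlc-core
commit linked above:

git checkout v1.2.0
cd 3rdparty/dmlc-core
git fetch origin             # origin inside the submodule is dmlc/dmlc-core
git checkout b744643f386660ddc39467a04e3a98853a7419b9
cd ../..
git add 3rdparty/dmlc-core   # record the new submodule pin
git commit -m "Update dmlc-core to pick up the pthread_create linking fix"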

On Sat, May 5, 2018 at 11:59 AM, Pedro Larroy 
wrote:

> Hi
>
> Looks like only the gluon lambda test is failing intermittently, and it
> looks like a minor numerical issue.
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/
> jenkins/incubator-mxnet/detail/v1.2.0/20/pipeline
>
> I triggered a few builds yesterday and they all passed. I think Anirudh is
> right.
>
> Changing my vote to +1 (non binding).
>
>
> Pedro.
>
>
>
> On Sat, May 5, 2018 at 12:10 AM, Jun Wu  wrote:
>
>> +1
>> I built from source and ran all the model quantization examples
>> successfully.
>>
>> On Fri, May 4, 2018 at 3:05 PM, Anirudh  wrote:
>>
>> > Hi Pedro, Haibin, Indhu,
>> >
>> > Thank you for your inputs on the release. I ran the test
>> > `test_module.py:test_forward_reshape` 250k times with different seeds.
>> > I was unable to reproduce the issue on the release branch.
>> > If everything goes well with CI tests by Pedro running till Sunday, I
>> think
>> > we should move forward with the release (given that we have enough +1s).
>> > Is it possible to trigger the CI on the 1.2 branch repeatedly or at a
>> fixed
>> > schedule till Sunday?
>> >
>> > Anirudh
>> >
>> > On Fri, May 4, 2018 at 11:56 AM, Indhu  wrote:
>> >
>> > > +1
>> > >
>> > > I've been using the CUDA build from this branch (built from source)
>> > > on Ubuntu for a couple of days now and I haven't seen any issues.
>> > >
>> > > The flaky tests need to be fixed but this release need not be blocked
>> for
>> > > that.
>> > >
>> > >
>> > > On Fri, May 4, 2018 at 11:32 AM, Haibin Lin > >
>> > > wrote:
>> > >
>> > > > I agree with Anirudh that the focus of the discussion should be
>> limited
>> > > to
>> > > > the release branch, not the master branch. Anything that breaks on
>> > master
>> > > > but works on release branch should not block the release itself.
>> > > >
>> > > >
>> > > > Best,
>> > > >
>> > > > Haibin
>> > > >
>> > > > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
>> > > > pedro.larroy.li...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > I see your point.
>> > > > >
>> > > > > I checked the failures on the v1.2.0 branch and I don't see
>> > segfaults,
>> > > > just
>> > > > > minor failures due to flaky tests.
>> > > > >
>> > > > > I will trigger it repeatedly a few times until Sunday and change
>> > > > > my vote accordingly.
>> > > > >
>> > > > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
>> > mxnet/job/v1.2.0/
>> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
>> > > > > incubator-mxnet/detail/v1.2.0/17/pipeline
>> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
>> > > > > incubator-mxnet/detail/v1.2.0/15/pipeline/

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-05 Thread Pedro Larroy
Hi

Looks like only the gluon lambda test is failing intermittently, and it looks
like a minor numerical issue.

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
incubator-mxnet/detail/v1.2.0/20/pipeline

I triggered a few builds yesterday and they all passed. I think Anirudh is
right.

Changing my vote to +1 (non binding).


Pedro.



On Sat, May 5, 2018 at 12:10 AM, Jun Wu  wrote:

> +1
> I built from source and ran all the model quantization examples
> successfully.
>
> On Fri, May 4, 2018 at 3:05 PM, Anirudh  wrote:
>
> > Hi Pedro, Haibin, Indhu,
> >
> > Thank you for your inputs on the release. I ran the test
> > `test_module.py:test_forward_reshape` 250k times with different seeds.
> > I was unable to reproduce the issue on the release branch.
> > If everything goes well with CI tests by Pedro running till Sunday, I
> think
> > we should move forward with the release (given that we have enough +1s).
> > Is it possible to trigger the CI on the 1.2 branch repeatedly or at a
> fixed
> > schedule till Sunday?
> >
> > Anirudh
> >
> > On Fri, May 4, 2018 at 11:56 AM, Indhu  wrote:
> >
> > > +1
> > >
> > > I've been using the CUDA build from this branch (built from source)
> > > on Ubuntu for a couple of days now and I haven't seen any issues.
> > >
> > > The flaky tests need to be fixed but this release need not be blocked
> for
> > > that.
> > >
> > >
> > > On Fri, May 4, 2018 at 11:32 AM, Haibin Lin 
> > > wrote:
> > >
> > > > I agree with Anirudh that the focus of the discussion should be
> limited
> > > to
> > > > the release branch, not the master branch. Anything that breaks on
> > master
> > > > but works on release branch should not block the release itself.
> > > >
> > > >
> > > > Best,
> > > >
> > > > Haibin
> > > >
> > > > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com>
> > > > wrote:
> > > >
> > > > > I see your point.
> > > > >
> > > > > I checked the failures on the v1.2.0 branch and I don't see
> > segfaults,
> > > > just
> > > > > minor failures due to flaky tests.
> > > > >
> > > > > I will trigger it repeatedly a few times until Sunday and change
> > > > > my vote accordingly.
> > > > >
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> > mxnet/job/v1.2.0/
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > > > incubator-mxnet/detail/v1.2.0/17/pipeline
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > > > incubator-mxnet/detail/v1.2.0/15/pipeline/
> > > > >
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Fri, May 4, 2018 at 7:16 PM, Anirudh 
> > wrote:
> > > > >
> > > > > > Hi Pedro,
> > > > > >
> > > > > > Thank you for the suggestions. I will try to reproduce this
> without
> > > > fixed
> > > > > > seeds and also run it for a longer time duration.
> > > > > > Having said that, running the unit tests over and over for a
> > > > > > couple of days will likely cause problems, because there are
> > > > > > around 42 open issues for flaky tests:
> > > > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > > > > Also, the release branch diverged from master around 3 weeks ago
> > > > > > and doesn't include many of the changes merged to master since.
> > > > > > So, my question essentially is: what will be your benchmark to
> > > > > > accept the release?
> > > > > > Is it that we run the test which you provided on 1.2 without
> > > > > > fixed seeds and for a longer duration without failures?
> > > > > > Or is it that all unit tests should pass over a period of 2 days
> > > > > > without issues? This may require fixing all of the flaky tests,
> > > > > > which would delay the release by a considerable amount of time.
> > > > > > Or is it something else?
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > >
> > > > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> > > > > pedro.larroy.li...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Could you remove the fixed seeds and run it for a couple of
> hours
> > > > with
> > > > > an
> > > > > > > additional loop?  Also I would suggest running the unit tests
> > over
> > > > and
> > > > > > over
> > > > > > > for a couple of days if possible.
> > > > > > >
> > > > > > >
> > > > > > > Pedro.
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh  >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Pedro and Naveen,
> > > > > > > >
> > > > > > > > I am able to reproduce this issue with MKLDNN on the master,
> > > > > > > > but not on the 1.2.RC2 branch.
> > > > > > > >
> > > > > > > > Did the following on 1.2.RC2 branch:
> > > > > > > >
> > > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> > > > > > > > USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-05 Thread Marco de Abreu
I can start a bunch of builds. I'll send a link when they are done.

-Marco
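
If we want the triggering on a fixed schedule rather than by hand, one option
is a cron entry against Jenkins' standard remote-build endpoint. A sketch:
the job path matches the links in this thread, while the user/token pair and
the interval are placeholders.

# crontab entry: trigger the v1.2.0 branch job every 4 hours
0 */4 * * * curl -s -X POST --user "user:apitoken" "http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/build"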

Jun Wu  wrote on Sat., May 5, 2018, 00:10:

> +1
> I built from source and ran all the model quantization examples
> successfully.
>
> On Fri, May 4, 2018 at 3:05 PM, Anirudh  wrote:
>
> > Hi Pedro, Haibin, Indhu,
> >
> > Thank you for your inputs on the release. I ran the test
> > `test_module.py:test_forward_reshape` 250k times with different seeds.
> > I was unable to reproduce the issue on the release branch.
> > If everything goes well with CI tests by Pedro running till Sunday, I
> think
> > we should move forward with the release (given that we have enough +1s).
> > Is it possible to trigger the CI on the 1.2 branch repeatedly or at a
> fixed
> > schedule till Sunday?
> >
> > Anirudh
> >
> > On Fri, May 4, 2018 at 11:56 AM, Indhu  wrote:
> >
> > > +1
> > >
> > > I've been using the CUDA build from this branch (built from source)
> > > on Ubuntu for a couple of days now and I haven't seen any issues.
> > >
> > > The flaky tests need to be fixed but this release need not be blocked
> for
> > > that.
> > >
> > >
> > > On Fri, May 4, 2018 at 11:32 AM, Haibin Lin 
> > > wrote:
> > >
> > > > I agree with Anirudh that the focus of the discussion should be
> limited
> > > to
> > > > the release branch, not the master branch. Anything that breaks on
> > master
> > > > but works on release branch should not block the release itself.
> > > >
> > > >
> > > > Best,
> > > >
> > > > Haibin
> > > >
> > > > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com>
> > > > wrote:
> > > >
> > > > > I see your point.
> > > > >
> > > > > I checked the failures on the v1.2.0 branch and I don't see
> > segfaults,
> > > > just
> > > > > minor failures due to flaky tests.
> > > > >
> > > > > I will trigger it repeatedly a few times until Sunday and change
> > > > > my vote accordingly.
> > > > >
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> > mxnet/job/v1.2.0/
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > > > incubator-mxnet/detail/v1.2.0/17/pipeline
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > > > incubator-mxnet/detail/v1.2.0/15/pipeline/
> > > > >
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Fri, May 4, 2018 at 7:16 PM, Anirudh 
> > wrote:
> > > > >
> > > > > > Hi Pedro,
> > > > > >
> > > > > > Thank you for the suggestions. I will try to reproduce this
> without
> > > > fixed
> > > > > > seeds and also run it for a longer time duration.
> > > > > > Having said that, running the unit tests over and over for a
> > > > > > couple of days will likely cause problems, because there are
> > > > > > around 42 open issues for flaky tests:
> > > > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > > > > Also, the release branch diverged from master around 3 weeks ago
> > > > > > and doesn't include many of the changes merged to master since.
> > > > > > So, my question essentially is: what will be your benchmark to
> > > > > > accept the release?
> > > > > > Is it that we run the test which you provided on 1.2 without
> > > > > > fixed seeds and for a longer duration without failures?
> > > > > > Or is it that all unit tests should pass over a period of 2 days
> > > > > > without issues? This may require fixing all of the flaky tests,
> > > > > > which would delay the release by a considerable amount of time.
> > > > > > Or is it something else?
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > >
> > > > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> > > > > pedro.larroy.li...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Could you remove the fixed seeds and run it for a couple of
> hours
> > > > with
> > > > > an
> > > > > > > additional loop?  Also I would suggest running the unit tests
> > over
> > > > and
> > > > > > over
> > > > > > > for a couple of days if possible.
> > > > > > >
> > > > > > >
> > > > > > > Pedro.
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh  >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Pedro and Naveen,
> > > > > > > >
> > > > > > > > I am able to reproduce this issue with MKLDNN on the master,
> > > > > > > > but not on the 1.2.RC2 branch.
> > > > > > > >
> > > > > > > > Did the following on 1.2.RC2 branch:
> > > > > > > >
> > > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> > > USE_DIST_KVSTORE=0
> > > > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > > > export MXNET_TEST_SEED=11
> > > > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > > > export MXNET_TEST_COUNT=1

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Jun Wu
+1
I built from source and ran all the model quantization examples
successfully.

On Fri, May 4, 2018 at 3:05 PM, Anirudh  wrote:

> Hi Pedro, Haibin, Indhu,
>
> Thank you for your inputs on the release. I ran the test
> `test_module.py:test_forward_reshape` 250k times with different seeds.
> I was unable to reproduce the issue on the release branch.
> If everything goes well with CI tests by Pedro running till Sunday, I think
> we should move forward with the release (given that we have enough +1s).
> Is it possible to trigger the CI on the 1.2 branch repeatedly or at a fixed
> schedule till Sunday?
>
> Anirudh
>
> On Fri, May 4, 2018 at 11:56 AM, Indhu  wrote:
>
> > +1
> >
> > I've been using the CUDA build from this branch (built from source) on
> > Ubuntu for a couple of days now and I haven't seen any issues.
> >
> > The flaky tests need to be fixed but this release need not be blocked for
> > that.
> >
> >
> > On Fri, May 4, 2018 at 11:32 AM, Haibin Lin 
> > wrote:
> >
> > > I agree with Anirudh that the focus of the discussion should be limited
> > to
> > > the release branch, not the master branch. Anything that breaks on
> master
> > > but works on release branch should not block the release itself.
> > >
> > >
> > > Best,
> > >
> > > Haibin
> > >
> > > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> > > pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > I see your point.
> > > >
> > > > I checked the failures on the v1.2.0 branch and I don't see
> segfaults,
> > > just
> > > > minor failures due to flaky tests.
> > > >
> > > > I will trigger it repeatedly a few times until Sunday and change my
> > > > vote accordingly.
> > > >
> > > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> mxnet/job/v1.2.0/
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > > incubator-mxnet/detail/v1.2.0/17/pipeline
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > > incubator-mxnet/detail/v1.2.0/15/pipeline/
> > > >
> > > >
> > > > Pedro.
> > > >
> > > > On Fri, May 4, 2018 at 7:16 PM, Anirudh 
> wrote:
> > > >
> > > > > Hi Pedro,
> > > > >
> > > > > Thank you for the suggestions. I will try to reproduce this without
> > > fixed
> > > > > seeds and also run it for a longer time duration.
> > > > > Having said that, running the unit tests over and over for a couple
> > > > > of days will likely cause problems, because there are around 42 open
> > > > > issues for flaky tests:
> > > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > > > Also, the release branch diverged from master around 3 weeks ago and
> > > > > doesn't include many of the changes merged to master since.
> > > > > So, my question essentially is: what will be your benchmark to
> > > > > accept the release?
> > > > > Is it that we run the test which you provided on 1.2 without fixed
> > > > > seeds and for a longer duration without failures?
> > > > > Or is it that all unit tests should pass over a period of 2 days
> > > > > without issues? This may require fixing all of the flaky tests,
> > > > > which would delay the release by a considerable amount of time.
> > > > > Or is it something else?
> > > > >
> > > > > Anirudh
> > > > >
> > > > >
> > > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Could you remove the fixed seeds and run it for a couple of hours
> > > with
> > > > an
> > > > > > additional loop?  Also I would suggest running the unit tests
> over
> > > and
> > > > > over
> > > > > > for a couple of days if possible.
> > > > > >
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh 
> > > wrote:
> > > > > >
> > > > > > > Hi Pedro and Naveen,
> > > > > > >
> > > > > > > I am able to reproduce this issue with MKLDNN on the master,
> > > > > > > but not on the 1.2.RC2 branch.
> > > > > > >
> > > > > > > Did the following on 1.2.RC2 branch:
> > > > > > >
> > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> > USE_DIST_KVSTORE=0
> > > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > > export MXNET_TEST_SEED=11
> > > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > > export MXNET_TEST_COUNT=1
> > > > > > > nosetests-2.7 -v tests/python/unittest/test_
> > > > > > module.py:test_forward_reshape
> > > > > > >
> > > > > > > Was able to do the 10k runs successfully.
> > > > > > >
> > > > > > > Anirudh
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh  >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Pedro and Naveen,
> > > > > > > >
> > > > > > > > Is this issue reproducible when MXNet is built with
> > > > > > > > USE_MKLDNN=0?

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Anirudh
Hi Pedro, Haibin, Indhu,

Thank you for your inputs on the release. I ran the test
`test_module.py:test_forward_reshape` 250k times with different seeds.
I was unable to reproduce the issue on the release branch.
If everything goes well with CI tests by Pedro running till Sunday, I think
we should move forward with the release (given that we have enough +1s).
Is it possible to trigger the CI on the 1.2 branch repeatedly or at a fixed
schedule till Sunday?

Anirudh

On Fri, May 4, 2018 at 11:56 AM, Indhu  wrote:

> +1
>
> I've been using the CUDA build from this branch (built from source) on
> Ubuntu for a couple of days now and I haven't seen any issues.
>
> The flaky tests need to be fixed but this release need not be blocked for
> that.
>
>
> On Fri, May 4, 2018 at 11:32 AM, Haibin Lin 
> wrote:
>
> > I agree with Anirudh that the focus of the discussion should be limited
> to
> > the release branch, not the master branch. Anything that breaks on master
> > but works on release branch should not block the release itself.
> >
> >
> > Best,
> >
> > Haibin
> >
> > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > I see your point.
> > >
> > > I checked the failures on the v1.2.0 branch and I don't see segfaults,
> > just
> > > minor failures due to flaky tests.
> > >
> > > I will trigger it repeatedly a few times until Sunday and change my
> > > vote accordingly.
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > incubator-mxnet/detail/v1.2.0/17/pipeline
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > incubator-mxnet/detail/v1.2.0/15/pipeline/
> > >
> > >
> > > Pedro.
> > >
> > > On Fri, May 4, 2018 at 7:16 PM, Anirudh  wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > Thank you for the suggestions. I will try to reproduce this without
> > fixed
> > > > seeds and also run it for a longer time duration.
> > > > Having said that, running the unit tests over and over for a couple
> > > > of days will likely cause problems, because there are around 42 open
> > > > issues for flaky tests:
> > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > > Also, the release branch diverged from master around 3 weeks ago and
> > > > doesn't include many of the changes merged to master since.
> > > > So, my question essentially is: what will be your benchmark to accept
> > > > the release?
> > > > Is it that we run the test which you provided on 1.2 without fixed
> > > > seeds and for a longer duration without failures?
> > > > Or is it that all unit tests should pass over a period of 2 days
> > > > without issues? This may require fixing all of the flaky tests, which
> > > > would delay the release by a considerable amount of time.
> > > > Or is it something else?
> > > >
> > > > Anirudh
> > > >
> > > >
> > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> > > pedro.larroy.li...@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Could you remove the fixed seeds and run it for a couple of hours
> > with
> > > an
> > > > > additional loop?  Also I would suggest running the unit tests over
> > and
> > > > over
> > > > > for a couple of days if possible.
> > > > >
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh 
> > wrote:
> > > > >
> > > > > > Hi Pedro and Naveen,
> > > > > >
> > > > > > I am able to reproduce this issue with MKLDNN on the master, but
> > > > > > not on the 1.2.RC2 branch.
> > > > > >
> > > > > > Did the following on 1.2.RC2 branch:
> > > > > >
> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> USE_DIST_KVSTORE=0
> > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > export MXNET_TEST_SEED=11
> > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > export MXNET_TEST_COUNT=1
> > > > > > nosetests-2.7 -v tests/python/unittest/test_
> > > > > module.py:test_forward_reshape
> > > > > >
> > > > > > Was able to do the 10k runs successfully.
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> > > wrote:
> > > > > >
> > > > > > > Hi Pedro and Naveen,
> > > > > > >
> > > > > > > Is this issue reproducible when MXNet is built with
> USE_MKLDNN=0?
> > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > > release
> > > > > > > branch. Is this issue reproducible on the release branch ?
> > > > > > > In my opinion, since we have marked MKLDNN as experimental
> > feature
> > > > for
> > > > > > the
> > > > > > > release, if it is confirmed to be a MKLDNN issue
> > > > > > > we don't need to block the release on it.
> > > > > > >
> > > 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Indhu
+1

I've been using the CUDA build from this branch (built from source) on Ubuntu
for a couple of days now and I haven't seen any issues.

The flaky tests need to be fixed but this release need not be blocked for
that.


On Fri, May 4, 2018 at 11:32 AM, Haibin Lin 
wrote:

> I agree with Anirudh that the focus of the discussion should be limited to
> the release branch, not the master branch. Anything that breaks on master
> but works on release branch should not block the release itself.
>
>
> Best,
>
> Haibin
>
> On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > I see your point.
> >
> > I checked the failures on the v1.2.0 branch and I don't see segfaults,
> just
> > minor failures due to flaky tests.
> >
> > I will trigger it repeatedly a few times until Sunday and change my vote
> > accordingly.
> >
> > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > incubator-mxnet/detail/v1.2.0/17/pipeline
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > incubator-mxnet/detail/v1.2.0/15/pipeline/
> >
> >
> > Pedro.
> >
> > On Fri, May 4, 2018 at 7:16 PM, Anirudh  wrote:
> >
> > > Hi Pedro,
> > >
> > > Thank you for the suggestions. I will try to reproduce this without
> fixed
> > > seeds and also run it for a longer time duration.
> > > Having said that, running the unit tests over and over for a couple of
> > > days will likely cause problems, because there are around 42 open
> > > issues for flaky tests:
> > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > Also, the release branch diverged from master around 3 weeks ago and
> > > doesn't include many of the changes merged to master since.
> > > So, my question essentially is: what will be your benchmark to accept
> > > the release?
> > > Is it that we run the test which you provided on 1.2 without fixed
> > > seeds and for a longer duration without failures?
> > > Or is it that all unit tests should pass over a period of 2 days
> > > without issues? This may require fixing all of the flaky tests, which
> > > would delay the release by a considerable amount of time.
> > > Or is it something else?
> > >
> > > Anirudh
> > >
> > >
> > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> > pedro.larroy.li...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Could you remove the fixed seeds and run it for a couple of hours
> with
> > an
> > > > additional loop?  Also I would suggest running the unit tests over
> and
> > > over
> > > > for a couple of days if possible.
> > > >
> > > >
> > > > Pedro.
> > > >
> > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh 
> wrote:
> > > >
> > > > > Hi Pedro and Naveen,
> > > > >
> > > > > I am able to reproduce this issue with MKLDNN on the master, but
> > > > > not on the 1.2.RC2 branch.
> > > > >
> > > > > Did the following on 1.2.RC2 branch:
> > > > >
> > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > export MXNET_TEST_SEED=11
> > > > > export MXNET_MODULE_SEED=812478194
> > > > > export MXNET_TEST_COUNT=1
> > > > > nosetests-2.7 -v tests/python/unittest/test_
> > > > module.py:test_forward_reshape
> > > > >
> > > > > Was able to do the 10k runs successfully.
> > > > >
> > > > > Anirudh
> > > > >
> > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> > wrote:
> > > > >
> > > > > > Hi Pedro and Naveen,
> > > > > >
> > > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > release
> > > > > > branch. Is this issue reproducible on the release branch ?
> > > > > > In my opinion, since we have marked MKLDNN as experimental
> feature
> > > for
> > > > > the
> > > > > > release, if it is confirmed to be a MKLDNN issue
> > > > > > we don't need to block the release on it.
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy  >
> > > > wrote:
> > > > > >
> > > > > >> Thanks for raising this issue Pedro.
> > > > > >>
> > > > > >> -1(binding)
> > > > > >>
> > > > > >> We were in a similar state for a while a year ago; a lot of
> > > > > >> effort went into stabilizing the tests and the CI. I have seen
> > > > > >> that the PR builds are non-deterministic and you have to retry
> > > > > >> over and over (wasting resources and time) and hope you get lucky.
> > > > > >>
> > > > > >> Look at the dashboard for master build
> > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> > > mxnet/job/master/
> > > > > >>
> > > > > >> -Naveen
> > > > > >>
> > > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Haibin Lin
I agree with Anirudh that the focus of the discussion should be limited to
the release branch, not the master branch. Anything that breaks on master
but works on release branch should not block the release itself.


Best,

Haibin

On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy 
wrote:

> I see your point.
>
> I checked the failures on the v1.2.0 branch and I don't see segfaults, just
> minor failures due to flaky tests.
>
> I will trigger it repeatedly a few times until Sunday and change my vote
> accordingly.
>
> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> incubator-mxnet/detail/v1.2.0/17/pipeline
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> incubator-mxnet/detail/v1.2.0/15/pipeline/
>
>
> Pedro.
>
> On Fri, May 4, 2018 at 7:16 PM, Anirudh  wrote:
>
> > Hi Pedro,
> >
> > Thank you for the suggestions. I will try to reproduce this without fixed
> > seeds and also run it for a longer time duration.
> > Having said that, running the unit tests over and over for a couple of
> > days will likely cause problems, because there are around 42 open issues
> > for flaky tests:
> > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > Also, the release branch diverged from master around 3 weeks ago and
> > doesn't include many of the changes merged to master since.
> > So, my question essentially is: what will be your benchmark to accept the
> > release?
> > Is it that we run the test which you provided on 1.2 without fixed seeds
> > and for a longer duration without failures?
> > Or is it that all unit tests should pass over a period of 2 days without
> > issues? This may require fixing all of the flaky tests, which would delay
> > the release by a considerable amount of time.
> > Or is it something else?
> >
> > Anirudh
> >
> >
> > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > Could you remove the fixed seeds and run it for a couple of hours with
> an
> > > additional loop?  Also I would suggest running the unit tests over and
> > over
> > > for a couple of days if possible.
> > >
> > >
> > > Pedro.
> > >
> > > On Thu, May 3, 2018 at 8:33 PM, Anirudh  wrote:
> > >
> > > > Hi Pedro and Naveen,
> > > >
> > > > I am able to reproduce this issue with MKLDNN on the master but not
> > > > on the 1.2.RC2 branch.
> > > >
> > > > Did the following on 1.2.RC2 branch:
> > > >
> > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > export MXNET_TEST_SEED=11
> > > > export MXNET_MODULE_SEED=812478194
> > > > export MXNET_TEST_COUNT=1
> > > > nosetests-2.7 -v tests/python/unittest/test_
> > > module.py:test_forward_reshape
> > > >
> > > > Was able to do the 10k runs successfully.
> > > >
> > > > Anirudh
> > > >
> > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> wrote:
> > > >
> > > > > Hi Pedro and Naveen,
> > > > >
> > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> > release
> > > > > branch. Is this issue reproducible on the release branch ?
> > > > > In my opinion, since we have marked MKLDNN as experimental feature
> > for
> > > > the
> > > > > release, if it is confirmed to be a MKLDNN issue
> > > > > we don't need to block the release on it.
> > > > >
> > > > > Anirudh
> > > > >
> > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> > > wrote:
> > > > >
> > > > >> Thanks for raising this issue Pedro.
> > > > >>
> > > > >> -1(binding)
> > > > >>
> > > > >> We were in a similar state for a while a year ago; a lot of effort
> > > > >> went into stabilizing the tests and the CI. I have seen that the PR
> > > > >> builds are non-deterministic and you have to retry over and over
> > > > >> (wasting resources and time) and hope you get lucky.
> > > > >>
> > > > >> Look at the dashboard for master build
> > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> > mxnet/job/master/
> > > > >>
> > > > >> -Naveen
> > > > >>
> > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > > > >> pedro.larroy.li...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > -1 nondeterministic failures on CI master:
> > > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > > > >> >
> > > > >> > Was able to reproduce once on a fresh p3 instance with DLAMI;
> > > > >> > can't reproduce consistently.
> > > > >> >
> > > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> > > > wrote:
> > > > >> >
> > > > >> > > Hi all,
> > > > >> > >
> > > > >> > > As part of RC2 release, we have addressed bugs and some
> concerns
> > > > that
> 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Pedro Larroy
I see your point.

I checked the failures on the v1.2.0 branch and I don't see segfaults, just
minor failures due to flaky tests.

I will trigger it repeatedly a few times until Sunday and change my vote
accordingly.

http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/


Pedro.

On Fri, May 4, 2018 at 7:16 PM, Anirudh  wrote:

> Hi Pedro,
>
> Thank you for the suggestions. I will try to reproduce this without fixed
> seeds and also run it for a longer time duration.
> Having said that, running the unit tests over and over for a couple of days
> will likely cause problems, because there are around 42 open issues for
> flaky tests:
> https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> Also, the release branch diverged from master around 3 weeks ago and
> doesn't include many of the changes merged to master since.
> So, my question essentially is: what will be your benchmark to accept the
> release?
> Is it that we run the test which you provided on 1.2 without fixed seeds
> and for a longer duration without failures?
> Or is it that all unit tests should pass over a period of 2 days without
> issues? This may require fixing all of the flaky tests, which would delay
> the release by a considerable amount of time.
> Or is it something else?
>
> Anirudh
>
>
> On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy  >
> wrote:
>
> > Could you remove the fixed seeds and run it for a couple of hours with an
> > additional loop?  Also I would suggest running the unit tests over and
> over
> > for a couple of days if possible.
> >
> >
> > Pedro.
> >
> > On Thu, May 3, 2018 at 8:33 PM, Anirudh  wrote:
> >
> > > Hi Pedro and Naveen,
> > >
> > > I am able to reproduce this issue with MKLDNN on the master but not
> > > on the 1.2.RC2 branch.
> > >
> > > Did the following on 1.2.RC2 branch:
> > >
> > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > export MXNET_TEST_SEED=11
> > > export MXNET_MODULE_SEED=812478194
> > > export MXNET_TEST_COUNT=1
> > > nosetests-2.7 -v tests/python/unittest/test_
> > module.py:test_forward_reshape
> > >
> > > Was able to do the 10k runs successfully.
> > >
> > > Anirudh
> > >
> > > On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:
> > >
> > > > Hi Pedro and Naveen,
> > > >
> > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> release
> > > > branch. Is this issue reproducible on the release branch ?
> > > > In my opinion, since we have marked MKLDNN as experimental feature
> for
> > > the
> > > > release, if it is confirmed to be a MKLDNN issue
> > > > we don't need to block the release on it.
> > > >
> > > > Anirudh
> > > >
> > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> > wrote:
> > > >
> > > >> Thanks for raising this issue Pedro.
> > > >>
> > > >> -1(binding)
> > > >>
> > > >> We were in a similar state for a while a year ago; a lot of effort
> > > >> went into stabilizing the tests and the CI. I have seen that the PR
> > > >> builds are non-deterministic and you have to retry over and over
> > > >> (wasting resources and time) and hope you get lucky.
> > > >>
> > > >> Look at the dashboard for master build
> > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> mxnet/job/master/
> > > >>
> > > >> -Naveen
> > > >>
> > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > > >> pedro.larroy.li...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > -1 nondeterministic failures on CI master:
> > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > > >> >
> > > >> > Was able to reproduce once on a fresh p3 instance with DLAMI; can't
> > > >> > reproduce consistently.
> > > >> >
> > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> > > wrote:
> > > >> >
> > > >> > > Hi all,
> > > >> > >
> > > >> > > As part of RC2 release, we have addressed bugs and some concerns
> > > that
> > > >> > were
> > > >> > > raised.
> > > >> > >
> > > >> > > I would like to propose a vote to release Apache MXNet
> > (incubating)
> > > >> > version
> > > >> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> > > >> 12:50 PM
> > > >> > > PDT, Sunday, May 6th.
> > > >> > >
> > > >> > > Link to release notes:
> > > >> > > https://cwiki.apache.org/confluence/display/MXNET/
> > > >> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > >> > >
> > > >> > > Link to release candidate 1.2.0.rc2:
> > > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Anirudh
Hi Pedro,

Thank you for the suggestions. I will try to reproduce this without fixed
seeds and also run it for a longer time duration.
Having said that, running the unit tests over and over for a couple of days
will likely cause problems, because there are around 42 open issues for
flaky tests:
https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
Also, the release branch diverged from master around 3 weeks ago and doesn't
include many of the changes merged to master since.
So, my question essentially is: what will be your benchmark to accept the
release?
Is it that we run the test which you provided on 1.2 without fixed seeds
and for a longer duration without failures?
Or is it that all unit tests should pass over a period of 2 days without
issues? This may require fixing all of the flaky tests, which would delay
the release by a considerable amount of time.
Or is it something else?

Anirudh


On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy 
wrote:

> Could you remove the fixed seeds and run it for a couple of hours with an
> additional loop?  Also I would suggest running the unit tests over and over
> for a couple of days if possible.
>
>
> Pedro.
>
> On Thu, May 3, 2018 at 8:33 PM, Anirudh  wrote:
>
> > Hi Pedro and Naveen,
> >
> > I am able to reproduce this issue with MKLDNN on the master but not on
> > the 1.2.RC2 branch.
> >
> > Did the following on 1.2.RC2 branch:
> >
> > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > export MXNET_TEST_SEED=11
> > export MXNET_MODULE_SEED=812478194
> > export MXNET_TEST_COUNT=10000
> > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> >
> > Was able to do the 10k runs successfully.
> >
> > Anirudh
> >
> > On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:
> >
> > > Hi Pedro and Naveen,
> > >
> > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > Also, there are a bunch of MKLDNN fixes that didn't go into the release
> > > branch. Is this issue reproducible on the release branch?
> > > In my opinion, since we have marked MKLDNN as an experimental feature
> > > for the release, if it is confirmed to be an MKLDNN issue
> > > we don't need to block the release on it.
> > >
> > > Anirudh
> > >
> > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> wrote:
> > >
> > >> Thanks for raising this issue Pedro.
> > >>
> > >> -1 (binding)
> > >>
> > >> We were in a similar state for a while a year ago; a lot of effort
> > >> went into stabilizing the tests and the CI. I have seen that PR builds
> > >> are non-deterministic and you have to retry over and over (wasting
> > >> resources and time) and hope you get lucky.
> > >>
> > >> Look at the dashboard for the master build:
> > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > >>
> > >> -Naveen
> > >>
> > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > >> pedro.larroy.li...@gmail.com>
> > >> wrote:
> > >>
> > >> > -1: nondeterministic failures on CI master:
> > >> > https://issues.apache.org/jira/browse/MXNET-396
> > >> >
> > >> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> > >> > reproduce consistently.
> > >> >
> > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> > wrote:
> > >> >
> > >> > > Hi all,
> > >> > >
> > >> > > As part of RC2 release, we have addressed bugs and some concerns
> > that
> > >> > were
> > >> > > raised.
> > >> > >
> > >> > > I would like to propose a vote to release Apache MXNet
> (incubating)
> > >> > version
> > >> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> > >> 12:50 PM
> > >> > > PDT, Sunday, May 6th.
> > >> > >
> > >> > > Link to release notes:
> > >> > > https://cwiki.apache.org/confluence/display/MXNET/
> > >> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > >> > >
> > >> > > Link to release candidate 1.2.0.rc2:
> > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > >> > >
> > >> > > Voting results for 1.2.0.rc2:
> > >> > > https://lists.apache.org/thread.html/
> ebe561c609a8e32351dfe4aafc8876
> > >> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > >> > >
> > >> > > View this page, click on "Build from Source", and use the source
> > code
> > >> > > obtained from 1.2.0.rc2 tag:
> > >> > > https://mxnet.incubator.apache.org/install/index.html
> > >> > >
> > >> > > (Note: The README.md points to the 1.2.0 tag and does not work at
> > the
> > >> > > moment.)
> > >> > >
> > >> > > Please remember to test first before voting accordingly:
> > >> > >
> > >> > > +1 = approve
> > >> > > +0 = no opinion
> > >> > > -1 = disapprove (provide reason)
> > >> > >
> > >> > > Anirudh
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Pedro Larroy
Could you remove the fixed seeds and run it for a couple of hours with an
additional loop?  Also I would suggest running the unit tests over and over
for a couple of days if possible.
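
Something like the following is what I have in mind (a sketch only; the
iteration count and the single test path are placeholders, not a
prescription):

unset MXNET_TEST_SEED MXNET_MODULE_SEED   # drop the fixed seeds
for i in $(seq 1 100); do                 # the "additional loop"
  nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape \
    || { echo "failed on iteration $i"; break; }
done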


Pedro.

On Thu, May 3, 2018 at 8:33 PM, Anirudh  wrote:

> Hi Pedro and Naveen,
>
> I am unable to reproduce this issue with MKLDNN on the master but not on
> the 1.2.RC2 branch.
>
> Did the following on 1.2.RC2 branch:
>
> make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> export MXNET_TEST_SEED=11
> export MXNET_MODULE_SEED=812478194
> export MXNET_TEST_COUNT=10000
> nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
>
> Was able to do the 10k runs successfully.
>
> Anirudh
>
> On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:
>
> > Hi Pedro and Naveen,
> >
> > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > Also, there are a bunch of MKLDNN fixes that didn't go into the release
> > branch. Is this issue reproducible on the release branch?
> > In my opinion, since we have marked MKLDNN as an experimental feature
> > for the release, if it is confirmed to be an MKLDNN issue
> > we don't need to block the release on it.
> >
> > Anirudh
> >
> > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy  wrote:
> >
> >> Thanks for raising this issue Pedro.
> >>
> >> -1 (binding)
> >>
> >> We were in a similar state for a while a year ago; a lot of effort
> >> went into stabilizing the tests and the CI. I have seen that PR builds
> >> are non-deterministic and you have to retry over and over (wasting
> >> resources and time) and hope you get lucky.
> >>
> >> Look at the dashboard for the master build:
> >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> >>
> >> -Naveen
> >>
> >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> >> pedro.larroy.li...@gmail.com>
> >> wrote:
> >>
> >> > -1: nondeterministic failures on CI master:
> >> > https://issues.apache.org/jira/browse/MXNET-396
> >> >
> >> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> >> > reproduce consistently.
> >> >
> >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> wrote:
> >> >
> >> > > Hi all,
> >> > >
> >> > > As part of RC2 release, we have addressed bugs and some concerns
> that
> >> > were
> >> > > raised.
> >> > >
> >> > > I would like to propose a vote to release Apache MXNet (incubating)
> >> > version
> >> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> >> 12:50 PM
> >> > > PDT, Sunday, May 6th.
> >> > >
> >> > > Link to release notes:
> >> > > https://cwiki.apache.org/confluence/display/MXNET/
> >> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> >> > >
> >> > > Link to release candidate 1.2.0.rc2:
> >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> >> > >
> >> > > Voting results for 1.2.0.rc2:
> >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876
> >> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> >> > >
> >> > > View this page, click on "Build from Source", and use the source
> code
> >> > > obtained from 1.2.0.rc2 tag:
> >> > > https://mxnet.incubator.apache.org/install/index.html
> >> > >
> >> > > (Note: The README.md points to the 1.2.0 tag and does not work at
> the
> >> > > moment.)
> >> > >
> >> > > Please remember to test first before voting accordingly:
> >> > >
> >> > > +1 = approve
> >> > > +0 = no opinion
> >> > > -1 = disapprove (provide reason)
> >> > >
> >> > > Anirudh
> >> > >
> >> >
> >>
> >
> >
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Pedro Larroy
Hi Anirudh

I see too many random failures, segfaults and other problems. Qualitatively,
I don't think we are in a situation to make a release. For that I would
expect to see master stable for most of the builds, and that's not the case
right now.

My vote is still -1 (non-binding).

If someone is willing and able to revert some of the changes that
destabilized master, then the situation would be different.

Failing CI on PRs is creating problems for getting fixes and changes
merged.

Pedro.




On Thu, May 3, 2018 at 5:46 PM, Anirudh  wrote:

> Hi Pedro and Naveen,
>
> Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> Also, there are a bunch of MKLDNN fixes that didn't go into the release
> branch. Is this issue reproducible on the release branch?
> In my opinion, since we have marked MKLDNN as an experimental feature for
> the release, if it is confirmed to be an MKLDNN issue
> we don't need to block the release on it.
>
> Anirudh
>
> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy  wrote:
>
> > Thanks for raising this issue Pedro.
> >
> > -1 (binding)
> >
> > We were in a similar state for a while a year ago; a lot of effort
> > went into stabilizing the tests and the CI. I have seen that PR builds
> > are non-deterministic and you have to retry over and over (wasting
> > resources and time) and hope you get lucky.
> >
> > Look at the dashboard for the master build:
> > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> >
> > -Naveen
> >
> > On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > -1: nondeterministic failures on CI master:
> > > https://issues.apache.org/jira/browse/MXNET-396
> > >
> > > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> > > reproduce consistently.
> > >
> > > On Wed, May 2, 2018 at 9:51 PM, Anirudh  wrote:
> > >
> > > > Hi all,
> > > >
> > > > As part of RC2 release, we have addressed bugs and some concerns that
> > > were
> > > > raised.
> > > >
> > > > I would like to propose a vote to release Apache MXNet (incubating)
> > > version
> > > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> 12:50
> > PM
> > > > PDT, Sunday, May 6th.
> > > >
> > > > Link to release notes:
> > > > https://cwiki.apache.org/confluence/display/MXNET/
> > > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > >
> > > > Link to release candidate 1.2.0.rc2:
> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > >
> > > > Voting results for 1.2.0.rc2:
> > > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876
> > > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > >
> > > > View this page, click on "Build from Source", and use the source code
> > > > obtained from 1.2.0.rc2 tag:
> > > > https://mxnet.incubator.apache.org/install/index.html
> > > >
> > > > (Note: The README.md points to the 1.2.0 tag and does not work at the
> > > > moment.)
> > > >
> > > > Please remember to test first before voting accordingly:
> > > >
> > > > +1 = approve
> > > > +0 = no opinion
> > > > -1 = disapprove (provide reason)
> > > >
> > > > Anirudh
> > > >
> > >
> >
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Naveen Swamy
+0

On Thu, May 3, 2018 at 12:44 PM, Anirudh  wrote:

> Hi Naveen,
>
> You raise a good point, and I agree that MKLDNN should be switched off by
> default.
> Because of a bug in CMakeLists.txt, which has been fixed as part of #10731
> (merged to master but not to the release branch),
> users won't actually have MKLDNN enabled even though it is set to ON by
> default.
> Since the cmake install instructions have not been published on mxnet.io,
> and cmake is less used than the pip package and make from what I have seen,
> would it be acceptable to you if this is added as a known issue and a
> workaround is provided in the release notes?
> The impacted users (who will have to use the workaround) would be the
> customers who are interested in MKLDNN and use cmake; users who aren't
> interested in the MKLDNN feature won't be impacted.
>
> Anirudh
>
> On Thu, May 3, 2018 at 12:16 PM, Marco de Abreu <
> marco.g.ab...@googlemail.com> wrote:
>
> > The MKLDNN tests are not really less stable than the other tests. It's
> > pretty much the same across all tests we have. So I wouldn't say there's
> a
> > need to fix them in a separate branch.
> >
> > On Thu, May 3, 2018 at 9:00 PM, Naveen Swamy  wrote:
> >
> > > I also meant (but forgot to send) that we stabilize it on a separate
> > > branch and then bring in the changes instead of blocking the PRs.
> > >
> > > On Thu, May 3, 2018 at 11:57 AM, Marco de Abreu <
> > > marco.g.ab...@googlemail.com> wrote:
> > >
> > > > I think the failing tests are really becoming an issue. We now have
> > > > roughly 50 test-failure-related issues [1], leading to an average
> > > > failure rate of 50%. Considering the costs in terms of money and time
> > > > per run, this is adding up quite significantly.
> > > >
> > > > Didn't we just remove MKLML from our codebase to replace it with
> > MKLDNN?
> > > I
> > > > think removing something and marking the replacement as experimental
> > > could
> > > > be difficult from a user perspective. Personally, I don't really feel
> > > > comfortable solving the problem of known issues by marking something
> as
> > > > experimental. We're basically shifting the responsibility to our
> users
> > > that
> > > > way.
> > > >
> > > > I don't think we should stop testing MKLDNN in our CI. We already had
> > the
> > > > situation a few months ago where the solution to failed tests was to
> > > > disable them. We shouldn't go back to that.
> > > >
> > > > -Marco
> > > >
> > > > [1]:
> > > > https://github.com/apache/incubator-mxnet/issues?q=is%
> > > > 3Aopen+is%3Aissue+label%3ATest
> > > >
> > > > On Thu, May 3, 2018 at 8:46 PM, Naveen Swamy 
> > wrote:
> > > >
> > > > > USE_MKLDNN is set to ON in the cmake file by default; since it's
> > > > > experimental, can we turn it OFF so there is some determinism when
> > > > > users build and test?
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/blob/
> > > > > 60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On a separate note, since MKLDNN is experimental, can we stop
> > > > > building it on master and causing PRs to queue up?
> > > > >
> > > > >
> > > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh 
> > > wrote:
> > > > >
> > > > > > Correction: I was able to reproduce the issue with MKLDNN enabled
> > on
> > > > > > master, but not on 1.2 branch.
> > > > > >
> > > > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh 
> > > > wrote:
> > > > > >
> > > > > > > Hi Pedro and Naveen,
> > > > > > >
> > > > > > > I am unable to reproduce this issue with MKLDNN on the master
> but
> > > not
> > > > > on
> > > > > > > the 1.2.RC2 branch.
> > > > > > >
> > > > > > > Did the following on 1.2.RC2 branch:
> > > > > > >
> > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> > USE_DIST_KVSTORE=0
> > > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > > export MXNET_TEST_SEED=11
> > > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > > export MXNET_TEST_COUNT=10000
> > > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > > > > > >
> > > > > > > Was able to do the 10k runs successfully.
> > > > > > >
> > > > > > > Anirudh
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh  >
> > > > wrote:
> > > > > > >
> > > > > > >> Hi Pedro and Naveen,
> > > > > > >>
> > > > > > >> Is this issue reproducible when MXNet is built with
> > > > > > >> USE_MKLDNN=0?
> > > > > > >> Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > > > > >> release branch. Is this issue reproducible on the release branch?
> > > > > > >> In my opinion, since we have marked MKLDNN as an experimental
> > > > > > >> feature for the release, 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Anirudh
Hi Naveen,

You raise a good point, and I agree that MKLDNN should be switched off by
default.
Because of a bug in CMakeLists.txt, which has been fixed as part of #10731
(merged to master but not to the release branch),
users won't actually have MKLDNN enabled even though it is set to ON by
default.
Since the cmake install instructions have not been published on mxnet.io,
and cmake is less used than the pip package and make from what I have seen,
would it be acceptable to you if this is added as a known issue and a
workaround is provided in the release notes?
The impacted users (who will have to use the workaround) would be the
customers who are interested in MKLDNN and use cmake; users who aren't
interested in the MKLDNN feature won't be impacted.
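
As a strawman for the release notes, the workaround for those users could be
as simple as building with make instead of cmake (hedged sketch; flags taken
from the reproduction recipe quoted elsewhere in this thread):

# build from the 1.2.0.rc2 source with MKLDNN explicitly enabled via make
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
# alternatively, cherry-pick the #10731 fix onto the release tag and use cmake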

Anirudh

On Thu, May 3, 2018 at 12:16 PM, Marco de Abreu <
marco.g.ab...@googlemail.com> wrote:

> The MKLDNN tests are not really less stable than the other tests. It's
> pretty much the same across all tests we have. So I wouldn't say there's a
> need to fix them in a separate branch.
>
> On Thu, May 3, 2018 at 9:00 PM, Naveen Swamy  wrote:
>
> > I also meant (but forgot to send) that we stabilize it on a separate
> > branch and then bring in the changes instead of blocking the PRs.
> >
> > On Thu, May 3, 2018 at 11:57 AM, Marco de Abreu <
> > marco.g.ab...@googlemail.com> wrote:
> >
> > > I think the failing tests are really becoming an issue. We now have
> > > roughly 50 test-failure-related issues [1], leading to an average
> > > failure rate of 50%. Considering the costs in terms of money and time
> > > per run, this is adding up quite significantly.
> > >
> > > Didn't we just remove MKLML from our codebase to replace it with
> MKLDNN?
> > I
> > > think removing something and marking the replacement as experimental
> > could
> > > be difficult from a user perspective. Personally, I don't really feel
> > > comfortable solving the problem of known issues by marking something as
> > > experimental. We're basically shifting the responsibility to our users
> > that
> > > way.
> > >
> > > I don't think we should stop testing MKLDNN in our CI. We already had
> the
> > > situation a few months ago where the solution to failed tests was to
> > > disable them. We shouldn't go back to that.
> > >
> > > -Marco
> > >
> > > [1]:
> > > https://github.com/apache/incubator-mxnet/issues?q=is%
> > > 3Aopen+is%3Aissue+label%3ATest
> > >
> > > On Thu, May 3, 2018 at 8:46 PM, Naveen Swamy 
> wrote:
> > >
> > > > USE_MKLDNN is set to ON in the cmake file by default; since it's
> > > > experimental, can we turn it OFF so there is some determinism when
> > > > users build and test?
> > > >
> > > > https://github.com/apache/incubator-mxnet/blob/
> > > > 60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On a separate note, since MKLDNN is experimental, can we stop
> > > > building it on master and causing PRs to queue up?
> > > >
> > > >
> > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh 
> > wrote:
> > > >
> > > > > Correction: I was able to reproduce the issue with MKLDNN enabled
> on
> > > > > master, but not on 1.2 branch.
> > > > >
> > > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh 
> > > wrote:
> > > > >
> > > > > > Hi Pedro and Naveen,
> > > > > >
> > > > > > I am unable to reproduce this issue with MKLDNN on the master but
> > not
> > > > on
> > > > > > the 1.2.RC2 branch.
> > > > > >
> > > > > > Did the following on 1.2.RC2 branch:
> > > > > >
> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> USE_DIST_KVSTORE=0
> > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > export MXNET_TEST_SEED=11
> > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > export MXNET_TEST_COUNT=10000
> > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > > > > >
> > > > > > Was able to do the 10k runs successfully.
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> > > wrote:
> > > > > >
> > > > > >> Hi Pedro and Naveen,
> > > > > >>
> > > > > >> Is this issue reproducible when MXNet is built with
> > > > > >> USE_MKLDNN=0?
> > > > > >> Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > > > >> release branch. Is this issue reproducible on the release branch?
> > > > > >> In my opinion, since we have marked MKLDNN as an experimental
> > > > > >> feature for the release, if it is confirmed to be an MKLDNN issue
> > > > > >> we don't need to block the release on it.
> > > > > >>
> > > > > >> Anirudh
> > > > > >>
> > > > > >> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <
> mnnav...@gmail.com>
> > > > > wrote:
> > > > > >>
> > > > > >>> Thanks for raising this issue Pedro.
> > > > > >>>
> > > > > >>> -1 (binding)
> > > > > 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Marco de Abreu
The MKLDNN tests are not really less stable than the other tests. It's
pretty much the same across all tests we have. So I wouldn't say there's a
need to fix them in a separate branch.

On Thu, May 3, 2018 at 9:00 PM, Naveen Swamy  wrote:

> I also meant (but forgot to send) that we stabilize it on a separate branch
> and then bring in the changes instead of blocking the PRs.
>
> On Thu, May 3, 2018 at 11:57 AM, Marco de Abreu <
> marco.g.ab...@googlemail.com> wrote:
>
> > I think the failing tests are really becoming an issue. We now have roughly
> > 50 test-failure-related issues [1], leading to an average failure rate of
> > 50%. Considering the costs in terms of money and time per run, this is
> > adding up quite significantly.
> >
> > Didn't we just remove MKLML from our codebase to replace it with MKLDNN?
> I
> > think removing something and marking the replacement as experimental
> could
> > be difficult from a user perspective. Personally, I don't really feel
> > comfortable solving the problem of known issues by marking something as
> > experimental. We're basically shifting the responsibility to our users
> that
> > way.
> >
> > I don't think we should stop testing MKLDNN in our CI. We already had the
> > situation a few months ago where the solution to failed tests was to
> > disable them. We shouldn't go back to that.
> >
> > -Marco
> >
> > [1]:
> > https://github.com/apache/incubator-mxnet/issues?q=is%
> > 3Aopen+is%3Aissue+label%3ATest
> >
> > On Thu, May 3, 2018 at 8:46 PM, Naveen Swamy  wrote:
> >
> > > USE_MKLDNN is set to ON in the cmake file by default; since it's
> > > experimental, can we turn it OFF so there is some determinism when users
> > > build and test?
> > >
> > > https://github.com/apache/incubator-mxnet/blob/
> > > 60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23
> > >
> > >
> > >
> > >
> > >
> > >
> > > On a separate note, since MKLDNN is experimental, can we stop building
> > > it on master and causing PRs to queue up?
> > >
> > >
> > > On Thu, May 3, 2018 at 11:33 AM, Anirudh 
> wrote:
> > >
> > > > Correction: I was able to reproduce the issue with MKLDNN enabled on
> > > > master, but not on 1.2 branch.
> > > >
> > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh 
> > wrote:
> > > >
> > > > > Hi Pedro and Naveen,
> > > > >
> > > > > I am unable to reproduce this issue with MKLDNN on the master but
> not
> > > on
> > > > > the 1.2.RC2 branch.
> > > > >
> > > > > Did the following on 1.2.RC2 branch:
> > > > >
> > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > export MXNET_TEST_SEED=11
> > > > > export MXNET_MODULE_SEED=812478194
> > > > > export MXNET_TEST_COUNT=10000
> > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > > > >
> > > > > Was able to do the 10k runs successfully.
> > > > >
> > > > > Anirudh
> > > > >
> > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> > wrote:
> > > > >
> > > > >> Hi Pedro and Naveen,
> > > > >>
> > > > >> Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > >> Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > > >> release branch. Is this issue reproducible on the release branch?
> > > > >> In my opinion, since we have marked MKLDNN as an experimental feature
> > > > >> for the release, if it is confirmed to be an MKLDNN issue
> > > > >> we don't need to block the release on it.
> > > > >>
> > > > >> Anirudh
> > > > >>
> > > > >> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> > > > wrote:
> > > > >>
> > > > >>> Thanks for raising this issue Pedro.
> > > > >>>
> > > > >>> -1 (binding)
> > > > >>>
> > > > >>> We were in a similar state for a while a year ago; a lot of effort
> > > > >>> went into stabilizing the tests and the CI. I have seen that PR
> > > > >>> builds are non-deterministic and you have to retry over and over
> > > > >>> (wasting resources and time) and hope you get lucky.
> > > > >>>
> > > > >>> Look at the dashboard for the master build:
> > > > >>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > > > >>>
> > > > >>> -Naveen
> > > > >>>
> > > > >>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > > > >>> pedro.larroy.li...@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>> > -1: nondeterministic failures on CI master:
> > > > >>> > https://issues.apache.org/jira/browse/MXNET-396
> > > > >>> >
> > > > >>> > Was able to reproduce once in a fresh p3 instance with DLAMI;
> > > > >>> > can't reproduce consistently.
> > > > >>> >
> > > > >>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh  >
> > > > wrote:
> > > > >>> >
> > > > >>> > > Hi all,
> > > > >>> > >
> > > > >>> > > As part of RC2 release, we have 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Naveen Swamy
I also meant (but forgot to send) that we stabilize it on a separate branch
and then bring in the changes instead of blocking the PRs.

On Thu, May 3, 2018 at 11:57 AM, Marco de Abreu <
marco.g.ab...@googlemail.com> wrote:

> I think the failing tests are really becoming an issue. We now have roughly
> 50 test-failure-related issues [1], leading to an average failure rate of
> 50%. Considering the costs in terms of money and time per run, this is
> adding up quite significantly.
>
> Didn't we just remove MKLML from our codebase to replace it with MKLDNN? I
> think removing something and marking the replacement as experimental could
> be difficult from a user perspective. Personally, I don't really feel
> comfortable solving the problem of known issues by marking something as
> experimental. We're basically shifting the responsibility to our users that
> way.
>
> I don't think we should stop testing MKLDNN in our CI. We already had the
> situation a few months ago where the solution to failed tests was to
> disable them. We shouldn't go back to that.
>
> -Marco
>
> [1]:
> https://github.com/apache/incubator-mxnet/issues?q=is%
> 3Aopen+is%3Aissue+label%3ATest
>
> On Thu, May 3, 2018 at 8:46 PM, Naveen Swamy  wrote:
>
> > USE_MKLDNN is set to ON in the cmake file by default; since it's
> > experimental, can we turn it OFF so there is some determinism when users
> > build and test?
> >
> > https://github.com/apache/incubator-mxnet/blob/
> > 60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23
> >
> >
> >
> >
> >
> >
> > On a separate note, since MKLDNN is experimental, can we stop building it
> > on master and causing PRs to queue up?
> >
> >
> > On Thu, May 3, 2018 at 11:33 AM, Anirudh  wrote:
> >
> > > Correction: I was able to reproduce the issue with MKLDNN enabled on
> > > master, but not on 1.2 branch.
> > >
> > > On Thu, May 3, 2018 at 11:33 AM, Anirudh 
> wrote:
> > >
> > > > Hi Pedro and Naveen,
> > > >
> > > > I am unable to reproduce this issue with MKLDNN on the master but not
> > on
> > > > the 1.2.RC2 branch.
> > > >
> > > > Did the following on 1.2.RC2 branch:
> > > >
> > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > export MXNET_TEST_SEED=11
> > > > export MXNET_MODULE_SEED=812478194
> > > > export MXNET_TEST_COUNT=10000
> > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > > >
> > > > Was able to do the 10k runs successfully.
> > > >
> > > > Anirudh
> > > >
> > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> wrote:
> > > >
> > > >> Hi Pedro and Naveen,
> > > >>
> > > >> Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > >> Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > >> release branch. Is this issue reproducible on the release branch?
> > > >> In my opinion, since we have marked MKLDNN as an experimental feature
> > > >> for the release, if it is confirmed to be an MKLDNN issue
> > > >> we don't need to block the release on it.
> > > >>
> > > >> Anirudh
> > > >>
> > > >> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> > > wrote:
> > > >>
> > > >>> Thanks for raising this issue Pedro.
> > > >>>
> > > >>> -1 (binding)
> > > >>>
> > > >>> We were in a similar state for a while a year ago; a lot of effort
> > > >>> went into stabilizing the tests and the CI. I have seen that PR
> > > >>> builds are non-deterministic and you have to retry over and over
> > > >>> (wasting resources and time) and hope you get lucky.
> > > >>>
> > > >>> Look at the dashboard for the master build:
> > > >>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > > >>>
> > > >>> -Naveen
> > > >>>
> > > >>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > > >>> pedro.larroy.li...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> > -1: nondeterministic failures on CI master:
> > > >>> > https://issues.apache.org/jira/browse/MXNET-396
> > > >>> >
> > > >>> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> > > >>> > reproduce consistently.
> > > >>> >
> > > >>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> > > wrote:
> > > >>> >
> > > >>> > > Hi all,
> > > >>> > >
> > > >>> > > As part of RC2 release, we have addressed bugs and some
> concerns
> > > that
> > > >>> > were
> > > >>> > > raised.
> > > >>> > >
> > > >>> > > I would like to propose a vote to release Apache MXNet
> > (incubating)
> > > >>> > version
> > > >>> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end
> at
> > > >>> 12:50 PM
> > > >>> > > PDT, Sunday, May 6th.
> > > >>> > >
> > > >>> > > Link to release notes:
> > > >>> > > https://cwiki.apache.org/confluence/display/MXNET/
> > > >>> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > >>> > >
> 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Marco de Abreu
I think the failing tests are really becoming an issue. We now have roughly
50 test-failure-related issues [1], leading to an average failure rate of
50%. Considering the costs in terms of money and time per run, this is
adding up quite significantly.

Didn't we just remove MKLML from our codebase to replace it with MKLDNN? I
think removing something and marking the replacement as experimental could
be difficult from a user perspective. Personally, I don't really feel
comfortable solving the problem of known issues by marking something as
experimental. We're basically shifting the responsibility to our users that
way.

I don't think we should stop testing MKLDNN in our CI. We already had the
situation a few months ago where the solution to failed tests was to
disable them. We shouldn't go back to that.

-Marco

[1]:
https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3ATest

On Thu, May 3, 2018 at 8:46 PM, Naveen Swamy  wrote:

> USE_MKLDNN is set to ON in the cmake file by default; since it's
> experimental, can we turn it OFF so there is some determinism when users
> build and test?
>
> https://github.com/apache/incubator-mxnet/blob/
> 60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23
>
>
>
>
>
>
> On a separate note, since MKLDNN is experimental, can we stop building it
> on master and causing PRs to queue up?
>
>
> On Thu, May 3, 2018 at 11:33 AM, Anirudh  wrote:
>
> > Correction: I was able to reproduce the issue with MKLDNN enabled on
> > master, but not on 1.2 branch.
> >
> > On Thu, May 3, 2018 at 11:33 AM, Anirudh  wrote:
> >
> > > Hi Pedro and Naveen,
> > >
> > > I am unable to reproduce this issue with MKLDNN on the master but not
> on
> > > the 1.2.RC2 branch.
> > >
> > > Did the following on 1.2.RC2 branch:
> > >
> > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > export MXNET_TEST_SEED=11
> > > export MXNET_MODULE_SEED=812478194
> > > export MXNET_TEST_COUNT=10000
> > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > >
> > > Was able to do the 10k runs successfully.
> > >
> > > Anirudh
> > >
> > > On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:
> > >
> > >> Hi Pedro and Naveen,
> > >>
> > >> Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > >> Also, there are a bunch of MKLDNN fixes that didn't go into the
> > >> release branch. Is this issue reproducible on the release branch?
> > >> In my opinion, since we have marked MKLDNN as an experimental feature
> > >> for the release, if it is confirmed to be an MKLDNN issue
> > >> we don't need to block the release on it.
> > >>
> > >> Anirudh
> > >>
> > >> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> > wrote:
> > >>
> > >>> Thanks for raising this issue Pedro.
> > >>>
> > >>> -1 (binding)
> > >>>
> > >>> We were in a similar state for a while a year ago; a lot of effort
> > >>> went into stabilizing the tests and the CI. I have seen that PR builds
> > >>> are non-deterministic and you have to retry over and over (wasting
> > >>> resources and time) and hope you get lucky.
> > >>>
> > >>> Look at the dashboard for the master build:
> > >>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > >>>
> > >>> -Naveen
> > >>>
> > >>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > >>> pedro.larroy.li...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > -1: nondeterministic failures on CI master:
> > >>> > https://issues.apache.org/jira/browse/MXNET-396
> > >>> >
> > >>> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> > >>> > reproduce consistently.
> > >>> >
> > >>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> > wrote:
> > >>> >
> > >>> > > Hi all,
> > >>> > >
> > >>> > > As part of RC2 release, we have addressed bugs and some concerns
> > that
> > >>> > were
> > >>> > > raised.
> > >>> > >
> > >>> > > I would like to propose a vote to release Apache MXNet
> (incubating)
> > >>> > version
> > >>> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> > >>> 12:50 PM
> > >>> > > PDT, Sunday, May 6th.
> > >>> > >
> > >>> > > Link to release notes:
> > >>> > > https://cwiki.apache.org/confluence/display/MXNET/
> > >>> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > >>> > >
> > >>> > > Link to release candidate 1.2.0.rc2:
> > >>> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > >>> > >
> > >>> > > Voting results for 1.2.0.rc2:
> > >>> > > https://lists.apache.org/thread.html/
> > ebe561c609a8e32351dfe4aafc8876
> > >>> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > >>> > >
> > >>> > > View this page, click on "Build from Source", and use the source
> > code
> > >>> > > obtained from 1.2.0.rc2 tag:
> > >>> > > 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Naveen Swamy
USE_MKLDNN is set to ON in the cmake file by default; since it's
experimental, can we turn it OFF so there is some determinism when users
build and test?

https://github.com/apache/incubator-mxnet/blob/60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23






On a separate note, since MKLDNN is experimental, can we stop building it
on master and causing PRs to queue up?
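
If the default were flipped to OFF, users who still want the experimental
backend could opt in explicitly, e.g. (a sketch; it assumes the flag is
consumed the usual way via cmake -D, and an out-of-source build):

mkdir -p build && cd build
cmake -DUSE_MKLDNN=ON ..   # explicit opt-in once the default is OFF
make -j $(nproc)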


On Thu, May 3, 2018 at 11:33 AM, Anirudh  wrote:

> Correction: I was able to reproduce the issue with MKLDNN enabled on
> master, but not on 1.2 branch.
>
> On Thu, May 3, 2018 at 11:33 AM, Anirudh  wrote:
>
> > Hi Pedro and Naveen,
> >
> > I am unable to reproduce this issue with MKLDNN on the master but not on
> > the 1.2.RC2 branch.
> >
> > Did the following on 1.2.RC2 branch:
> >
> > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > export MXNET_TEST_SEED=11
> > export MXNET_MODULE_SEED=812478194
> > export MXNET_TEST_COUNT=10000
> > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> >
> > Was able to do the 10k runs successfully.
> >
> > Anirudh
> >
> > On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:
> >
> >> Hi Pedro and Naveen,
> >>
> >> Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> >> Also, there are a bunch of MKLDNN fixes that didn't go into the release
> >> branch. Is this issue reproducible on the release branch?
> >> In my opinion, since we have marked MKLDNN as an experimental feature
> >> for the release, if it is confirmed to be an MKLDNN issue
> >> we don't need to block the release on it.
> >>
> >> Anirudh
> >>
> >> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> wrote:
> >>
> >>> Thanks for raising this issue Pedro.
> >>>
> >>> -1 (binding)
> >>>
> >>> We were in a similar state for a while a year ago; a lot of effort
> >>> went into stabilizing the tests and the CI. I have seen that PR builds
> >>> are non-deterministic and you have to retry over and over (wasting
> >>> resources and time) and hope you get lucky.
> >>>
> >>> Look at the dashboard for the master build:
> >>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> >>>
> >>> -Naveen
> >>>
> >>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> >>> pedro.larroy.li...@gmail.com>
> >>> wrote:
> >>>
> >>> > -1: nondeterministic failures on CI master:
> >>> > https://issues.apache.org/jira/browse/MXNET-396
> >>> >
> >>> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> >>> > reproduce consistently.
> >>> >
> >>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> wrote:
> >>> >
> >>> > > Hi all,
> >>> > >
> >>> > > As part of RC2 release, we have addressed bugs and some concerns
> that
> >>> > were
> >>> > > raised.
> >>> > >
> >>> > > I would like to propose a vote to release Apache MXNet (incubating)
> >>> > version
> >>> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> >>> 12:50 PM
> >>> > > PDT, Sunday, May 6th.
> >>> > >
> >>> > > Link to release notes:
> >>> > > https://cwiki.apache.org/confluence/display/MXNET/
> >>> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> >>> > >
> >>> > > Link to release candidate 1.2.0.rc2:
> >>> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> >>> > >
> >>> > > Voting results for 1.2.0.rc2:
> >>> > > https://lists.apache.org/thread.html/
> ebe561c609a8e32351dfe4aafc8876
> >>> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> >>> > >
> >>> > > View this page, click on "Build from Source", and use the source
> code
> >>> > > obtained from 1.2.0.rc2 tag:
> >>> > > https://mxnet.incubator.apache.org/install/index.html
> >>> > >
> >>> > > (Note: The README.md points to the 1.2.0 tag and does not work at
> the
> >>> > > moment.)
> >>> > >
> >>> > > Please remember to test first before voting accordingly:
> >>> > >
> >>> > > +1 = approve
> >>> > > +0 = no opinion
> >>> > > -1 = disapprove (provide reason)
> >>> > >
> >>> > > Anirudh
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Anirudh
Correction: I was able to reproduce the issue with MKLDNN enabled on
master, but not on 1.2 branch.

On Thu, May 3, 2018 at 11:33 AM, Anirudh  wrote:

> Hi Pedro and Naveen,
>
> I am unable to reproduce this issue with MKLDNN on the master but not on
> the 1.2.RC2 branch.
>
> Did the following on 1.2.RC2 branch:
>
> make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> export MXNET_TEST_SEED=11
> export MXNET_MODULE_SEED=812478194
> export MXNET_TEST_COUNT=10000
> nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
>
> Was able to do the 10k runs successfully.
>
> Anirudh
>
> On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:
>
>> Hi Pedro and Naveen,
>>
>> Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
>> Also, there are a bunch of MKLDNN fixes that didn't go into the release
>> branch. Is this issue reproducible on the release branch?
>> In my opinion, since we have marked MKLDNN as an experimental feature for
>> the release, if it is confirmed to be an MKLDNN issue
>> we don't need to block the release on it.
>>
>> Anirudh
>>
>> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy  wrote:
>>
>>> Thanks for raising this issue Pedro.
>>>
>>> -1 (binding)
>>>
>>> We were in a similar state for a while a year ago; a lot of effort went
>>> into stabilizing the tests and the CI. I have seen that PR builds are
>>> non-deterministic and you have to retry over and over (wasting resources
>>> and time) and hope you get lucky.
>>>
>>> Look at the dashboard for the master build:
>>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
>>>
>>> -Naveen
>>>
>>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
>>> pedro.larroy.li...@gmail.com>
>>> wrote:
>>>
>>> > -1: nondeterministic failures on CI master:
>>> > https://issues.apache.org/jira/browse/MXNET-396
>>> >
>>> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
>>> > reproduce consistently.
>>> >
>>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh  wrote:
>>> >
>>> > > Hi all,
>>> > >
>>> > > As part of RC2 release, we have addressed bugs and some concerns that
>>> > were
>>> > > raised.
>>> > >
>>> > > I would like to propose a vote to release Apache MXNet (incubating)
>>> > version
>>> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
>>> 12:50 PM
>>> > > PDT, Sunday, May 6th.
>>> > >
>>> > > Link to release notes:
>>> > > https://cwiki.apache.org/confluence/display/MXNET/
>>> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
>>> > >
>>> > > Link to release candidate 1.2.0.rc2:
>>> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
>>> > >
>>> > > Voting results for 1.2.0.rc2:
>>> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876
>>> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
>>> > >
>>> > > View this page, click on "Build from Source", and use the source code
>>> > > obtained from 1.2.0.rc2 tag:
>>> > > https://mxnet.incubator.apache.org/install/index.html
>>> > >
>>> > > (Note: The README.md points to the 1.2.0 tag and does not work at the
>>> > > moment.)
>>> > >
>>> > > Please remember to test first before voting accordingly:
>>> > >
>>> > > +1 = approve
>>> > > +0 = no opinion
>>> > > -1 = disapprove (provide reason)
>>> > >
>>> > > Anirudh
>>> > >
>>> >
>>>
>>
>>
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Anirudh
Hi Pedro and Naveen,

I am unable to reproduce this issue with MKLDNN on the master but not on
the 1.2.RC2 branch.

Did the following on 1.2.RC2 branch:

make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
export MXNET_TEST_SEED=11
export MXNET_MODULE_SEED=812478194
export MXNET_TEST_COUNT=10000
nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape

Was able to do the 10k runs successfully.
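
For anyone replaying this, a hedged gloss of those variables (semantics
assumed from the test harness's @with_seed() decorator in
tests/python/unittest/common.py; not re-verified against the 1.2 branch):

export MXNET_TEST_COUNT=10000       # repeat the decorated test this many times
export MXNET_TEST_SEED=11           # fix the per-test RNG seed (unset to randomize)
export MXNET_MODULE_SEED=812478194  # seed for module-level random setup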

Anirudh

On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:

> Hi Pedro and Naveen,
>
> Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> Also, there are a bunch of MKLDNN fixes that didn't go into the release
> branch. Is this issue reproducible on the release branch?
> In my opinion, since we have marked MKLDNN as an experimental feature for
> the release, if it is confirmed to be an MKLDNN issue
> we don't need to block the release on it.
>
> Anirudh
>
> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy  wrote:
>
>> Thanks for raising this issue Pedro.
>>
>> -1 (binding)
>>
>> We were in a similar state for a while a year ago; a lot of effort went
>> into stabilizing the tests and the CI. I have seen that PR builds are
>> non-deterministic and you have to retry over and over (wasting resources
>> and time) and hope you get lucky.
>>
>> Look at the dashboard for the master build:
>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
>>
>> -Naveen
>>
>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> wrote:
>>
>> > -1: nondeterministic failures on CI master:
>> > https://issues.apache.org/jira/browse/MXNET-396
>> >
>> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
>> > reproduce consistently.
>> >
>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh  wrote:
>> >
>> > > Hi all,
>> > >
>> > > As part of RC2 release, we have addressed bugs and some concerns that
>> > were
>> > > raised.
>> > >
>> > > I would like to propose a vote to release Apache MXNet (incubating)
>> > version
>> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
>> 12:50 PM
>> > > PDT, Sunday, May 6th.
>> > >
>> > > Link to release notes:
>> > > https://cwiki.apache.org/confluence/display/MXNET/
>> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
>> > >
>> > > Link to release candidate 1.2.0.rc2:
>> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
>> > >
>> > > Voting results for 1.2.0.rc2:
>> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876
>> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
>> > >
>> > > View this page, click on "Build from Source", and use the source code
>> > > obtained from 1.2.0.rc2 tag:
>> > > https://mxnet.incubator.apache.org/install/index.html
>> > >
>> > > (Note: The README.md points to the 1.2.0 tag and does not work at the
>> > > moment.)
>> > >
>> > > Please remember to test first before voting accordingly:
>> > >
>> > > +1 = approve
>> > > +0 = no opinion
>> > > -1 = disapprove (provide reason)
>> > >
>> > > Anirudh
>> > >
>> >
>>
>
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Anirudh
Hi Pedro and Naveen,

Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
Also, there are a bunch of MKLDNN fixes that didn't go into the release
branch. Is this issue reproducible on the release branch?
In my opinion, since we have marked MKLDNN as an experimental feature for
the release, if it is confirmed to be an MKLDNN issue
we don't need to block the release on it.

Anirudh

On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy  wrote:

> Thanks for raising this issue Pedro.
>
> -1 (binding)
>
> We were in a similar state for a while a year ago; a lot of effort went
> into stabilizing the tests and the CI. I have seen that PR builds are
> non-deterministic and you have to retry over and over (wasting resources
> and time) and hope you get lucky.
>
> Look at the dashboard for the master build:
> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
>
> -Naveen
>
> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy  >
> wrote:
>
> > -1: nondeterministic failures on CI master:
> > https://issues.apache.org/jira/browse/MXNET-396
> >
> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> > reproduce consistently.
> >
> > On Wed, May 2, 2018 at 9:51 PM, Anirudh  wrote:
> >
> > > Hi all,
> > >
> > > As part of RC2 release, we have addressed bugs and some concerns that
> > were
> > > raised.
> > >
> > > I would like to propose a vote to release Apache MXNet (incubating)
> > version
> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at 12:50
> PM
> > > PDT, Sunday, May 6th.
> > >
> > > Link to release notes:
> > > https://cwiki.apache.org/confluence/display/MXNET/
> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > >
> > > Link to release candidate 1.2.0.rc2:
> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > >
> > > Voting results for 1.2.0.rc2:
> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876
> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > >
> > > View this page, click on "Build from Source", and use the source code
> > > obtained from 1.2.0.rc2 tag:
> > > https://mxnet.incubator.apache.org/install/index.html
> > >
> > > (Note: The README.md points to the 1.2.0 tag and does not work at the
> > > moment.)
> > >
> > > Please remember to test first before voting accordingly:
> > >
> > > +1 = approve
> > > +0 = no opinion
> > > -1 = disapprove (provide reason)
> > >
> > > Anirudh
> > >
> >
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Naveen Swamy
Thanks for raising this issue Pedro.

-1 (binding)

We were in a similar state for a while a year ago; a lot of effort went into
stabilizing the tests and the CI. I have seen that PR builds are
non-deterministic and you have to retry over and over (wasting resources
and time) and hope you get lucky.

Look at the dashboard for the master build:
http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/

-Naveen

On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy 
wrote:

> -1: nondeterministic failures on CI master:
> https://issues.apache.org/jira/browse/MXNET-396
>
> Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> reproduce consistently.
>
> On Wed, May 2, 2018 at 9:51 PM, Anirudh  wrote:
>
> > Hi all,
> >
> > As part of RC2 release, we have addressed bugs and some concerns that
> were
> > raised.
> >
> > I would like to propose a vote to release Apache MXNet (incubating)
> version
> > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at 12:50 PM
> > PDT, Sunday, May 6th.
> >
> > Link to release notes:
> > https://cwiki.apache.org/confluence/display/MXNET/
> > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> >
> > Link to release candidate 1.2.0.rc2:
> > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> >
> > Voting results for 1.2.0.rc2:
> > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876
> > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> >
> > View this page, click on "Build from Source", and use the source code
> > obtained from 1.2.0.rc2 tag:
> > https://mxnet.incubator.apache.org/install/index.html
> >
> > (Note: The README.md points to the 1.2.0 tag and does not work at the
> > moment.)
> >
> > Please remember to test first before voting accordingly:
> >
> > +1 = approve
> > +0 = no opinion
> > -1 = disapprove (provide reason)
> >
> > Anirudh
> >
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-03 Thread Pedro Larroy
-1: nondeterministic failures on CI master:
https://issues.apache.org/jira/browse/MXNET-396

Was able to reproduce once in a fresh p3 instance with DLAMI; can't
reproduce consistently.

On Wed, May 2, 2018 at 9:51 PM, Anirudh  wrote:

> Hi all,
>
> As part of RC2 release, we have addressed bugs and some concerns that were
> raised.
>
> I would like to propose a vote to release Apache MXNet (incubating) version
> 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at 12:50 PM
> PDT, Sunday, May 6th.
>
> Link to release notes:
> https://cwiki.apache.org/confluence/display/MXNET/
> Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
>
> Link to release candidate 1.2.0.rc2:
> https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
>
> Voting results for 1.2.0.rc2:
> https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876
> 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
>
> View this page, click on "Build from Source", and use the source code
> obtained from 1.2.0.rc2 tag:
> https://mxnet.incubator.apache.org/install/index.html
>
> (Note: The README.md points to the 1.2.0 tag and does not work at the
> moment.)
>
> Please remember to test first before voting accordingly:
>
> +1 = approve
> +0 = no opinion
> -1 = disapprove (provide reason)
>
> Anirudh
>