I see your point. I checked the failures on the v1.2.0 branch and I don't see segfaults, just minor failures due to flaky tests.
I will trigger it a few more times until Sunday and change my vote accordingly.

http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/

Pedro.

On Fri, May 4, 2018 at 7:16 PM, Anirudh <anirudh2...@gmail.com> wrote:

> Hi Pedro,
>
> Thank you for the suggestions. I will try to reproduce this without fixed
> seeds and also run it for a longer time duration.
> Having said that, running unit tests over and over for a couple of days
> will likely cause problems because there are around 42 open issues for
> flaky tests:
> https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> Also, the release branch has diverged from master around 3 weeks back and
> it doesn't have many of the changes merged to the master.
> So, my question essentially is, what will be your benchmark to accept the
> release?
> Is it that we run the test which you provided on 1.2 without fixed seeds
> and for a longer duration without failures?
> Or is it that all unit tests should pass over a period of 2 days without
> issues? This may require fixing all of the flaky tests, which would delay
> the release by a considerable amount of time.
> Or is it something else?
>
> Anirudh
>
> On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>
> > Could you remove the fixed seeds and run it for a couple of hours with an
> > additional loop? Also I would suggest running the unit tests over and over
> > for a couple of days if possible.
> >
> > Pedro.
> >
> > On Thu, May 3, 2018 at 8:33 PM, Anirudh <anirudh2...@gmail.com> wrote:
> >
> > > Hi Pedro and Naveen,
> > >
> > > I am able to reproduce this issue with MKLDNN on the master but not on
> > > the 1.2.RC2 branch.
> > >
> > > Did the following on 1.2.RC2 branch:
> > >
> > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > export MXNET_TEST_SEED=11
> > > export MXNET_MODULE_SEED=812478194
> > > export MXNET_TEST_COUNT=10000
> > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > >
> > > Was able to do the 10k runs successfully.
> > >
> > > Anirudh
> > >
> > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2...@gmail.com> wrote:
> > >
> > > > Hi Pedro and Naveen,
> > > >
> > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > Also, there are a bunch of MKLDNN fixes that didn't go into the release
> > > > branch. Is this issue reproducible on the release branch?
> > > > In my opinion, since we have marked MKLDNN as an experimental feature for
> > > > the release, if it is confirmed to be an MKLDNN issue
> > > > we don't need to block the release on it.
> > > >
> > > > Anirudh
> > > >
> > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <mnnav...@gmail.com> wrote:
> > > >
> > > >> Thanks for raising this issue Pedro.
> > > >>
> > > >> -1(binding)
> > > >>
> > > >> We were in a similar state for a while a year ago, a lot of effort went to
> > > >> stabilize the tests and the CI. I have seen the PR builds are
> > > >> non-deterministic and you have to retry over and over (wasting resources
> > > >> and time) and hope you get lucky.
> > > >>
> > > >> Look at the dashboard for master build
> > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > > >>
> > > >> -Naveen
> > > >>
> > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > >>
> > > >> > -1 nondeterministic failures on CI master:
> > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > > >> >
> > > >> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> > > >> > reproduce consistently.
> > > >> >
> > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > > >> >
> > > >> > > Hi all,
> > > >> > >
> > > >> > > As part of RC2 release, we have addressed bugs and some concerns that
> > > >> > > were raised.
> > > >> > >
> > > >> > > I would like to propose a vote to release Apache MXNet (incubating)
> > > >> > > version 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> > > >> > > 12:50 PM PDT, Sunday, May 6th.
> > > >> > >
> > > >> > > Link to release notes:
> > > >> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > >> > >
> > > >> > > Link to release candidate 1.2.0.rc2:
> > > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > >> > >
> > > >> > > Voting results for 1.2.0.rc2:
> > > >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > >> > >
> > > >> > > View this page, click on "Build from Source", and use the source code
> > > >> > > obtained from 1.2.0.rc2 tag:
> > > >> > > https://mxnet.incubator.apache.org/install/index.html
> > > >> > >
> > > >> > > (Note: The README.md points to the 1.2.0 tag and does not work at the
> > > >> > > moment.)
> > > >> > >
> > > >> > > Please remember to test first before voting accordingly:
> > > >> > >
> > > >> > > +1 = approve
> > > >> > > +0 = no opinion
> > > >> > > -1 = disapprove (provide reason)
> > > >> > >
> > > >> > > Anirudh
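
For reference, the unseeded, repeated run requested earlier in the thread could be sketched roughly as follows. This is only an illustration under stated assumptions: `repeat_until_failure` is a hypothetical helper that does not appear in the thread, and the sketch assumes that leaving MXNET_TEST_SEED and MXNET_MODULE_SEED unset makes the test harness pick fresh seeds on each run. The nosetests command is the one quoted in the thread.

```shell
# Hypothetical helper (not part of the thread): rerun a command up to
# max_runs times and stop at the first failing run.
repeat_until_failure() {
    ruf_max_runs="$1"
    shift
    ruf_i=1
    while [ "$ruf_i" -le "$ruf_max_runs" ]; do
        if ! "$@"; then
            echo "run ${ruf_i} of ${ruf_max_runs} failed" >&2
            return 1
        fi
        ruf_i=$((ruf_i + 1))
    done
    echo "all ${ruf_max_runs} runs passed"
}

# Assumption: with the fixed seeds unset, each run draws new random seeds.
unset MXNET_TEST_SEED MXNET_MODULE_SEED

# Example call, left commented out because it needs a built MXNet checkout:
# repeat_until_failure 1000 nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
```

Stopping at the first failure keeps the log of the failing run at the end of the output, which should make it easier to recover the seed that triggered it.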