The MKLDNN tests are not really less stable than the other tests. It's pretty much the same across all tests we have. So I wouldn't say there's a need to fix them in a separate branch.
On Thu, May 3, 2018 at 9:00 PM, Naveen Swamy <[email protected]> wrote: > I also meant(but forgot to send), we stabilize it on a separate branch and > then bring in the changes instead of blocking the PRs. > > On Thu, May 3, 2018 at 11:57 AM, Marco de Abreu < > [email protected]> wrote: > > > I think the failing tests are really getting an issue. We now got roughly > > 50 test failure related issues [1], leading to a average failure rate of > > 50%. Considering the costs in terms of money and time per run, this is > > adding up quite significantly. > > > > Didn't we just remove MKLML from our codebase to replace it with MKLDNN? > I > > think removing something and marking the replacement as experimental > could > > be difficult from a user perspective. Personally, I don't really feel > > comfortable solving the problem of known issues by marking something as > > experimental. We're basically shifting the responsibility to our users > that > > way. > > > > I don't think we should stop testing MKLDNN in our CI. We already had the > > situation a few months ago where the solution to failed tests was to > > disable them. We shouldn't go back to that. > > > > -Marco > > > > [1]: > > https://github.com/apache/incubator-mxnet/issues?q=is% > > 3Aopen+is%3Aissue+label%3ATest > > > > On Thu, May 3, 2018 at 8:46 PM, Naveen Swamy <[email protected]> wrote: > > > > > USE_MKLDNN is set to ON in the cmake file by default, since its > > > experimental can we turn OFF so there is some determinism when users > > build > > > and test. > > > > > > https://github.com/apache/incubator-mxnet/blob/ > > > 60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23 > > > > > > > > > > > > > > > > > > > > > On a separate note, since MKLDNN is experimental can we stop building > on > > > master and cause PR's to queue up. > > > > > > > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh <[email protected]> > wrote: > > > > > > > Correction: I was able to reproduce the issue with MKLDNN enabled on > > > > master, but not on 1.2 branch. > > > > > > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh <[email protected]> > > wrote: > > > > > > > > > Hi Pedro and Naveen, > > > > > > > > > > I am unable to reproduce this issue with MKLDNN on the master but > not > > > on > > > > > the 1.2.RC2 branch. > > > > > > > > > > Did the following on 1.2.RC2 branch: > > > > > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1 > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0 > > > > > export MXNET_TEST_SEED=11 > > > > > export MXNET_MODULE_SEED=812478194 > > > > > export MXNET_TEST_COUNT=10000 > > > > > nosetests-2.7 -v tests/python/unittest/test_ > > > > module.py:test_forward_reshape > > > > > > > > > > Was able to do the 10k runs successfully. > > > > > > > > > > Anirudh > > > > > > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <[email protected]> > > wrote: > > > > > > > > > >> Hi Pedro and Naveen, > > > > >> > > > > >> Is this issue reproducible when MXNet is built with USE_MKLDNN=0? > > > > >> Also, there are a bunch of MKLDNN fixes that didn't go into the > > > release > > > > >> branch. Is this issue reproducible on the release branch ? > > > > >> In my opinion, since we have marked MKLDNN as experimental feature > > for > > > > >> the release, if it is confirmed to be a MKLDNN issue > > > > >> we don't need to block the release on it. > > > > >> > > > > >> Anirudh > > > > >> > > > > >> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <[email protected]> > > > > wrote: > > > > >> > > > > >>> Thanks for raising this issue Pedro. > > > > >>> > > > > >>> -1(binding) > > > > >>> > > > > >>> We were in a similar state for a while a year ago, a lot of > effort > > > went > > > > >>> to > > > > >>> stabilize the tests and the CI. I have seen the PR builds are > > > > >>> non-deterministic and you have to retry over and over (wasting > > > > resources > > > > >>> and time) and hope you get lucky. > > > > >>> > > > > >>> Look at the dashboard for master build > > > > >>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator- > > > mxnet/job/master/ > > > > >>> > > > > >>> -Naveen > > > > >>> > > > > >>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy < > > > > >>> [email protected]> > > > > >>> wrote: > > > > >>> > > > > >>> > -1 nondeterminisitc failures on CI master: > > > > >>> > https://issues.apache.org/jira/browse/MXNET-396 > > > > >>> > > > > > >>> > Was able to reproduce once in a fresh p3 instance with DLAMI > > can't > > > > >>> > reproduce consistently. > > > > >>> > > > > > >>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <[email protected] > > > > > > wrote: > > > > >>> > > > > > >>> > > Hi all, > > > > >>> > > > > > > >>> > > As part of RC2 release, we have addressed bugs and some > > concerns > > > > that > > > > >>> > were > > > > >>> > > raised. > > > > >>> > > > > > > >>> > > I would like to propose a vote to release Apache MXNet > > > (incubating) > > > > >>> > version > > > > >>> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end > > at > > > > >>> 12:50 PM > > > > >>> > > PDT, Sunday, May 6th. > > > > >>> > > > > > > >>> > > Link to release notes: > > > > >>> > > https://cwiki.apache.org/confluence/display/MXNET/ > > > > >>> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes > > > > >>> > > > > > > >>> > > Link to release candidate 1.2.0.rc2: > > > > >>> > > https://github.com/apache/incubator-mxnet/releases/tag/ > > 1.2.0.rc2 > > > > >>> > > > > > > >>> > > Voting results for 1.2.0.rc2: > > > > >>> > > https://lists.apache.org/thread.html/ > > > > ebe561c609a8e32351dfe4aafc8876 > > > > >>> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E > > > > >>> > > > > > > >>> > > View this page, click on "Build from Source", and use the > > source > > > > code > > > > >>> > > obtained from 1.2.0.rc2 tag: > > > > >>> > > https://mxnet.incubator.apache.org/install/index.html > > > > >>> > > > > > > >>> > > (Note: The README.md points to the 1.2.0 tag and does not > work > > at > > > > the > > > > >>> > > moment.) > > > > >>> > > > > > > >>> > > Please remember to test first before voting accordingly: > > > > >>> > > > > > > >>> > > +1 = approve > > > > >>> > > +0 = no opinion > > > > >>> > > -1 = disapprove (provide reason) > > > > >>> > > > > > > >>> > > Anirudh > > > > >>> > > > > > > >>> > > > > > >>> > > > > >> > > > > >> > > > > > > > > > > > > > > >
