Hi Anirudh, yes, also tried that, didn't resolve. Looking into root cause and will update.
Best Regards

Lai Wei

https://www.linkedin.com/pub/lai-wei/2b/731/52b

On Mon, May 7, 2018 at 2:15 PM, Anirudh <anirudh2...@gmail.com> wrote:

> Hi Lai,
>
> I see that you used USE_MKL2017_EXPERIMENTAL=1, I am not sure if this is the right flag. Did you try USE_MKLDNN=1 ?
>
> Anirudh
>
> On Mon, May 7, 2018 at 1:22 PM, Lai Wei <roywei...@gmail.com> wrote:
>
> > Hi,
> >
> > I would like to raise an issue with mxnet-mkl. The keras-mxnet package was working fine with mxnet-mkl 1.1.0 for training on CPU. However, weights are not updated when I use mxnet-mkl 1.2.0b20180507. I tried both 'pip install mxnet-mkl --pre' and built from source from release branch (v1.2.0) with mkl flag.
> >
> > Please refer to this issue for more details:
> > https://github.com/awslabs/keras-apache-mxnet/issues/75
> >
> > There is no code change on keras-mxnet side, so I guess some API broke when using latest mxnet-mkl. Still working on finding the root cause.
> >
> > Thanks
> >
> > Best Regards
> >
> > Lai Wei
> >
> > https://www.linkedin.com/pub/lai-wei/2b/731/52b
> >
> > On Mon, May 7, 2018 at 10:38 AM, Haibin Lin <haibin.lin....@gmail.com> wrote:
> >
> > > +1 binding. Build from source with CUDA, ran linear classification example and works fine.
> > >
> > > Best.
> > > Haibin
> > >
> > > On Sun, May 6, 2018 at 10:08 PM, Steffen Rochel <steffenroc...@gmail.com> wrote:
> > >
> > > > +1 (non-binding). Tested with selected notebooks from The Straight Dope.
> > > > So many important enhancements everybody contributed and our users are waiting for. Hope we will see more votes.
> > > > Steffen
> > > >
> > > > On Mon, May 7, 2018 at 1:07 AM Anirudh <anirudh2...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Since we don't have enough binding votes yet, I am extending the vote till tomorrow (Monday May 7th), 12:50 PM PDT.
> > > > >
> > > > > Anirudh
> > > > >
> > > > > On Sun, May 6, 2018 at 4:05 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > > > >
> > > > > > Hi Pedro,
> > > > > >
> > > > > > Thanks for the clarification. I was able to reproduce the issue with USE_OPENMP=OFF. I wasn't able to reproduce the issue with Make. Since the issue is not reproducible with make and the customers using USE_OPENMP=OFF with cmake should be small, I agree with you that this should not be a blocker. I have added the issue to known issues in release notes:
> > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Sun, May 6, 2018 at 9:03 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > >
> > > > > > > Agreed, I was not aware that the problems were not present in the release branch.
> > > > > > >
> > > > > > > On Fri, May 4, 2018 at 8:32 PM, Haibin Lin <haibin.lin....@gmail.com> wrote:
> > > > > > >
> > > > > > > > I agree with Anirudh that the focus of the discussion should be limited to the release branch, not the master branch. Anything that breaks on master but works on release branch should not block the release itself.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > >
> > > > > > > > Haibin
> > > > > > > >
> > > > > > > > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I see your point.
> > > > > > > > >
> > > > > > > > > I checked the failures on the v1.2.0 branch and I don't see segfaults, just minor failures due to flaky tests.
> > > > > > > > >
> > > > > > > > > I will trigger it repeatedly a few times until Sunday to have a and change my vote accordingly.
> > > > > > > > >
> > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
> > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/
> > > > > > > > >
> > > > > > > > > Pedro.
> > > > > > > > >
> > > > > > > > > On Fri, May 4, 2018 at 7:16 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Pedro,
> > > > > > > > > >
> > > > > > > > > > Thank you for the suggestions. I will try to reproduce this without fixed seeds and also run it for a longer time duration.
> > > > > > > > > > Having said that, running unit tests over and over for a couple of days will likely cause problems because there are around 42 open issues for flaky tests:
> > > > > > > > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > > > > > > > > Also, the release branch has diverged from master around 3 weeks back and it doesn't have many of the changes merged to the master.
> > > > > > > > > > So, my question essentially is, what will be your benchmark to accept the release ?
> > > > > > > > > > Is it that we run the test which you provided on 1.2 without fixed seeds and for a longer duration without failures ?
> > > > > > > > > > Or is it that all unit tests should pass over a period of 2 days without issues. This may require fixing all of the flaky tests which would delay the release by a considerable amount of time.
> > > > > > > > > > Or is it something else ?
> > > > > > > > > >
> > > > > > > > > > Anirudh
> > > > > > > > > >
> > > > > > > > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Could you remove the fixed seeds and run it for a couple of hours with an additional loop? Also I would suggest running the unit tests over and over for a couple of days if possible.
> > > > > > > > > > >
> > > > > > > > > > > Pedro.
> > > > > > > > > > >
> > > > > > > > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Pedro and Naveen,
> > > > > > > > > > > >
> > > > > > > > > > > > I am unable to reproduce this issue with MKLDNN on the master but not on the 1.2.RC2 branch.
> > > > > > > > > > > >
> > > > > > > > > > > > Did the following on 1.2.RC2 branch:
> > > > > > > > > > > >
> > > > > > > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > > > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > > > > > > > export MXNET_TEST_SEED=11
> > > > > > > > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > > > > > > > export MXNET_TEST_COUNT=10000
> > > > > > > > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > > > > > > > > > > >
> > > > > > > > > > > > Was able to do the 10k runs successfully.
> > > > > > > > > > > >
> > > > > > > > > > > > Anirudh
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Pedro and Naveen,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > > > > > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the release branch. Is this issue reproducible on the release branch ?
> > > > > > > > > > > > > In my opinion, since we have marked MKLDNN as experimental feature for the release, if it is confirmed to be a MKLDNN issue we don't need to block the release on it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Anirudh
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <mnnav...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for raising this issue Pedro.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -1(binding)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We were in a similar state for a while a year ago, a lot of effort went to stabilize the tests and the CI. I have seen the PR builds are non-deterministic and you have to retry over and over (wasting resources and time) and hope you get lucky.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Look at the dashboard for master build
> > > > > > > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Naveen
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -1 nondeterministic failures on CI master:
> > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/MXNET-396
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Was able to reproduce once in a fresh p3 instance with DLAMI can't reproduce consistently.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, May 2, 2018 at 9:51 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > As part of RC2 release, we have addressed bugs and some concerns that were raised.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I would like to propose a vote to release Apache MXNet (incubating) version 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at 12:50 PM PDT, Sunday, May 6th.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Link to release notes:
> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Link to release candidate 1.2.0.rc2:
> > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Voting results for 1.2.0.rc2:
> > > > > > > > > > > > > > > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > View this page, click on "Build from Source", and use the source code obtained from 1.2.0.rc2 tag:
> > > > > > > > > > > > > > > > https://mxnet.incubator.apache.org/install/index.html
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > (Note: The README.md points to the 1.2.0 tag and does not work at the moment.)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Please remember to test first before voting accordingly:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +1 = approve
> > > > > > > > > > > > > > > > +0 = no opinion
> > > > > > > > > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Anirudh