Unfortunately, merging the following PR

Set correct update on kvstore flag in dist_device_sync mode (v1.3.x)
https://github.com/apache/incubator-mxnet/pull/13121

broke the `dist-kvstore tests CPU` test stage:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.3.x/82/pipeline

A revert PR has been opened:

Revert "Set correct update on kvstore flag in dist_device_sync mode (v1.3.x) (#13121)"
https://github.com/apache/incubator-mxnet/pull/13228

The test already passed, so the PR is good to go. The initial fix will not
be considered for the release and will get a mention in the known issues
section.

A version bump has been added to the release branch:

news, readme update for v1.3.1 release
https://github.com/apache/incubator-mxnet/pull/13225

Since patch releases are now done on branches, the master branch needs a
version update. The following PR introduces the change:

Bumped minor version to 1.4.0 as 1.3.1 will be continued in the v1.3x branch
https://github.com/apache/incubator-mxnet/pull/13231

The Confluence page 'Apache MXNet (incubating) 1.3.1 Release Notes' has been
updated:

https://cwiki.apache.org/confluence/x/eZGzBQ

Best
Anton

Sat, 10 Nov 2018 at 11:59, Anton Chernov <mecher...@gmail.com>:

> Due to various problems we had to postpone the tagging and vote for the
> release till Monday, the 12th of November 2018.
>
> The following change has been updated and is waiting to be merged:
>
> Disable flaky test test_operator.test_dropout (v1.3.x)
> https://github.com/apache/incubator-mxnet/pull/13200
>
> Indeed, the MACOS tests timed out for the branch as well. The proposed
> change thus contains only the build:
>
> [MXNET-908] Enable minimal OSX Travis build (v1.3.x)
> https://github.com/apache/incubator-mxnet/pull/13179
>
>
> Best
> Anton
>
> Fri, 9 Nov 2018 at 13:11, Anton Chernov <mecher...@gmail.com>:
>
>> I created the following PR to disable the test:
>>
>> Disable flaky test test_operator.test_dropout (v1.3.x)
>> https://github.com/apache/incubator-mxnet/pull/13200
>>
>> The second failure, I suppose, is related to:
>>
>> distributed kvstore bug in MXNet
>> https://github.com/apache/incubator-mxnet/issues/12713
>>
>> which was partially fixed by
>>
>> Set correct update on kvstore flag in dist_device_sync mode (v1.3.x)
>> https://github.com/apache/incubator-mxnet/pull/13121
>>
>> But another part of the issue is still open and does not have a fix yet:
>>
>> "When distributed kvstore is used, by default gluon.Trainer doesn't work
>> with mx.optimizer.LRScheduler if a worker has more than 1 GPU. To be more
>> specific, the trainer updates once per GPU, the LRScheduler object is
>> shared across GPUs and get a wrong update count."
>>
>>
>> Best
>> Anton
>>
>>
>> Fri, 9 Nov 2018 at 11:48, Anton Chernov <mecher...@gmail.com>:
>>
>>> In case the tests for MACOS time out as well, we can disable them
>>> and keep at least the build stage, as in:
>>>
>>> Disable travis tests
>>> https://github.com/apache/incubator-mxnet/pull/13137
>>>
>>> Best
>>> Anton
>>>
>>> Fri, 9 Nov 2018 at 11:17, Anton Chernov <mecher...@gmail.com>:
>>>
>>>>
>>>> Hi Naveen,
>>>>
>>>> I believe that the timeout is not an issue for the branch. And I see
>>>> great benefit in having tests for MACOS on the release branch. The travis
>>>> build is not blocking anyway, so I don't see any risk in adding it.
>>>>
>>>> * test_dropout
>>>>
>>>> Currently, there is a problem with test_dropout that fails consistently
>>>> on the branch:
>>>>
>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.3.x/97/pipeline
>>>>
>>>> Error reported:
>>>>
>>>> ======================================================================
>>>> FAIL: test_operator.test_dropout
>>>> ----------------------------------------------------------------------
>>>> Traceback (most recent call last):
>>>>   File "C:\Anaconda3\envs\py3\lib\site-packages\nose\case.py", line 197, in runTest
>>>>     self.test(*self.arg)
>>>>   File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\common.py", line 173, in test_new
>>>>     orig_test(*args, **kwargs)
>>>>   File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\test_operator.py", line 5853, in test_dropout
>>>>     check_dropout_ratio(0.0, shape)
>>>>   File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\test_operator.py", line 5797, in check_dropout_ratio
>>>>     assert exe.outputs[0].asnumpy().min() == min_value
>>>> AssertionError:
>>>> -------------------- >> begin captured logging << --------------------
>>>> common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=428273587 to reproduce.
>>>> --------------------- >> end captured logging << ---------------------
>>>>
>>>> The test is enabled on master:
>>>>
>>>> Re-enables test_operator.test_dropout
>>>> https://github.com/apache/incubator-mxnet/pull/12717
>>>>
>>>> And there are no failures for it [1].
>>>>
>>>> * KVStore tests
>>>>
>>>> Unfortunately, KVStore tests fail as well.
>>>>
>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.3.x/96/pipeline
>>>>
>>>> Error reported:
>>>>
>>>> AssertionError
>>>> test_gluon_trainer_type()
>>>> assert trainer._update_on_kvstore is update_on_kv\
>>>> File "dist_sync_kvstore.py", line 388, in test_gluon_trainer_type
>>>>
>>>> If nobody has a fix for these issues, I will disable the tests and add
>>>> information to the known issues section.
>>>>
>>>> Best
>>>> Anton
>>>>
>>>> [1]
>>>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
>>>>
>>>> Thu, 8 Nov 2018 at 21:44, Naveen Swamy <mnnav...@gmail.com>:
>>>>
>>>>> Anton, I don't think we need to add the Mac OS tests for the 1.3.1
>>>>> branch, since travis CI is timing out and creates blockers; they also
>>>>> did not exist for v1.3.0.
>>>>>
>>>>>
>>>>> On Thu, Nov 8, 2018 at 10:04 AM Anton Chernov <mecher...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> > A PR to fix the tests:
>>>>> >
>>>>> > Remove test for non existing index copy operator (v1.3.x)
>>>>> > https://github.com/apache/incubator-mxnet/pull/13180
>>>>> >
>>>>> >
>>>>> > Best
>>>>> > Anton
>>>>> >
>>>>> > Thu, 8 Nov 2018 at 10:05, Anton Chernov <mecher...@gmail.com>:
>>>>> >
>>>>> > > An addition has been made to include MacOS tests for the v1.3.x
>>>>> > > branch:
>>>>> > >
>>>>> > > [MXNET-908] Enable minimal OSX Travis build (v1.3.x)
>>>>> > > https://github.com/apache/incubator-mxnet/pull/13179
>>>>> > >
>>>>> > > It includes the following PRs for master:
>>>>> > >
>>>>> > > [MXNET-908] Enable minimal OSX Travis build
>>>>> > > https://github.com/apache/incubator-mxnet/pull/12462
>>>>> > >
>>>>> > > [MXNET-908] Enable python tests in Travis
>>>>> > > https://github.com/apache/incubator-mxnet/pull/12550
>>>>> > >
>>>>> > > [MXNET-968] Fix MacOS python tests
>>>>> > > https://github.com/apache/incubator-mxnet/pull/12590
>>>>> > >
>>>>> > >
>>>>> > > Best
>>>>> > > Anton
>>>>> > >
>>>>> > >
>>>>> > > Thu, 8 Nov 2018 at 9:38, Anton Chernov <mecher...@gmail.com>:
>>>>> > >
>>>>> > >> Thank you everyone for your support and suggestions. All proposed
>>>>> > >> PRs have been merged. We will tag the release candidate and start
>>>>> > >> the vote on Friday, the 9th of November 2018.
>>>>> > >>
>>>>> > >> Unfortunately, after the merges the tests started to fail:
>>>>> > >>
>>>>> > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.3.x/
>>>>> > >>
>>>>> > >> I will look into the failures, but any help is, as usual, very
>>>>> > >> much appreciated.
>>>>> > >>
>>>>> > >> The nightly tests are fine:
>>>>> > >> http://jenkins.mxnet-ci.amazon-ml.com/job/NightlyTests/job/v1.3.x/
>>>>> > >>
>>>>> > >>
>>>>> > >> Best
>>>>> > >> Anton
>>>>> > >>
>>>>> > >>
>>>>> > >> Wed, 7 Nov 2018 at 17:19, Anton Chernov <mecher...@gmail.com>:
>>>>> > >>
>>>>> > >>> Yes, you are right about the versions wording, thanks for the
>>>>> > >>> clarification.
>>>>> > >>>
>>>>> > >>> A performance improvement can be considered a bugfix as well. I
>>>>> > >>> see no big risks in including the PRs by Haibin and Lin in the
>>>>> > >>> patch release.
>>>>> > >>>
>>>>> > >>> @Haibin, if you can reopen the PRs they should be good to go for
>>>>> > >>> the release, considering the importance of the improvements.
>>>>> > >>>
>>>>> > >>> I propose the following bugfixes for the release as well (already
>>>>> > >>> created corresponding PRs):
>>>>> > >>>
>>>>> > >>> Fixed __setattr__ method of _MXClassPropertyMetaClass (v1.3.x)
>>>>> > >>> https://github.com/apache/incubator-mxnet/pull/13157
>>>>> > >>>
>>>>> > >>> fixed symbols naming in RNNCell, LSTMCell, GRUCell (v1.3.x)
>>>>> > >>> https://github.com/apache/incubator-mxnet/pull/13158
>>>>> > >>>
>>>>> > >>> We will be starting to merge the PRs shortly. If there are no more
>>>>> > >>> proposals for backporting, I would consider the list as set.
>>>>> > >>>
>>>>> > >>> Best
>>>>> > >>> Anton
>>>>> > >>>
>>>>> > >>> Wed, 7 Nov 2018 at 17:01, Sheng Zha <szha....@gmail.com>:
>>>>> > >>>
>>>>> > >>>> Hi Anton,
>>>>> > >>>>
>>>>> > >>>> I hear your concern about a simultaneous 1.4.0 release and it
>>>>> > >>>> certainly is a valid one.
>>>>> > >>>>
>>>>> > >>>> Regarding the release, let’s agree on the language first.
>>>>> > >>>> According to semver.org, a 1.3.1 release is considered a patch
>>>>> > >>>> release, which is for backward compatible bug fixes, while a
>>>>> > >>>> 1.4.0 release is considered a minor release, which is for
>>>>> > >>>> backward compatible new features. A major release would mean 2.0.
>>>>> > >>>>
>>>>> > >>>> The three PRs suggested by Haibin and Lin are all introducing new
>>>>> > >>>> features. If they go into a patch release, it would require an
>>>>> > >>>> exception accepted by the community. Also, if other violations
>>>>> > >>>> happen, they could be grounds for declining a release during
>>>>> > >>>> the vote.
>>>>> > >>>>
>>>>> > >>>> -sz
>>>>> > >>>>
>>>>> > >>>> > On Nov 7, 2018, at 2:25 AM, Anton Chernov <mecher...@gmail.com>
>>>>> > >>>> > wrote:
>>>>> > >>>> >
>>>>> > >>>> > [MXNET-1179] Enforce deterministic algorithms in convolution
>>>>> > >>>> > layers