I created the following PR to disable the test:

Disable flaky test test_operator.test_dropout (v1.3.x)
https://github.com/apache/incubator-mxnet/pull/13200

I suppose the second failure is related to:

distributed kvstore bug in MXNet
https://github.com/apache/incubator-mxnet/issues/12713

It was partially fixed by:

Set correct update on kvstore flag in dist_device_sync mode (v1.3.x)
https://github.com/apache/incubator-mxnet/pull/13121

But another part of the issue is still open and does not have a fix yet:

"When distributed kvstore is used, by default gluon.Trainer doesn't work
with mx.optimizer.LRScheduler if a worker has more than 1 GPU. To be more
specific, the trainer updates once per GPU, the LRScheduler object is
shared across GPUs and get a wrong update count."
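
To illustrate the quoted description, here is a minimal toy sketch in plain Python. It does not use MXNet: ToyStepScheduler and lr_after are hypothetical names invented for this example, and the model assumes only what the issue text states, namely that the shared update counter advances once per GPU in each iteration.

```python
class ToyStepScheduler:
    """Hypothetical stand-in for a step LR scheduler; not the MXNet class."""

    def __init__(self, base_lr, step, factor):
        self.base_lr = base_lr
        self.step = step
        self.factor = factor

    def __call__(self, num_update):
        # Decay the learning rate by `factor` every `step` updates.
        return self.base_lr * self.factor ** (num_update // self.step)


def lr_after(iterations, num_gpus, scheduler):
    """Toy model of the reported behaviour (an assumption, not MXNet code):
    the shared update counter is bumped once per GPU per iteration."""
    num_update = 0
    for _ in range(iterations):
        for _ in range(num_gpus):
            num_update += 1
    return num_update, scheduler(num_update)


sched = ToyStepScheduler(base_lr=0.1, step=100, factor=0.5)
count_1gpu, lr_1gpu = lr_after(100, 1, sched)
count_4gpu, lr_4gpu = lr_after(100, 4, sched)
print(count_1gpu, lr_1gpu)
print(count_4gpu, lr_4gpu)
```

In this toy model, the same 100 iterations leave the counter at 100 with one GPU but at 400 with four, so the learning rate decays four times instead of once.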


Best
Anton


On Fri, Nov 9, 2018 at 11:48, Anton Chernov <mecher...@gmail.com> wrote:

> If the macOS tests time out as well, we can disable them and
> keep at least the build stage, as in:
>
> Disable travis tests
> https://github.com/apache/incubator-mxnet/pull/13137
>
> Best
> Anton
>
> On Fri, Nov 9, 2018 at 11:17, Anton Chernov <mecher...@gmail.com> wrote:
>
>>
>> Hi Naveen,
>>
>> I believe that the timeout is not an issue for the branch, and I see
>> great benefit in having macOS tests on the release branch. The Travis
>> build is not blocking anyway, so I don't see any risk in adding it.
>>
>> * test_dropout
>>
>> Currently, there is a problem with test_dropout that fails consistently
>> on the branch:
>>
>>
>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.3.x/97/pipeline
>>
>> Error reported:
>>
>> ======================================================================
>> FAIL: test_operator.test_dropout
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File "C:\Anaconda3\envs\py3\lib\site-packages\nose\case.py", line 197,
>> in runTest
>>     self.test(*self.arg)
>>   File
>> "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\common.py",
>> line 173, in test_new
>>     orig_test(*args, **kwargs)
>>   File
>> "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\test_operator.py",
>> line 5853, in test_dropout
>>     check_dropout_ratio(0.0, shape)
>>   File
>> "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\unittest\test_operator.py",
>> line 5797, in check_dropout_ratio
>>     assert exe.outputs[0].asnumpy().min() == min_value
>> AssertionError:
>> -------------------- >> begin captured logging << --------------------
>> common: INFO: Setting test np/mx/python random seeds, use
>> MXNET_TEST_SEED=428273587 to reproduce.
>> --------------------- >> end captured logging << ---------------------
>>
>> The test is enabled on master:
>>
>> Re-enables test_operator.test_dropout
>> https://github.com/apache/incubator-mxnet/pull/12717
>>
>> And there are no failures for it [1].
>>
>> * KVStore tests
>>
>> Unfortunately, KVStore tests fail as well.
>>
>>
>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.3.x/96/pipeline
>>
>> Error reported:
>>
>> AssertionError
>> test_gluon_trainer_type()
>>     assert trainer._update_on_kvstore is update_on_kv\
>>   File "dist_sync_kvstore.py", line 388, in test_gluon_trainer_type
>>
>> If nobody has a fix for these issues, I will disable the tests and add
>> information to the known issues section.
>>
>> Best
>> Anton
>>
>> [1] http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
>>
>> On Thu, Nov 8, 2018 at 21:44, Naveen Swamy <mnnav...@gmail.com> wrote:
>>
>>> Anton, I don't think we need to add the macOS tests for the 1.3.1 branch,
>>> since Travis CI is timing out and creates blockers. They also did not
>>> exist for v1.3.0.
>>>
>>>
>>> On Thu, Nov 8, 2018 at 10:04 AM Anton Chernov <mecher...@gmail.com>
>>> wrote:
>>>
>>> > A PR to fix the tests:
>>> >
>>> > Remove test for non existing index copy operator (v1.3.x)
>>> > https://github.com/apache/incubator-mxnet/pull/13180
>>> >
>>> >
>>> > Best
>>> > Anton
>>> >
>>> > On Thu, Nov 8, 2018 at 10:05, Anton Chernov <mecher...@gmail.com> wrote:
>>> >
>>> > > macOS tests have been added for the v1.3.x branch:
>>> > >
>>> > > [MXNET-908] Enable minimal OSX Travis build (v1.3.x)
>>> > > https://github.com/apache/incubator-mxnet/pull/13179
>>> > >
>>> > > It includes the following PRs from master:
>>> > >
>>> > > [MXNET-908] Enable minimal OSX Travis build
>>> > > https://github.com/apache/incubator-mxnet/pull/12462
>>> > >
>>> > > [MXNET-908] Enable python tests in Travis
>>> > > https://github.com/apache/incubator-mxnet/pull/12550
>>> > >
>>> > > [MXNET-968] Fix MacOS python tests
>>> > > https://github.com/apache/incubator-mxnet/pull/12590
>>> > >
>>> > >
>>> > > Best
>>> > > Anton
>>> > >
>>> > >
>>> > > On Thu, Nov 8, 2018 at 9:38, Anton Chernov <mecher...@gmail.com> wrote:
>>> > >
>>> > >> Thank you everyone for your support and suggestions. All proposed PRs
>>> > >> have been merged. We will tag the release candidate and start the vote
>>> > >> on Friday, the 9th of November 2018.
>>> > >>
>>> > >> Unfortunately, after the merges the tests started to fail:
>>> > >>
>>> > >>
>>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.3.x/
>>> > >>
>>> > >> I will look into the failures, but any help is, as usual, much
>>> > >> appreciated.
>>> > >>
>>> > >> The nightly tests are fine:
>>> > >> http://jenkins.mxnet-ci.amazon-ml.com/job/NightlyTests/job/v1.3.x/
>>> > >>
>>> > >>
>>> > >> Best
>>> > >> Anton
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >> On Wed, Nov 7, 2018 at 17:19, Anton Chernov <mecher...@gmail.com> wrote:
>>> > >>
>>> > >>> Yes, you are right about the version wording, thanks for the
>>> > >>> clarification.
>>> > >>>
>>> > >>> A performance improvement can be considered a bugfix as well. I see no
>>> > >>> big risks in including the PRs by Haibin and Lin in the patch release.
>>> > >>>
>>> > >>> @Haibin, if you can reopen the PRs, they should be good to go for the
>>> > >>> release, considering the importance of the improvements.
>>> > >>>
>>> > >>> I propose the following bugfixes for the release as well (I have
>>> > >>> already created the corresponding PRs):
>>> > >>>
>>> > >>> Fixed __setattr__ method of _MXClassPropertyMetaClass (v1.3.x)
>>> > >>> https://github.com/apache/incubator-mxnet/pull/13157
>>> > >>>
>>> > >>> fixed symbols naming in RNNCell, LSTMCell, GRUCell (v1.3.x)
>>> > >>> https://github.com/apache/incubator-mxnet/pull/13158
>>> > >>>
>>> > >>> We will start merging the PRs shortly. If there are no more proposals
>>> > >>> for backporting, I will consider the list final.
>>> > >>>
>>> > >>> Best
>>> > >>> Anton
>>> > >>>
>>> > >>> On Wed, Nov 7, 2018 at 17:01, Sheng Zha <szha....@gmail.com> wrote:
>>> > >>>
>>> > >>>> Hi Anton,
>>> > >>>>
>>> > >>>> I hear your concern about a simultaneous 1.4.0 release and it
>>> > >>>> certainly is a valid one.
>>> > >>>>
>>> > >>>> Regarding the release, let’s agree on the language first. According to
>>> > >>>> semver.org, the 1.3.1 release is considered a patch release, which is
>>> > >>>> for backward compatible bug fixes, while the 1.4.0 release is
>>> > >>>> considered a minor release, which is for backward compatible new
>>> > >>>> features. A major release would mean 2.0.
>>> > >>>>
>>> > >>>> The three PRs suggested by Haibin and Lin all introduce new
>>> > >>>> features. If they go into a patch release, it would require an
>>> > >>>> exception accepted by the community. Also, if other violations happen,
>>> > >>>> they could be grounds for declining a release during the vote.
>>> > >>>>
>>> > >>>> -sz
>>> > >>>>
>>> > >>>> > On Nov 7, 2018, at 2:25 AM, Anton Chernov <mecher...@gmail.com> wrote:
>>> > >>>> >
>>> > >>>> > [MXNET-1179] Enforce deterministic algorithms in convolution layers
>>> > >>>>
>>> > >>>
>>> >
>>>
>>
