I have created an issue at https://github.com/apache/incubator-mxnet/issues/12715 and a PR to disable the test at https://github.com/apache/incubator-mxnet/pull/12716.
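For readers following along: disabling a flaky test in a Python test suite is typically a one-line decorator change that references the tracking issue. A minimal sketch (the test class and test name here are hypothetical, not the actual test from the PR):

```python
import unittest

class ConvOperatorTest(unittest.TestCase):
    # Hypothetical flaky test: @unittest.skip keeps it visible in the
    # suite but prevents it from running until the tracked issue is fixed.
    @unittest.skip("flaky on GPU CI, tracked in issue 12715")
    def test_convolution_output(self):
        raise AssertionError("fails intermittently on CI")
```

A skipped test shows up as "skipped" in the runner output rather than as a failure, so the suite stays green while the issue remains discoverable.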
This test is pretty new and was submitted with a number of other
problematic (and disabled) tests:
https://github.com/apache/incubator-mxnet/issues/11164

It could be that the test is simply not stable enough. The PR that
introduced the test is https://github.com/apache/incubator-mxnet/pull/10921
- it was merged two days ago.

Best regards,
Marco

On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.li...@gmail.com>
wrote:

> Thanks for checking, Lin. If it happens again we will have to dig deeper.
> We have just one executor on GPU, so I wonder what the root cause of this
> could be.
>
> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apefor...@gmail.com> wrote:
>
> > I could not reproduce the error on an EC2 g3x8 instance, which makes it
> > hard to debug. I also suspect it was due to a resource usage limit on
> > the CI instance.
> >
> > On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy
> > <pedro.larroy.li...@gmail.com> wrote:
> >
> > > It doesn't look like flakiness to me at first sight. I think it might
> > > be related to resource usage / allocation / a leak in the worst case.
> > >
> > > It could be that there was not enough GPU memory at the time of test
> > > execution. But I'm just speculating, hence my original question.
> > >
> > > Pedro.
> > >
> > > On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apefor...@gmail.com> wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > I also got this failure in my PR:
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > > >
> > > > I was not able to identify the root cause of it from the changelist.
> > > > Are you suggesting there is some flakiness in the master branch too?
> > > >
> > > > Thanks,
> > > >
> > > > Lin
> > > >
> > > > On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy
> > > > <pedro.larroy.li...@gmail.com> wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > I saw this failure on CI:
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > > > >
> > > > > Have you seen other cases where we fail to select the best cuDNN
> > > > > algorithm? In which circumstances could this happen, and do you
> > > > > think it is a good idea to have one selected by default as a last
> > > > > resort?
> > > > >
> > > > > Pedro.
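The "default as a last resort" idea raised above can be sketched as a generic selection pattern: prefer the fastest algorithm that fits the available workspace, and fall back to a known-safe default when nothing fits (for example, under GPU memory pressure). This is illustrative Python only, not MXNet's actual cuDNN integration; the algorithm names and workspace sizes are hypothetical:

```python
# Hypothetical candidates: algorithm name -> workspace bytes it requires.
CANDIDATE_ALGOS = {
    "winograd": 512 << 20,  # fast, needs 512 MiB of workspace
    "fft": 256 << 20,       # slower, needs 256 MiB of workspace
}

# Assumed safe default requiring no extra workspace, so selection
# can always succeed even when GPU memory is nearly exhausted.
DEFAULT_ALGO = "implicit_gemm"

def select_conv_algo(free_workspace_bytes):
    """Pick the most demanding algorithm that fits; else the default."""
    viable = {name: need for name, need in CANDIDATE_ALGOS.items()
              if need <= free_workspace_bytes}
    if not viable:
        return DEFAULT_ALGO  # last resort: never fail outright
    # Use workspace size as a crude proxy for expected speed.
    return max(viable, key=viable.get)
```

Under this scheme a transient shortage of GPU memory degrades performance instead of aborting the operator, which would turn the CI failure above into, at worst, a slower test run.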