"I ran a similar test(test_slice_batchnorm) for 5K times and I couldn't reproduce the issue."
One thing to keep in mind is that the SelectAlgo call will cache results in a registry that is in static scope. To repro you'd likely have to create a new process each time you run the test. (Apologies if this is already how you're reproducing.)

SelectAlgo call:
https://github.com/apache/incubator-mxnet/blob/403831ace46eab4447794df9411351e439e8983e/src/operator/nn/cudnn/cudnn_convolution-inl.h#L609

Static local / singleton registry pattern here:
https://github.com/apache/incubator-mxnet/blob/024b5a916dd3a39a39031ce5e6565cd7d9d60fe2/src/operator/nn/cudnn/cudnn_algoreg.cc#L37
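To make the caching behavior concrete, here is a minimal sketch of the static-local singleton registry pattern. The AlgoRegistry / AlgoChoice names and the string key are invented for illustration; they are not MXNet's actual types:

#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a cached algorithm choice.
struct AlgoChoice {
  int algo_id;
  size_t workspace_bytes;
};

class AlgoRegistry {
 public:
  // Static-local singleton: constructed on first use, lives for the
  // lifetime of the process.
  static AlgoRegistry* Get() {
    static AlgoRegistry instance;
    return &instance;
  }

  // Returns true and fills *out if a previous run in this process
  // already selected an algorithm for this shape/param signature.
  bool Find(const std::string& key, AlgoChoice* out) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = reg_.find(key);
    if (it == reg_.end()) return false;
    *out = it->second;
    return true;
  }

  void Register(const std::string& key, const AlgoChoice& choice) {
    std::lock_guard<std::mutex> lock(mutex_);
    reg_[key] = choice;
  }

 private:
  AlgoRegistry() = default;
  std::mutex mutex_;
  std::map<std::string, AlgoChoice> reg_;
};

int main() {
  AlgoChoice choice;
  // First run: cache miss, so the (expensive, fallible) selection runs.
  if (!AlgoRegistry::Get()->Find("conv_3x3_nchw_f32", &choice)) {
    choice = {1, 1 << 20};  // pretend SelectAlgo picked this
    AlgoRegistry::Get()->Register("conv_3x3_nchw_f32", choice);
  }
  // Every later run in the same process hits the cache and never
  // re-executes selection.
  bool cached = AlgoRegistry::Get()->Find("conv_3x3_nchw_f32", &choice);
  std::cout << "cached: " << std::boolalpha << cached << "\n";
  return 0;
}

Because the instance lives in static storage, running a test 5K times inside one process exercises the selection path exactly once; only a fresh process starts with an empty registry.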
On Thu, Oct 4, 2018 at 8:58 PM Marco de Abreu
<marco.g.ab...@googlemail.com.invalid> wrote:

> For GPU, we don't run any tests in parallel.
>
> -Marco
>
> Naveen Swamy <mnnav...@gmail.com> wrote on Thu, Oct 4, 2018, 19:54:
>
> > Looking at the error raised, you can see that the workspace size (GPU
> > mem size) of 1GB isn't sufficient. I am wondering if it is due to tests
> > running in parallel on CI; if this is true (tests running in parallel),
> > is it possible to reduce the parallelism?
> > Error:
> > "mxnet.base.MXNetError: [05:40:12]
> > src/operator/nn/./cudnn/cudnn_convolution-inl.h:870: Failed to find any
> > forward convolution algorithm. with workspace size of 1073741824 bytes,
> > please consider reducing batch/model size or increasing the workspace
> > size"
> >
> > I ran a similar test (test_slice_batchnorm) 5K times and I couldn't
> > reproduce the issue. I will look into it further to see if there are
> > other alternatives.
> >
> > On Thu, Oct 4, 2018 at 10:48 AM Piyush Ghai <ghai.piy...@gmail.com> wrote:
> >
> > > Another build where test_slice_batchnorm_reshape_batchnorm fails:
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline
> > >
> > > —
> > > Piyush
> > >
> > > > On Oct 3, 2018, at 9:32 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > >
> > > > Seems it is not the only test:
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline
> > > >
> > > > test_slice_batchnorm_reshape_batchnorm is also failing and hasn't
> > > > been touched for a while. It doesn't look like a problem with the
> > > > test to me (not a flaky test). It looks to me like we should find
> > > > and address the root cause instead of disabling the test in this
> > > > case.
> > > >
> > > > Pedro.
> > > >
> > > > On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu
> > > > <marco.g.ab...@googlemail.com.invalid> wrote:
> > > >
> > > > > I have created an issue at
> > > > > https://github.com/apache/incubator-mxnet/issues/12715 and a PR
> > > > > to disable the test at
> > > > > https://github.com/apache/incubator-mxnet/pull/12716.
> > > > >
> > > > > This test is pretty new and was submitted with a number of other
> > > > > problematic (and disabled) tests:
> > > > > https://github.com/apache/incubator-mxnet/issues/11164 It could
> > > > > be possible that the test is simply not stable enough. The PR
> > > > > that introduced that test is
> > > > > https://github.com/apache/incubator-mxnet/pull/10921 - it was
> > > > > merged two days ago.
> > > > >
> > > > > Best regards,
> > > > > Marco
> > > > >
> > > > > On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for checking, Lin. If it happens again we will have to
> > > > > > dig deeper. We have just one executor on GPU, so I wonder what
> > > > > > could be the root cause of this.
> > > > > >
> > > > > > On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apefor...@gmail.com> wrote:
> > > > > >
> > > > > > > I could not reproduce the error on an EC2 g3x8 instance,
> > > > > > > making it hard to debug. I also suspect it was due to a
> > > > > > > resource usage limit on the CI instance.
> > > > > > >
> > > > > > > On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > >
> > > > > > > > It doesn't look like flakiness to me at first sight. I
> > > > > > > > think it might be related to resource usage / allocation /
> > > > > > > > a leak in the worst case.
> > > > > > > >
> > > > > > > > Could be that there was not enough GPU memory at the time
> > > > > > > > of test execution. But I'm just speculating, hence my
> > > > > > > > original question.
> > > > > > > >
> > > > > > > > Pedro.
> > > > > > > >
> > > > > > > > On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apefor...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Pedro,
> > > > > > > > >
> > > > > > > > > I also got this failure in my PR
> > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > > > > > > > >
> > > > > > > > > I was not able to identify the root cause of it from the
> > > > > > > > > changelist. Are you suggesting there is some flakiness in
> > > > > > > > > the master branch too?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Lin
> > > > > > > > >
> > > > > > > > > On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi
> > > > > > > > > >
> > > > > > > > > > I saw this failure on CI:
> > > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > > > > > > > > >
> > > > > > > > > > Have you seen other cases where we fail to select the
> > > > > > > > > > best CUDNN algorithm? In which circumstances could this
> > > > > > > > > > happen, and do you think it is a good idea to have one
> > > > > > > > > > selected by default as a last resort?
> > > > > > > > > >
> > > > > > > > > > Pedro.
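Coming back to the error in the quoted thread ("Failed to find any forward convolution algorithm ... 1073741824 bytes") and Pedro's question about selecting one by default as a last resort: at the raw cuDNN level, IMPLICIT_GEMM is the usual zero-workspace fallback. Below is a rough standalone sketch of workspace-capped algorithm choice with that fallback. This is not MXNet's actual selection code; it assumes the cuDNN 7 _v7 query API, and the tensor shapes and the 1 GB cap are made up for illustration:

#include <cudnn.h>
#include <cstdio>

#define CHECK_CUDNN(call)                                        \
  do {                                                           \
    cudnnStatus_t s = (call);                                    \
    if (s != CUDNN_STATUS_SUCCESS) {                             \
      std::printf("cuDNN error: %s\n", cudnnGetErrorString(s));  \
      return 1;                                                  \
    }                                                            \
  } while (0)

int main() {
  const size_t workspace_limit = 1073741824;  // 1 GB, as in the CI error

  cudnnHandle_t handle;
  CHECK_CUDNN(cudnnCreate(&handle));

  // Made-up shapes for illustration: NCHW float32, 3x3 conv, pad 1.
  cudnnTensorDescriptor_t x_desc, y_desc;
  cudnnFilterDescriptor_t w_desc;
  cudnnConvolutionDescriptor_t conv_desc;
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&x_desc));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW,
                                         CUDNN_DATA_FLOAT, 32, 64, 56, 56));
  CHECK_CUDNN(cudnnCreateFilterDescriptor(&w_desc));
  CHECK_CUDNN(cudnnSetFilter4dDescriptor(w_desc, CUDNN_DATA_FLOAT,
                                         CUDNN_TENSOR_NCHW, 64, 64, 3, 3));
  CHECK_CUDNN(cudnnCreateConvolutionDescriptor(&conv_desc));
  CHECK_CUDNN(cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 1, 1, 1, 1,
                                              CUDNN_CROSS_CORRELATION,
                                              CUDNN_DATA_FLOAT));
  int n, c, h, w;
  CHECK_CUDNN(cudnnGetConvolution2dForwardOutputDim(conv_desc, x_desc,
                                                    w_desc, &n, &c, &h, &w));
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&y_desc));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(y_desc, CUDNN_TENSOR_NCHW,
                                         CUDNN_DATA_FLOAT, n, c, h, w));

  // Ask cuDNN for candidate algorithms, best first (heuristic-based,
  // no actual benchmarking).
  cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
  int returned = 0;
  CHECK_CUDNN(cudnnGetConvolutionForwardAlgorithm_v7(
      handle, x_desc, w_desc, conv_desc, y_desc,
      CUDNN_CONVOLUTION_FWD_ALGO_COUNT, &returned, perf));

  // Take the first candidate whose workspace fits the limit; otherwise
  // fall back to IMPLICIT_GEMM, which generally needs no extra workspace.
  cudnnConvolutionFwdAlgo_t chosen = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
  for (int i = 0; i < returned; ++i) {
    if (perf[i].status == CUDNN_STATUS_SUCCESS &&
        perf[i].memory <= workspace_limit) {
      chosen = perf[i].algo;
      break;
    }
  }
  std::printf("chosen algo: %d\n", static_cast<int>(chosen));

  cudnnDestroyTensorDescriptor(x_desc);
  cudnnDestroyTensorDescriptor(y_desc);
  cudnnDestroyFilterDescriptor(w_desc);
  cudnnDestroyConvolutionDescriptor(conv_desc);
  cudnnDestroy(handle);
  return 0;
}

One trade-off to weigh: a silent zero-workspace fallback hides what can be a large slowdown, which may be why the current code raises an error instead.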