"I ran a similar test(test_slice_batchnorm) for 5K times and I couldn't
reproduce the issue."

One thing to keep in mind is that the SelectAlgo call caches its results in
a registry with static scope, so the cached choice persists for the lifetime
of the process. To reproduce, you'd likely have to create a new process for
each run of the test.  (Apologies if this is already how you're reproducing.)

SelectAlgo call:
https://github.com/apache/incubator-mxnet/blob/403831ace46eab4447794df9411351e439e8983e/src/operator/nn/cudnn/cudnn_convolution-inl.h#L609

Static local / singleton registry pattern here:
https://github.com/apache/incubator-mxnet/blob/024b5a916dd3a39a39031ce5e6565cd7d9d60fe2/src/operator/nn/cudnn/cudnn_algoreg.cc#L37
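
To illustrate why a fresh process is needed, here is a minimal sketch of
that pattern (not the actual MXNet code; AlgoRegistry, Find, Register and
SelectAlgo below are placeholder names). The expensive search only runs on
the first call per process for a given key, so later calls return the
cached choice and can never hit the failure path again:

// Simplified sketch of a function-local static / singleton registry that
// caches algorithm selections for the lifetime of the process.
#include <map>
#include <mutex>
#include <string>

class AlgoRegistry {
 public:
  // Function-local static: one instance per process, created on first use.
  static AlgoRegistry& Get() {
    static AlgoRegistry inst;
    return inst;
  }

  // Returns true and fills *algo if this key was selected earlier.
  bool Find(const std::string& key, int* algo) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(key);
    if (it == cache_.end()) return false;
    *algo = it->second;
    return true;
  }

  // Stores the result of the expensive algorithm search.
  void Register(const std::string& key, int algo) {
    std::lock_guard<std::mutex> lock(mutex_);
    cache_[key] = algo;
  }

 private:
  std::mutex mutex_;
  std::map<std::string, int> cache_;
};

// Hypothetical caller: the search (where "Failed to find any forward
// convolution algorithm" would be raised) happens at most once per
// process for a given key.
int SelectAlgo(const std::string& key) {
  int algo = 0;
  if (!AlgoRegistry::Get().Find(key, &algo)) {
    algo = 1;  // stand-in for the cudnnFind*-style search
    AlgoRegistry::Get().Register(key, algo);
  }
  return algo;
}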

On Thu, Oct 4, 2018 at 8:58 PM Marco de Abreu
<marco.g.ab...@googlemail.com.invalid> wrote:

> For GPU, we don't run any tests in parallel.
>
> -Marco
>
> Naveen Swamy <mnnav...@gmail.com> schrieb am Do., 4. Okt. 2018, 19:54:
>
> > Looking at the error raised, you can see that the workspace size (GPU
> > memory size) of 1GB isn't sufficient. I am wondering if it is due to
> > tests running in parallel on CI; if so, is it possible to reduce the
> > parallelism?
> > Error:
> > "mxnet.base.MXNetError: [05:40:12]
> > src/operator/nn/./cudnn/cudnn_convolution-inl.h:870: Failed to find any
> > forward convolution algorithm.  with workspace size of 1073741824 bytes,
> > please consider reducing batch/model size or increasing the workspace
> > size"
> >
> > I ran a similar test (test_slice_batchnorm) 5K times and I couldn't
> > reproduce the issue. I will look into it further to see if there are
> > other alternatives.
> >
> >
> > On Thu, Oct 4, 2018 at 10:48 AM Piyush Ghai <ghai.piy...@gmail.com>
> wrote:
> >
> > > Another build where test_slice_batchnorm_reshape_batchnorm fails:
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline
> > >
> > > —
> > > Piyush
> > >
> > > > On Oct 3, 2018, at 9:32 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > > wrote:
> > > >
> > > > It seems it's not the only test:
> > > >
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline
> > > >
> > > > test_slice_batchnorm_reshape_batchnorm is also failing and hasn't
> > > > been touched for a while. It doesn't look like a problem with the
> > > > test to me (not a flaky test). It looks to me like we should find
> > > > and address the root cause instead of disabling the test in this
> > > > case.
> > > >
> > > > Pedro.
> > > >
> > > > On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu
> > > > <marco.g.ab...@googlemail.com.invalid> wrote:
> > > >
> > > >> I have created an issue at
> > > >> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to
> > > >> disable the test at
> > > >> https://github.com/apache/incubator-mxnet/pull/12716.
> > > >>
> > > >> This test is pretty new and was submitted with a number of other
> > > >> problematic (and disabled) tests:
> > > >> https://github.com/apache/incubator-mxnet/issues/11164. It could be
> > > >> that the test is simply not stable enough. The PR that introduced
> > > >> that test is https://github.com/apache/incubator-mxnet/pull/10921 -
> > > >> it was merged two days ago.
> > > >>
> > > >> Best regards,
> > > >> Marco
> > > >>
> > > >> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <
> > > pedro.larroy.li...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Thanks for checking, Lin. If it happens again we will have to dig
> > > >>> deeper. We have just one executor on GPU, so I wonder what the
> > > >>> root cause of this could be.
> > > >>>
> > > >>> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apefor...@gmail.com>
> > wrote:
> > > >>>
> > > >>>> I could not reproduce the error on an EC2 g3x8 instance, making
> > > >>>> it hard to debug. I also suspect it was due to a resource usage
> > > >>>> limit on the CI instance.
> > > >>>>
> > > >>>> On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <
> > > >>> pedro.larroy.li...@gmail.com
> > > >>>>>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> It doesn't look like flakiness to me at first sight. I think it
> > > >>>>> might be related to resource usage / allocation / a leak in the
> > > >>>>> worst case.
> > > >>>>>
> > > >>>>> It could be that there was not enough GPU memory at the time of
> > > >>>>> test execution. But I'm just speculating, hence my original
> > > >>>>> question.
> > > >>>>>
> > > >>>>> Pedro.
> > > >>>>>
> > > >>>>> On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apefor...@gmail.com>
> > wrote:
> > > >>>>>
> > > >>>>>> Hi Pedro,
> > > >>>>>>
> > > >>>>>> I also got this failure in my PR
> > > >>>>>>
> > > >>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > > >>>>>>
> > > >>>>>> I was not able to identify the root cause of it from the
> > > >>>>>> changelist. Are you suggesting there is some flakiness in the
> > > >>>>>> master branch too?
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>>
> > > >>>>>> Lin
> > > >>>>>>
> > > >>>>>> On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <
> > > >>>>> pedro.larroy.li...@gmail.com>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi
> > > >>>>>>>
> > > >>>>>>> I saw this failure on CI:
> > > >>>>>>>
> > > >>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > > >>>>>>>
> > > >>>>>>> Have you seen other cases where we fail to select the best
> > > >>>>>>> cuDNN algorithm? In which circumstances could this happen, and
> > > >>>>>>> do you think it would be a good idea to have one selected by
> > > >>>>>>> default as a last resort?
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Pedro.
> > > >>>>>>>
