Thanks for checking, Lin. If it happens again we will have to dig deeper. We
have just one executor on the GPU, so I wonder what the root cause of this
could be.
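
Regarding the GPU memory theory quoted below: before digging deeper it may be
worth having the CI job print the free device memory right before the failing
test runs. A minimal sketch using the CUDA runtime API (just an illustration,
not something we have in the tree; the file name and build line are assumed):

    // check_gpu_mem.cc -- print free/total device memory before a test run.
    // Build (assumed): nvcc -o check_gpu_mem check_gpu_mem.cc
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
      size_t free_bytes = 0, total_bytes = 0;
      cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
      if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
      }
      std::printf("GPU memory: %.0f MiB free of %.0f MiB\n",
                  free_bytes / 1048576.0, total_bytes / 1048576.0);
      return 0;
    }

If the free figure is low right before the test that fails, that would support
the resource-limit theory.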
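
On the last-resort default I asked about in the quoted thread: what I had in
mind is roughly the sketch below, using cuDNN v7 calls. It assumes the tensor,
filter and convolution descriptors are already set up by the caller, it is not
what MXNet currently does, and the function name and workspace-limit parameter
are invented for illustration:

    #include <cudnn.h>
    #include <cstdio>

    // If algorithm search fails (for example because the GPU is short on
    // memory), fall back to IMPLICIT_GEMM, which needs no workspace and is
    // broadly supported, instead of failing the operator outright.
    cudnnConvolutionFwdAlgo_t PickForwardAlgo(
        cudnnHandle_t handle,
        cudnnTensorDescriptor_t x_desc,
        cudnnFilterDescriptor_t w_desc,
        cudnnConvolutionDescriptor_t conv_desc,
        cudnnTensorDescriptor_t y_desc,
        size_t workspace_limit_bytes) {
      cudnnConvolutionFwdAlgoPerf_t perf;
      int returned = 0;
      cudnnStatus_t status = cudnnFindConvolutionForwardAlgorithm(
          handle, x_desc, w_desc, conv_desc, y_desc,
          /*requestedAlgoCount=*/1, &returned, &perf);
      if (status == CUDNN_STATUS_SUCCESS && returned > 0 &&
          perf.status == CUDNN_STATUS_SUCCESS &&
          perf.memory <= workspace_limit_bytes) {
        return perf.algo;  // best algorithm found by the search
      }
      std::fprintf(stderr, "cuDNN algorithm search did not return a usable "
                           "algorithm; falling back to IMPLICIT_GEMM\n");
      return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
    }

The obvious trade-off is speed: the fallback can be much slower than the
searched algorithm, but it would turn a hard failure into a slow pass.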

On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <[email protected]> wrote:

> I could not reproduce the error on an EC2 g3x8 instance, which makes it
> hard to debug. I also suspect it was due to a resource usage limit on the
> CI instance.
>
> On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <[email protected]>
> wrote:
>
> > It doesn't look like flakiness to me at first sight. I think it might be
> > related to resource usage / allocation, or a leak in the worst case.
> >
> > It could be that there was not enough GPU memory at the time of test
> > execution. But I'm just speculating, hence my original question.
> >
> > Pedro.
> >
> > On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <[email protected]> wrote:
> >
> > > Hi Pedro,
> > >
> > > I also got this failure in my PR
> > >
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > >
> > > I was not able to identify the root cause from the changelist. Are you
> > > suggesting there is some flakiness in the master branch too?
> > >
> > > Thanks,
> > >
> > > Lin
> > >
> > > On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <[email protected]>
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > I saw this failure on CI:
> > > >
> > > >
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > > >
> > > > Have you seen other cases where we fail to select the best cuDNN
> > > > algorithm? In which circumstances could this happen, and do you think
> > > > it would be a good idea to have one selected by default as a last
> > > > resort?
> > > >
> > > >
> > > > Pedro.
> > > >
> > >
> >
>
