+1. I agree this should be done for both CUDA and cuDNN versions. At most,
CUDA version N and CUDA version N - 1 should be supported in CI.

My question is what happens when we are on CUDA version N and have already
removed support for CUDA version N - 1. Shortly afterwards, Nvidia comes out
with a CUDA patch version N + 1 in which some perf regressions and bugs have
been fixed. Should we just move to N + 1, since version N will have all
these issues for users and may also slow us down on CI?

I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
causing intermittent CI failures:
https://github.com/apache/incubator-mxnet/issues/15273 . There is already a
PR to bump up Nvidia version to 10.1 (
https://github.com/apache/incubator-mxnet/pull/14986/files).

I think that in situations where there is a quick follow-up release like
10.1 and MXNet users are impacted by certain issues, we should just bump up
the version and drop support for 10.0.
I would like to hear more from the Nvidia folks on this particular case of
CUDA 10.0 vs CUDA 10.1, and on what the recommendations are for existing
customers.

Anirudh

On Mon, Jun 3, 2019 at 4:21 PM Dick Carter <[email protected]> wrote:

> Actually, I tried to say that support *doesn't necessarily* include N-1.
> I'm proposing that the supported versions are 1) covered by CI and 2) have
> been available in a usable form long enough that a semi-motivated user has
> been able to transition to it.  That might mean only N (e.g. per my
> proposal, only cuDNN v7).
>
> Regarding precedent for N / N-1, when a new CUDA version comes out, users
> will transition to it at their own pace, thereby creating an N / N-1 support
> situation for some period.
>
>
> On 2019/06/03 22:43:20, Pedro Larroy <[email protected]>
> wrote:
> > Your proposal of having support for N and N-1 makes a lot of sense to
> > me. Are there use cases for supporting older CUDA versions?
> >
> >
> > Thanks.
> >
> > On Mon, Jun 3, 2019 at 3:06 PM Dick Carter <[email protected]> wrote:
> > >
> > > I'd like to revisit the discussion of:
> https://lists.apache.org/thread.html/27b84e4fc0e0728f2e4ad8b6827d7f996635021a5a4d47b5d3f4dbfb@%3Cdev.mxnet.apache.org%3E
> now that a year has passed.
> > >
> > > My motivation is:
> > >
> > > 1.  There's a lot of hard-to-read  '#if CUDNN_MAJOR' code referencing
> cuDNN versions back as far as v4(!?).  We need to clean this out before it
> hampers our ability to nimbly move the codebase forward.
> > >
> > > 2.  There seems to be a difference of opinion on whether we should be
> supporting version 'N-1' (e.g. cuDNN6).  Our current MXNet 1.5 candidate
> does not compile against cuDNN v6, so this should be either fixed or be
> up-front stated to the user community.  The breaking PR was
> https://github.com/apache/incubator-mxnet/pull/14476.
> > >
> > > Having read the prior discussion, my take on it is:
> > >
> > > - Users should be given an ample time period (1 year?) to move to a
> new CUDA/cuDNN version once it becomes 'usable.'
> > >
> > > > - We should not claim to support a given version if it is no longer
> part of the MXNet CI.  Users should be warned of an impending drop of
> this 'testing support.'
> > >
> > > So these statements do not necessarily promise 'N-1' support.  I could
> see a transitioning of the CI from CUDA9-only -> CUDA9&10 -> CUDA10 only.
> Some period before CUDA9 is dropped from CI, the user community is warned.
> After that time, CUDA10 might be the only version tested by CI, and hence
> the only version supported (until the next CUDA version came around).
> > >
> > > Let me propose as a 'strawman' that we claim to support CUDA version 9
> and 10, with cuDNN version 7 only.  Those versions have been out for over
> 1.5 years.  So no CUDA 8 or cuDNN v6 support- over 1.5 years old with no
> coverage by our CI.
> > >
> > >     -Dick
> >
>
