Re: CUDA / CUDNN support revisited

kellen sunderland Wed, 19 Jun 2019 10:00:26 -0700

Just double-checked: CUDA 9, 10 and 10.1 all support SM3, so actually I
don't believe there's any need to drop SMs.


On Wed, Jun 19, 2019 at 9:56 AM kellen sunderland <
[email protected]> wrote:

> I think where we're all going to have agreement is that we shouldn't have
> code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
> than 6.  We can go ahead and remove any code that targets those old
> versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6.  I'd
> suggest we also add some logging for users with prior versions letting them
> know they can still use MXNet 1.4.
>
> Where things get interesting is CUDA 9 / cuDNN 6 support.  I was
> originally a proponent of the N and N-1 route for simplicity.  Looking back
> at the choice, one complication I see is that there are competing concerns
> between semver library compatibility and feature releases on NVIDIA's
> part.  NVIDIA is releasing new libraries with a lot of new features on a
> regular basis, which is good, but for compatibility reasons they've begun
> to bump major versions less often, which is probably also good.  For
> example if memory serves correctly cuDNN used to get an MV bump every 6
> months or so, but now the N-1 MV (6) was released in April of 2017.  As a
> project maintainer I would certainly like to drop support for library
> versions that are 2 years old in my latest release.  Supporting a 2 year
> wide range of dependency libraries in the CI for example is going to be a
> burden.
>
> From the MXNet users' perspective obviously having to update dependencies
> is a pain, but updating these libs is likely to give significant
> performance increases (occasional perf regressions aside).  I think a
> consistent thread I've heard from users is that training takes too long,
> inference costs too much, and they want their DL framework to abstract the
> complexity of using custom hardware like TCs or AVX without them having to put
> in a lot of effort.  Another consideration is that using old versions of
> MXNet is actually quite easy and convenient thanks to (IMO) some solid
> release practices and naming conventions.
>
> Given how easy it is to use old MXNet versions I think it's reasonable to
> target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
> versions).
>
> On Wed, Jun 19, 2019 at 4:01 AM Marco de Abreu <[email protected]>
> wrote:
>
>> Good points, Anirudh. Generally I would understand N as being the major
>> versions. That is, we would maintain CUDA 9 and 10.1 in your given example
>> and drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would
>> only be dropped when 11 is released and tested.
>>
>> At the same time, we would always only support the latest compatible
>> cudnn version. Or is there any reason somebody would use an old cudnn
>> version?
>>
>> Wdyt?
>>
>> -Marco
>>
>> Anirudh Subramanian <[email protected]> wrote on Wed, 19 June 2019,
>> 01:47:
>>
>> > +1, Agree this should be done for both CUDA and CUDNN versions. At max
>> CUDA
>> > Version N and CUDA Version N - 1 should be supported in CI.
>> >
>> > My question is what happens when we are on CUDA version N and have
>> > removed support for CUDA version N - 1, and within a short time Nvidia
>> > comes up with a CUDA patch version N + 1 in which some perf regressions
>> > and some bugs have been fixed. Should we just move to N + 1, since
>> > version N will have all these issues for users and may also slow us
>> > down on CI?
>> >
>> > I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
>> > causing intermittent CI failures:
>> > https://github.com/apache/incubator-mxnet/issues/15273 . There is
>> already
>> > a
>> > PR to bump up Nvidia version to 10.1 (
>> > https://github.com/apache/incubator-mxnet/pull/14986/files).
>> >
>> > I think for situations where there is a quick follow up release like
>> 10.1
>> > and MXNet users are impacted by certain issues, we should just bump up
>> the
>> > version and stop support for 10.0.
>> > Would like to hear more from Nvidia folks (on this particular case of
>> CUDA
>> > 10.0 vs CUDA 10.1 and what are the recommendations for existing
>> customers).
>> >
>> > Anirudh
>> >
>> > On Mon, Jun 3, 2019 at 4:21 PM Dick Carter <[email protected]>
>> wrote:
>> >
>> > > Actually, I tried to say that support *doesn't necessarily* include
>> N-1.
>> > > I'm proposing that the supported versions are 1) covered by CI and 2)
>> > have
>> > > been available in a usable form long enough that a semi-motivated user
>> > has
>> > > been able to transition to it.  That might mean only N (e.g. per my
>> > > proposal, only cuDNN v7).
>> > >
>> > > Regarding precedent for N / N-1,  when a new CUDA version comes out,
>> > users
>> > > will transition to it at their own pace, thereby creating a N / N-1
>> > support
>> > > situation for some period.
>> > >
>> > >
>> > > On 2019/06/03 22:43:20, Pedro Larroy <[email protected]>
>> > > wrote:
>> > > > Your proposal of having support for N and N-1 makes a lot of sense
>> to
>> > > > me. Are there use cases for supporting older CUDA versions?
>> > > >
>> > > >
>> > > > Thanks.
>> > > >
>> > > > On Mon, Jun 3, 2019 at 3:06 PM Dick Carter <[email protected]>
>> > wrote:
>> > > > >
>> > > > > I'd like to revisit the discussion of:
>> > >
>> >
>> https://lists.apache.org/thread.html/27b84e4fc0e0728f2e4ad8b6827d7f996635021a5a4d47b5d3f4dbfb@%3Cdev.mxnet.apache.org%3E
>> > > now that a year has passed.
>> > > > >
>> > > > > My motivation is:
>> > > > >
>> > > > > 1.  There's a lot of hard-to-read  '#if CUDNN_MAJOR' code
>> referencing
>> > > cuDNN versions back as far as v4(!?).  We need to clean this out
>> before
>> > it
>> > > hampers our ability to nimbly move the codebase forward.
>> > > > >
>> > > > > 2.  There seems to be a difference of opinion on whether we
>> should be
>> > > supporting version 'N-1' (e.g. cuDNN6).  Our current MXNet 1.5
>> candidate
>> > > does not compile against cuDNN v6, so this should be either fixed or
>> be
>> > > up-front stated to the user community.  The breaking PR was
>> > > https://github.com/apache/incubator-mxnet/pull/14476.
>> > > > >
>> > > > > Having read the prior discussion, my take on it is:
>> > > > >
>> > > > > - Users should be given an ample time period (1 year?) to move to
>> a
>> > > new CUDA/cuDNN version once it becomes 'usable.'
>> > > > >
>> > > > > - We should not claim to support a given version if it is no
>> longer
>> > > part of the MXNet CI.  Users should be warned of an impending
>> dropping of
>> > > this 'testing support.'
>> > > > >
>> > > > > So these statements do not necessarily promise 'N-1' support.  I
>> > could
>> > > see a transitioning of the CI from CUDA9-only -> CUDA9&10 -> CUDA10
>> only.
>> > > Some period before CUDA9 is dropped from CI, the user community is
>> > warned.
>> > > After that time, CUDA10 might be the only version tested by CI, and
>> hence
>> > > the only version supported (until the next CUDA version came around).
>> > > > >
>> > > > > Let me propose as a 'strawman' that we claim to support CUDA
>> version
>> > 9
>> > > and 10, with cuDNN version 7 only.  Those versions have been out for
>> over
>> > > 1.5 years.  So no CUDA 8 or cuDNN v6 support- over 1.5 years old with
>> no
>> > > coverage by our CI.
>> > > > >
>> > > > >     -Dick
>> > > >
>> > >
>> >
>>
>
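
The strawman discussed in this thread (support CUDA 9 and 10 with cuDNN 7
only, and tell users on older versions that they can still use MXNet 1.4)
could be sketched roughly as below. This is only an illustration under
stated assumptions: the names check_toolkit, SUPPORTED_CUDA_MAJORS,
SUPPORTED_CUDNN_MAJORS and FALLBACK_RELEASE are hypothetical and are not
MXNet APIs, and the version sets simply restate the proposal rather than
any decision the project made.

```python
from typing import List

# Assumed policy values, restating the strawman from the thread.
# These names are hypothetical and do not exist in MXNet.
SUPPORTED_CUDA_MAJORS = {9, 10}
SUPPORTED_CUDNN_MAJORS = {7}
FALLBACK_RELEASE = "MXNet 1.4"


def _major(version: str) -> int:
    """Extract the major component from a dotted version string such as '10.1'."""
    return int(version.split(".")[0])


def check_toolkit(cuda_version: str, cudnn_version: str) -> List[str]:
    """Return warnings for unsupported versions; an empty list means supported."""
    warnings = []
    for name, version, supported in (
        ("CUDA", cuda_version, SUPPORTED_CUDA_MAJORS),
        ("cuDNN", cudnn_version, SUPPORTED_CUDNN_MAJORS),
    ):
        if _major(version) not in supported:
            wanted = ", ".join(str(m) for m in sorted(supported))
            warnings.append(
                f"{name} {version} is not covered by CI (supported major "
                f"versions: {wanted}); {FALLBACK_RELEASE} can still be used."
            )
    return warnings


if __name__ == "__main__":
    # Example: a user still on CUDA 8 / cuDNN 6 gets one warning per library.
    for message in check_toolkit("8.0", "6.0"):
        print("WARNING:", message)
```

Keeping the supported major versions in one place means that moving the CI
from CUDA9-only to CUDA9&10 to CUDA10-only would be a one-line change, with
the warning text following automatically.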
