Just double-checked: CUDA 9, 10 and 10.1 all support SM3, so I don't believe there's any need to drop SMs.
On Wed, Jun 19, 2019 at 9:56 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:

> I think where we're all going to have agreement is that we shouldn't have
> code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
> than 6. We can go ahead and remove any code that targets those old
> versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6. I'd
> suggest we also add some logging for users with prior versions letting them
> know they can still use MXNet 1.4.
>
> Where things get interesting is CUDA 9 / cuDNN 6 support. I was
> originally a proponent of the N and N-1 route for simplicity. Looking back
> at the choice, one complication I see is that there are competing concerns
> between semver library compatibility and feature releases on NVIDIA's
> part. NVIDIA is releasing new libraries with a lot of new features on a
> regular basis, which is good, but for compatibility reasons they've begun
> to bump major versions less often, which is probably also good. For
> example, if memory serves correctly, cuDNN used to get a major version (MV)
> bump every 6 months or so, but now the N-1 MV (6) was released in April of
> 2017. As a project maintainer I would certainly like to drop support for
> library versions that are 2 years old in my latest release. Supporting a
> 2-year-wide range of dependency libraries in the CI, for example, is going
> to be a burden.
>
> From the MXNet users' perspective obviously having to update dependencies
> is a pain, but updating these libs is likely to give significant
> performance increases (occasional perf regressions aside). I think a
> consistent thread I've heard from users is that training takes too long,
> inference costs too much, and they want their DL framework to abstract the
> complexity of using custom hardware like TCs or AVX without them having to
> put in a lot of effort. Another consideration is that using old versions of
> MXNet is actually quite easy and convenient thanks to (IMO) some solid
> release practices and naming conventions.
>
> Given how easy it is to use old MXNet versions I think it's reasonable to
> target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
> versions).
>
> On Wed, Jun 19, 2019 at 4:01 AM Marco de Abreu <marco.g.ab...@gmail.com>
> wrote:
>
>> Good points, Anirudh. Generally I would understand N as being the major
>> version. That is, we would maintain CUDA 9 and 10.1 in your given example
>> and drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would
>> only be dropped when 11 is released and tested.
>>
>> At the same time, we would always support only the latest compatible
>> cudnn version. Or is there any reason somebody would use an old cudnn
>> version?
>>
>> Wdyt?
>>
>> -Marco
>>
>> Anirudh Subramanian <anirudh2...@gmail.com> wrote on Wed, Jun 19, 2019,
>> 01:47:
>>
>> > +1, Agree this should be done for both CUDA and CUDNN versions. At most
>> > CUDA Version N and CUDA Version N - 1 should be supported in CI.
>> >
>> > My question is what happens when we are in a position where we are on a
>> > CUDA version N and have removed support for CUDA version N - 1. Within a
>> > small duration Nvidia comes up with a CUDA patch version N + 1, where
>> > some perf regressions and some bugs have been fixed. Should we just move
>> > to N + 1, since version N will have all these issues for users and may
>> > also slow us down on CI?
>> >
>> > I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
>> > causing intermittent CI failures:
>> > https://github.com/apache/incubator-mxnet/issues/15273 . There is
>> > already a PR to bump up Nvidia version to 10.1 (
>> > https://github.com/apache/incubator-mxnet/pull/14986/files).
>> >
>> > I think for situations where there is a quick follow up release like
>> > 10.1 and MXNet users are impacted by certain issues, we should just bump
>> > up the version and stop support for 10.0.
>> > Would like to hear more from Nvidia folks (on this particular case of
>> > CUDA 10.0 vs CUDA 10.1 and what are the recommendations for existing
>> > customers).
>> >
>> > Anirudh
>> >
>> > On Mon, Jun 3, 2019 at 4:21 PM Dick Carter <dickjc...@apache.org>
>> > wrote:
>> >
>> > > Actually, I tried to say that support *doesn't necessarily* include
>> > > N-1. I'm proposing that the supported versions are 1) covered by CI
>> > > and 2) have been available in a usable form long enough that a
>> > > semi-motivated user has been able to transition to it. That might mean
>> > > only N (e.g. per my proposal, only cuDNN v7).
>> > >
>> > > Regarding precedent for N / N-1, when a new CUDA version comes out,
>> > > users will transition to it at their own pace, thereby creating an N /
>> > > N-1 support situation for some period.
>> > >
>> > >
>> > > On 2019/06/03 22:43:20, Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > wrote:
>> > > > Your proposal of having support for N and N-1 makes a lot of sense
>> > > > to me. Are there use cases for supporting older CUDA versions?
>> > > >
>> > > >
>> > > > Thanks.
>> > > >
>> > > > On Mon, Jun 3, 2019 at 3:06 PM Dick Carter <dickjc...@apache.org>
>> > > > wrote:
>> > > > >
>> > > > > I'd like to revisit the discussion of:
>> > > > > https://lists.apache.org/thread.html/27b84e4fc0e0728f2e4ad8b6827d7f996635021a5a4d47b5d3f4dbfb@%3Cdev.mxnet.apache.org%3E
>> > > > > now that a year has passed.
>> > > > >
>> > > > > My motivation is:
>> > > > >
>> > > > > 1. There's a lot of hard-to-read '#if CUDNN_MAJOR' code
>> > > > > referencing cuDNN versions back as far as v4(!?). We need to clean
>> > > > > this out before it hampers our ability to nimbly move the codebase
>> > > > > forward.
>> > > > >
>> > > > > 2. There seems to be a difference of opinion on whether we should
>> > > > > be supporting version 'N-1' (e.g. cuDNN6). Our current MXNet 1.5
>> > > > > candidate does not compile against cuDNN v6, so this should be
>> > > > > either fixed or stated up front to the user community. The
>> > > > > breaking PR was
>> > > > > https://github.com/apache/incubator-mxnet/pull/14476.
>> > > > >
>> > > > > Having read the prior discussion, my take on it is:
>> > > > >
>> > > > > - Users should be given an ample time period (1 year?) to move to
>> > > > > a new CUDA/cuDNN version once it becomes 'usable.'
>> > > > >
>> > > > > - We should not claim to support a given version if it is no
>> > > > > longer part of the MXNet CI. Users should be warned of an
>> > > > > impending dropping of this 'testing support.'
>> > > > >
>> > > > > So these statements do not necessarily promise 'N-1' support. I
>> > > > > could see a transitioning of the CI from CUDA9-only -> CUDA9&10 ->
>> > > > > CUDA10 only. Some period before CUDA9 is dropped from CI, the user
>> > > > > community is warned. After that time, CUDA10 might be the only
>> > > > > version tested by CI, and hence the only version supported (until
>> > > > > the next CUDA version came around).
>> > > > >
>> > > > > Let me propose as a 'strawman' that we claim to support CUDA
>> > > > > version 9 and 10, with cuDNN version 7 only. Those versions have
>> > > > > been out for over 1.5 years. So no CUDA 8 or cuDNN v6 support:
>> > > > > over 1.5 years old with no coverage by our CI.
>> > > > >
>> > > > > -Dick