Re: CUDA Support [DISCUSS]

Marco de Abreu Sat, 06 Jan 2018 12:27:33 -0800

Very good points, pracheer.

We have been thinking about running nightly integration tests that will
test the master branch on a wide range of settings (including IoT devices).
How about switching to CUDA 9 for PR validation and doing extensive checks
on CUDA 8/9 and a variety of other environments during the nightly run?
PRs would be tested on the latest and most widely used environments. I think
this would be a viable solution: if issues arise in CUDA 8 but not in CUDA 9,
that is something we as a community should investigate rather than just the
PR creator, as it could have a wide impact and may also influence other
parts of MXNet.


-Marco

On Sat, Jan 6, 2018 at 9:18 PM, pracheer gupta <[email protected]>
wrote:

> I agree with Naveen that we shouldn’t be forcing production systems to
> update as soon as a new version comes out. If we want MXNet to be more
> widely adopted, we should think about the convenience of the customers
> and the pain we might be forcing on them. Having said that, as Bhavin
> pointed out, given the resources it might be tricky to fully support the
> n-1 version in earnest. In fact, trying to support more configurations
> runs the risk of lowering the overall quality of support, even for the
> things that are important, due to limited resources.
>
> Wondering if there is a compromise possible where we don’t create a
> “panic” for production systems to upgrade. For instance, how about just
> supporting all the permutations of software/hardware launched in the last
> 2 years (or maybe 3)? This would give people enough time to upgrade while
> reducing the number of configurations we need to support.
>
> I also think that the discussion in this thread seems to have two sides:
> one specifically about CUDA 8/9 support, and the other a more general
> question of how many options the MXNet community should support.
>
> For CUDA 9, assuming it is truly backward compatible (read: no bugs at
> all), we might be able to force everyone to upgrade after some time
> (6 months? 1 year?). Until then, should we keep CUDA 8 in the CI system?
>
> Another possible solution is to be strategic for the time being and decide
> what would help us most in the long term, at the cost of short-term pain:
> officially support only the latest version (for the next few months at
> least) until we are able to get the CI system to a really good place,
> where it is convenient, easy to use, and easy to extend with more
> configurations, and then figure out the policy for how many software
> versions we should support.
>
> -Pracheer
>
>
> > On Jan 6, 2018, at 11:18 AM, Marco de Abreu <[email protected]> wrote:
> >
> > What do you think about finding out which version of cuda our users are
> > actually using and maybe finding out why they didn't upgrade if they are
> > still using an old version? Maybe there are some proper business reasons
> > we are not aware of.
> >
> > -Marco
> >
> > On Jan 6, 2018, at 8:08 PM, "Naveen Swamy" <[email protected]> wrote:
> >
> >> I will have to disagree with abandoning an N-1 version of the dependent
> >> libraries as a general guideline for the project. There might be
> >> exceptions to this, which should be discussed, agreed on, and **well
> >> documented** on the Apache MXNet webpage.
> >>
> >> My reasoning is that users who run software in their production
> >> environments take time to pick up the latest software to deploy. From my
> >> experience with critical systems, they will carefully test and evaluate
> >> new software before deploying it. The latest software sometimes has
> >> backward-incompatible features that would break their system. In order
> >> to earn trust from users, it's important that we don't start deprecating
> >> software as and when new libraries come out.
> >>
> >> What we could do is announce that, starting with version MXNet 1.00... + N,
> >> we would only support the N+1 library, with good reasoning such as this
> >> one (CUDA 9 being backward compatible), and recommend that users upgrade
> >> as well. Ideally this would happen when we release a new version of MXNet.
> >>
> >> So I think we should support CUDA 8 at least until we release a new
> >> version of MXNet, and pre-announce if we plan to drop it.
> >>
> >> my 2 cents.
> >>
> >> Thanks, Naveen
> >>
> >>
> >>
> >> On Sat, Jan 6, 2018 at 9:48 AM, Bhavin Thaker <[email protected]>
> >> wrote:
> >>
> >>> Hi Marco,
> >>>
> >>> Here are the years in which the GPU architectures were introduced:
> >>>
> >>>   - Tesla: 2008
> >>>   - Fermi: 2010
> >>>   - Kepler: 2012
> >>>   - Maxwell: 2014
> >>>   - Pascal: 2016
> >>>   - Volta: 2017
> >>>
> >>> I see no need to support the 7+ year old Fermi architecture for
> >>> fast-moving Apache MXNet.
> >>>
> >>> Bhavin Thaker.
> >>>
> >>> On Sat, Jan 6, 2018 at 9:36 AM Marco de Abreu <
> >>> [email protected]>
> >>> wrote:
> >>>
> >>>> Just to provide some data: dropping CUDA8 support would deprecate the
> >>>> Fermi architecture, effectively affecting the following devices:
> >>>>
> >>>> Compute capability 2.0, Fermi
> >>>> <https://en.wikipedia.org/wiki/Fermi_(microarchitecture)>:
> >>>>
> >>>>   - GPUs: GF100, GF110
> >>>>   - GeForce: GeForce GTX 590, GeForce GTX 580, GeForce GTX 570,
> >>>>     GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M
> >>>>   - Quadro: Quadro 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac,
> >>>>     Quadro Plex 7000, Quadro 5010M, Quadro 5000M
> >>>>   - Tesla: Tesla C2075, Tesla C2050/C2070, Tesla M2050/M2070/M2075/M2090
> >>>>
> >>>> Compute capability 2.1:
> >>>>
> >>>>   - GPUs: GF104, GF106, GF108, GF114, GF116, GF117, GF119
> >>>>   - GeForce (desktop): GeForce GTX 560 Ti, GeForce GTX 550 Ti,
> >>>>     GeForce GTX 460, GeForce GTS 450, GeForce GTS 450*,
> >>>>     GeForce GT 640 (GDDR3), GeForce GT 630, GeForce GT 620,
> >>>>     GeForce GT 610, GeForce GT 520, GeForce GT 440, GeForce GT 440*,
> >>>>     GeForce GT 430, GeForce GT 430*, GeForce GT 420*
> >>>>   - GeForce (mobile): GeForce GTX 675M, GeForce GTX 670M,
> >>>>     GeForce GT 635M, GeForce GT 630M, GeForce GT 625M, GeForce GT 720M,
> >>>>     GeForce GT 620M, GeForce 710M, GeForce 610M, GeForce 820M,
> >>>>     GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M,
> >>>>     GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M,
> >>>>     GeForce GT 520MX, GeForce GT 520M, GeForce GTX 485M,
> >>>>     GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M,
> >>>>     GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce 410M
> >>>>   - Quadro, NVS: Quadro 2000, Quadro 2000D, Quadro 600, Quadro 4000M,
> >>>>     Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 310, NVS 315,
> >>>>     NVS 5400M, NVS 5200M, NVS 4200M
> >>>>
> >>>> -Marco
> >>>>
> >>>> On Sat, Jan 6, 2018 at 6:31 PM, kellen sunderland <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> I like that proposal Bhavin.  I'm also interested to see what the
> >>>>> other community members think.
> >>>>>
> >>>>> On Sat, Jan 6, 2018 at 6:27 PM, Bhavin Thaker <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Kellen,
> >>>>>>
> >>>>>> Here is my opinion and stand on this:
> >>>>>>
> >>>>>> I see no need to test on CUDA8 in Apache MXNet CI, especially when
> >>>>>> CUDA9 is backward compatible with earlier Nvidia hardware
> >>>>>> generations. There is a time and resource cost to maintaining the
> >>>>>> various combinations in the CI, so I am NOT in favor of running CUDA8
> >>>>>> in CI unless there is a technical reason/requirement for it. This
> >>>>>> approach helps to encourage users to move to the latest CUDA version
> >>>>>> and thus keeps the open-source community’s maintenance cost low for
> >>>>>> the generic option of CUDA9.
> >>>>>>
> >>>>>> For example: If a user opens a GitHub issue/problem with Apache MXNet
> >>>>>> and CUDA8, I would ask the user to test it with CUDA9. If the problem
> >>>>>> happens only on CUDA8, then a volunteer in the community may work on
> >>>>>> it. If the problem happens on CUDA9 as well, then, in my humble
> >>>>>> opinion, this problem must be fixed by the community. In short, I
> >>>>>> propose that the MXNet CI run tests only with the latest CUDA9
> >>>>>> version and NOT CUDA8.
> >>>>>>
> >>>>>> I am eager to hear alternate viewpoints/corrections from folks other
> >>>>>> than Kellen and me.
> >>>>>>
> >>>>>> Bhavin Thaker.
> >>>>>>
> >>>>>> On Sat, Jan 6, 2018 at 8:24 AM kellen sunderland <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>> Thanks for the thoughts, Bhavin.  Supporting the latest release
> >>>>>>> would also be an option, and it would be easier from a support
> >>>>>>> point of view.
> >>>>>>>
> >>>>>>> "2) I think your question probably is what should be tested by the
> >>>>>>> Apache MXNet CI and NOT what is supported by Apache MXNet, correct?"
> >>>>>>>
> >>>>>>> I view these two things as being closely related, if not
> >>>>>>> equivalent.  If we don't run at least basic tests of old versions
> >>>>>>> of CUDA I think there will be issues that slip through.  That being
> >>>>>>> said we can rely on users to report these issues, and chances are
> >>>>>>> we'll be able to provide backwards compatible patches.  At a
> >>>>>>> minimum I'd recommend we should run tests on all supported CUDA
> >>>>>>> versions before a release.
> >>>>>>>
> >>>>>>> -Kellen
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sat, Jan 6, 2018 at 5:05 PM, Bhavin Thaker <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Kellen,
> >>>>>>>>
> >>>>>>>> 1) Does Apache MXNet (Incubating) have a support matrix? I think
> >>>>>>>> the answer is no, because I don’t know where it is documented. One
> >>>>>>>> of the mentors told me earlier that the community uses and
> >>>>>>>> modifies the open-source project as per their individual
> >>>>>>>> requirements or those of the community. As far as I know, there is
> >>>>>>>> no single entity that is responsible for supporting something in
> >>>>>>>> MXNet; corrections to my understanding are welcome.
> >>>>>>>>
> >>>>>>>> 2) I think your question probably is what should be tested by the
> >>>>>>>> Apache MXNet CI and NOT what is supported by Apache MXNet, correct?
> >>>>>>>>
> >>>>>>>> If yes, I propose testing only the latest CUDA9 and the respective
> >>>>>>>> latest cuDNN version in the MXNet CI since CUDA9 is backward
> >>>>>>>> compatible with earlier Nvidia hardware generations.
> >>>>>>>>
> >>>>>>>> I would like to hear reasons why this would not work.
> >>>>>>>>
> >>>>>>>> I have commented on the github issue as well:
> >>>>>>>> https://github.com/apache/incubator-mxnet/issues/8805
> >>>>>>>>
> >>>>>>>> Bhavin Thaker.
> >>>>>>>>
> >>>>>>>> On Sat, Jan 6, 2018 at 3:30 AM kellen sunderland <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hello all, I'd like to propose that we nail down exactly which
> >>>>>>>>> versions of CUDA we're supporting.  We can then ensure that we've
> >>>>>>>>> got good test coverage for those specific versions in CI.  At the
> >>>>>>>>> moment it's ambiguous what our current policy is.  I.e. when do
> >>>>>>>>> we drop support for old versions?  As a result we potentially cut
> >>>>>>>>> a release promising to support a certain version of CUDA, then
> >>>>>>>>> retroactively drop support after we find an issue.
> >>>>>>>>>
> >>>>>>>>> I'd like to propose that we officially support N, and N-1
> >>>>>>>>> versions of CUDA, where N is the most recent major version
> >>>>>>>>> release.  In addition we can do our best to support libraries
> >>>>>>>>> that are available for download for those versions.  Supporting
> >>>>>>>>> these CUDA versions would also dictate which hardware we support
> >>>>>>>>> in terms of compute capability (of course resource constraints
> >>>>>>>>> would also play some role in our ability to support some
> >>>>>>>>> hardware).
> >>>>>>>>>
> >>>>>>>>> As an example this would mean that currently we'd officially
> >>>>>>>>> support CUDA 9.* and 8.  This would imply we support CUDNN 5.1
> >>>>>>>>> through 7, as those libraries are available for CUDA 8, and 9.
> >>>>>>>>> It would also mean we support compute capabilities 3.0-7.x
> >>>>>>>>> (Kepler, Maxwell, Pascal, Volta) taking the more restrictive
> >>>>>>>>> hardware requirements of CUDA 9 into account.
> >>>>>>>>>
> >>>>>>>>> What do you all think?  Would this be a reasonable support
> >>>>>>>>> strategy?  Are these the versions you'd like to see covered in
> >>>>>>>>> CI?
> >>>>>>>>>
> >>>>>>>>> -Kellen
> >>>>>>>>>
> >>>>>>>>> A relevant issue:
> >>>>>>>>> https://github.com/apache/incubator-mxnet/issues/8805
