Very good points, Pracheer. We have been thinking about running nightly integration tests that exercise the master branch on a wide range of configurations (including IoT devices). How about switching to CUDA 9 for PR validation and running extensive checks on CUDA 8/9 and a variety of other environments nightly? PRs would then be tested on the latest and most widely used environments. I think this would be a viable solution: if an issue arises on CUDA 8 but not on CUDA 9, it is something we as a community should investigate rather than just the PR creator, since it could have a wide impact and may also affect other parts of MXNet.
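To make the split concrete, here is a rough sketch of how it could look. Everything in it is hypothetical: the environment labels (`cuda8`, `cuda9`, `iot-device`), the trigger names and the helper functions are made up for illustration and do not correspond to our actual CI configuration.

```python
# Hypothetical sketch only: labels and function names are illustrative,
# not taken from the real MXNet CI setup.

# PR validation runs on the latest, most widely used environment only.
PR_ENVIRONMENTS = ["cuda9"]

# Nightly runs cover the older CUDA version and other targets as well.
NIGHTLY_ENVIRONMENTS = ["cuda8", "cuda9", "iot-device"]


def environments_for(trigger):
    """Return the environments to test for a CI trigger ('pr' or 'nightly')."""
    if trigger == "pr":
        return list(PR_ENVIRONMENTS)
    if trigger == "nightly":
        return list(NIGHTLY_ENVIRONMENTS)
    raise ValueError(f"unknown trigger: {trigger!r}")


def needs_community_triage(results):
    """True when a nightly run fails on cuda8 but passes on cuda9.

    `results` maps an environment label to True (passed) or False (failed).
    """
    return results.get("cuda8") is False and results.get("cuda9") is True
```

With a layout like this, a failure that shows up only on `cuda8` would be flagged for the community to look at instead of blocking the PR creator.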
-Marco

On Sat, Jan 6, 2018 at 9:18 PM, pracheer gupta <[email protected]> wrote:

> I agree with Naveen that we shouldn’t be forcing production systems to forcefully update immediately as soon as the new version comes out. If we want to get MXNet more adopted, we should think about convenience of the customers and the pain we might be forcing on them. Having said that, as Bhavin pointed out, given the resources it might be tricky to completely support n-1 version in earnest. In fact trying to support more configurations runs the risk of resulting in overall lower quality of support even for things that are important due to limited resources.
>
> Wondering if there is a compromise possible where we don’t create a “panic” for production systems to upgrade. For instance, how about just supporting all the permutations of software/hardware launches in last 2 years (or may be 3)? This would give enough time to people to upgrade while reducing the amount of configurations we need to support?
>
> I also think that discussion in this thread seems to have two sides to it. One specifically for cuda8/9 support and other being a general thought process around how many options should mxnet community support.
>
> For cuda9, assuming it is truly backward compatible (read: no bugs at all), we might be able to force everyone to upgrade in sometime (6months? 1 year?). Until then we should keep cuda8 in the ci system?
>
> Another possible solution is to be strategic for the time being and decide what is the best decision that might help us get better in the long term at the cost of short term pains: officially support only the latest (for next few months at least) until we are able to get the CI system to a really good place where it is convenient, easy to use and easy to add support for more configurations and then figure out the policy of how many software versions we should support?
>
> -Pracheer
>
> On Jan 6, 2018, at 11:18 AM, Marco de Abreu <[email protected]> wrote:
>
> > What do you think about finding out which version of cuda our users are actually using and maybe finding out why they didn't upgrade if they are still using an old version? Maybe there are some proper business reasons we are not aware of.
> >
> > -Marco
> >
> > On 06.01.2018 at 8:08 PM, "Naveen Swamy" <[email protected]> wrote:
> >
> >> I will have to disagree with abandoning a N-1 version of the dependent libraries as a general guideline for the project. there might be exceptions to this which should be discussed and agreed on and **well documented** on the Apache MXNet webpage.
> >>
> >> My reasoning is users who are running software in their production environment take time to pick up the latest software to deploy on to their production environments. From my experience for critical systems, they will carefully test and evaluate new software before deploying. The latest software sometimes have backward incompatible features that would break their system. In order to earn trust from users its important we don't start deprecating software as and when new libraries come up.
> >>
> >> What we could do is announce starting version MXNet 1.00... + N we would only support N+1 library with good reasoning like this one CUDA 9 being backward compatible and recommend users to upgrade as well. Ideally this would happen when we release new version of MXNet
> >>
> >> So I think we should support CUDA 8 at least till we release a new version of MXNet and pre-announce if we plan to drop.
> >>
> >> my 2 cents.
> >>
> >> Thanks, Naveen
> >>
> >> On Sat, Jan 6, 2018 at 9:48 AM, Bhavin Thaker <[email protected]> wrote:
> >>
> >>> Hi Marco,
> >>>
> >>> Here are the Years in which the GPU architectures were introduced:
> >>>
> >>> - Tesla: 2008;
> >>> - Fermi: 2010;
> >>> - Kepler: 2012;
> >>> - Maxwell: 2014;
> >>> - Pascal: 2016;
> >>> - Volta: 2017;
> >>>
> >>> I see no need to support the 7+ year old Fermi architecture for fast-moving Apache MXNet.
> >>>
> >>> Bhavin Thaker.
> >>>
> >>> On Sat, Jan 6, 2018 at 9:36 AM Marco de Abreu <[email protected]> wrote:
> >>>
> >>>> Just to provide some data. Dropping CUDA8 support would deprecate the Fermi-Architecture, effectively affecting the following devices:
> >>>>
> >>>> 2.0 Fermi <https://en.wikipedia.org/wiki/Fermi_(microarchitecture)> GF100, GF110 GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M Quadro 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac, Quadro Plex 7000, Quadro 5010M, Quadro 5000M Tesla C2075, Tesla C2050/C2070, Tesla M2050/M2070/M2075/M2090
> >>>> 2.1 GF104, GF106 GF108, GF114, GF116, GF117, GF119 GeForce GTX 560 Ti, GeForce GTX 550 Ti, GeForce GTX 460, GeForce GTS 450, GeForce GTS 450*, GeForce GT 640 (GDDR3), GeForce GT 630, GeForce GT 620, GeForce GT 610, GeForce GT 520, GeForce GT 440, GeForce GT 440*, GeForce GT 430, GeForce GT 430*, GeForce GT 420*,
> >>>> GeForce GTX 675M, GeForce GTX 670M, GeForce GT 635M, GeForce GT 630M, GeForce GT 625M, GeForce GT 720M, GeForce GT 620M, GeForce 710M, GeForce 610M, GeForce 820M, GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M, GeForce GT 520MX, GeForce GT 520M, GeForce GTX 485M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M,
> >>>> GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce 710M, GeForce 410M Quadro 2000, Quadro 2000D, Quadro 600, Quadro 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 310, NVS 315, NVS 5400M, NVS 5200M, NVS 4200M
> >>>>
> >>>> -Marco
> >>>>
> >>>> On Sat, Jan 6, 2018 at 6:31 PM, kellen sunderland <[email protected]> wrote:
> >>>>
> >>>>> I like that proposal Bhavin. I'm also interested to see what the other community members think.
> >>>>>
> >>>>> On Sat, Jan 6, 2018 at 6:27 PM, Bhavin Thaker <[email protected]> wrote:
> >>>>>
> >>>>>> Hi Kellen,
> >>>>>>
> >>>>>> Here is my opinion and stand on this:
> >>>>>>
> >>>>>> I see no need to test on CUDA8 in Apache MXNet CI, especially when CUDA9 is backward compatible with earlier Nvidia hardware generations. There is time and resources cost to maintaining the various combinations in the CI and so I am NOT in favor of running CUDA8 in CI unless there is a technical reason/requirement for it. This approach helps to encourage users to move to the latest CUDA version and thus keep the open-source community’s maintenance cost low for the generic option of CUDA9.
> >>>>>>
> >>>>>> For example: If a user opens a github issue/problem with Apache MXNet and CUDA8, I would ask the user to test it with CUDA9. If the problem happens only on CUDA8, then a volunteer in the community may work on it. If the problem happens on CUDA9 as well, then, in my humble opinion, and this problem must be fixed by the community. In short, I propose that the MXNet CI run tests only with latest CUDA9 version and NOT CUDA8.
> >>>>>>
> >>>>>> I am eager to hear alternate viewpoints/corrections from folks other than Kellen and me.
> >>>>>>
> >>>>>> Bhavin Thaker.
> >>>>>>
> >>>>>> On Sat, Jan 6, 2018 at 8:24 AM kellen sunderland <[email protected]> wrote:
> >>>>>>
> >>>>>>> Thanks for the thoughts Bhavin, supporting the latest release would also be an option, and it would be easier from a support point of view.
> >>>>>>>
> >>>>>>> "2) I think your question probably is what should be tested by the Apache MXNet CI and NOT what is supported by Apache MXNet, correct?"
> >>>>>>>
> >>>>>>> I view these two things as being closely related, if not equivalent. If we don't run at least basic tests of old versions of CUDA I think there will be issues that slip through. That being said we can rely on users to report these issues, and chances are we'll be able to provide backwards compatible patches. At a minimum I'd recommend we should run tests on all supported CUDA versions before a release.
> >>>>>>>
> >>>>>>> -Kellen
> >>>>>>>
> >>>>>>> On Sat, Jan 6, 2018 at 5:05 PM, Bhavin Thaker <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi Kellen,
> >>>>>>>>
> >>>>>>>> 1) Does Apache MXNet (Incubating) have a support matrix? I think the answer is no, because I don’t know of where it is documented. One of the mentors told me earlier that the community uses and modifies the open-source project as per their individual requirements or those of the community. As far as I know, there is no single entity that is responsible for supporting something in MXNet - corrections to my understanding are welcome.
> >>>>>>>>
> >>>>>>>> 2) I think your question probably is what should be tested by the Apache MXNet CI and NOT what is supported by Apache MXNet, correct?
> >>>>>>>>
> >>>>>>>> If yes, I propose testing only the latest CUDA9 and the respective latest cuDNN version in the MXNet CI since CUDA9 is backward compatible with earlier Nvidia hardware generations.
> >>>>>>>>
> >>>>>>>> I would like to hear reasons why this would not work.
> >>>>>>>>
> >>>>>>>> I have commented on the github issue as well:
> >>>>>>>> https://github.com/apache/incubator-mxnet/issues/8805
> >>>>>>>>
> >>>>>>>> Bhavin Thaker.
> >>>>>>>>
> >>>>>>>> On Sat, Jan 6, 2018 at 3:30 AM kellen sunderland <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hello all, I'd like to propose that we nail down exactly which versions of CUDA we're supporting. We can then ensure that we've got good test coverage for those specific versions in CI. At the moment it's ambiguous what our current policy is. I.e. when do we drop support for old versions? As a result we potentially cut a release promising to support a certain version of CUDA, then retroactively drop support after we find an issue.
> >>>>>>>>>
> >>>>>>>>> I'd like to propose that we officially support N, and N-1 versions of CUDA, where N is the most recent major version release. In addition we can do our best to support libraries that are available for download for those versions.
> >>>>>>>>> Supporting these CUDA versions would also dictate which hardware we support in terms of compute capability (of course resource constraints would also play some role in our ability to support some hardware).
> >>>>>>>>>
> >>>>>>>>> As an example this would mean that currently we'd officially support CUDA 9.* and 8. This would imply we support CUDNN 5.1 through 7, as those libraries are available for CUDA 8, and 9. It would also mean we support 3.0-7.x (Kepler, Maxwell, Pascal, Volta) taking the more restrictive hardware requirements of CUDA 9 into account.
> >>>>>>>>>
> >>>>>>>>> What do you all think? Would this be a reasonable support strategy? Are these the versions you'd like to see covered in CI?
> >>>>>>>>>
> >>>>>>>>> -Kellen
> >>>>>>>>>
> >>>>>>>>> A relevant issue: https://github.com/apache/incubator-mxnet/issues/8805
