+1 to that. I think we don't have to run CUDA 8 on every PR.

On Sat, Jan 6, 2018 at 12:26 PM, Marco de Abreu <[email protected]> wrote:
Very good points, Pracheer.

We have been thinking about running nightly integration tests which would exercise the master branch on a wide range of configurations (including IoT devices). How about switching to CUDA 9 for PR validation, and doing extensive checks on CUDA 8/9 and a variety of other environments during the nightly runs? PRs would then be tested on the latest and most widely used environments. I think this would be a viable solution, because if an issue arises on CUDA 8 but not on CUDA 9, it is something we as a community should investigate rather than just the PR creator, since it could have a wide impact and may also influence other parts of MXNet.

-Marco
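As a rough illustration of this PR-versus-nightly split, here is a minimal Python sketch of how the two test matrices could be wired up. The Docker image tags, naming scheme, and test script path are hypothetical placeholders, not the actual MXNet CI configuration.

```python
"""Sketch of a PR-vs-nightly CI split; image tags and script path are hypothetical."""
import subprocess

# PR validation: only the latest, most widely used environment (CUDA 9).
PR_MATRIX = ["ubuntu_gpu_cuda9_cudnn7"]

# Nightly: the wider base of settings, including the older CUDA 8 toolchain.
NIGHTLY_MATRIX = [
    "ubuntu_gpu_cuda8_cudnn5",
    "ubuntu_gpu_cuda8_cudnn7",
    "ubuntu_gpu_cuda9_cudnn7",
]

def run_suite(environments, test_cmd="ci/runtests.sh"):
    """Run the test command inside one container per environment."""
    for env in environments:
        # Hypothetical image naming scheme: mxnet-ci/<environment>.
        subprocess.run(
            ["docker", "run", "--rm", f"mxnet-ci/{env}", test_cmd],
            check=True,
        )

if __name__ == "__main__":
    run_suite(PR_MATRIX)          # on every pull request
    # run_suite(NIGHTLY_MATRIX)   # on the nightly schedule
```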
On Sat, Jan 6, 2018 at 9:18 PM, pracheer gupta <[email protected]> wrote:

I agree with Naveen that we shouldn't be forcing production systems to update immediately as soon as a new version comes out. If we want MXNet to be adopted more widely, we should think about the convenience of our users and the pain we might be forcing on them. Having said that, as Bhavin pointed out, given our resources it might be tricky to fully support the N-1 version in earnest. In fact, trying to support more configurations runs the risk of lowering the overall quality of support, even for the things that matter, because resources are limited.

I wonder whether there is a compromise that avoids creating a "panic" for production systems to upgrade. For instance, how about supporting all the permutations of software/hardware released in the last two years (or maybe three)? That would give people enough time to upgrade while reducing the number of configurations we need to support.

I also think the discussion in this thread has two sides to it: one specifically about CUDA 8/9 support, and the other a general question of how many configurations the MXNet community should support.

For CUDA 9, assuming it is truly backward compatible (read: no bugs at all), we might be able to require everyone to upgrade after some time (6 months? 1 year?). Until then, should we keep CUDA 8 in the CI system?

Another possible solution is to be strategic for the time being and decide what best helps us in the long term at the cost of short-term pain: officially support only the latest version (for the next few months at least) until the CI system is in a really good place, where it is convenient, easy to use, and easy to extend to more configurations, and then figure out the policy for how many software versions we should support.

-Pracheer

On Jan 6, 2018, at 11:18 AM, Marco de Abreu <[email protected]> wrote:

What do you think about finding out which version of CUDA our users are actually using, and maybe finding out why they didn't upgrade if they are still on an old version? Maybe there are proper business reasons we are not aware of.

-Marco

On Jan 6, 2018, at 8:08 PM, "Naveen Swamy" <[email protected]> wrote:

I will have to disagree with abandoning the N-1 version of dependent libraries as a general guideline for the project. There might be exceptions to this, which should be discussed, agreed on, and **well documented** on the Apache MXNet webpage.

My reasoning is that users running software in production take time to pick up the latest releases for their production environments. In my experience with critical systems, they carefully test and evaluate new software before deploying it. The latest software sometimes has backward-incompatible changes that would break their systems. In order to earn users' trust, it is important that we don't start deprecating software as soon as new libraries come out.

What we could do is announce that, starting with some MXNet version 1.0 + N, we would only support the N+1 library version, with good reasoning like this one (CUDA 9 being backward compatible), and recommend that users upgrade as well. Ideally this would happen when we release a new version of MXNet.

So I think we should support CUDA 8 at least until we release a new version of MXNet, and pre-announce if we plan to drop it.

My 2 cents.

Thanks, Naveen

On Sat, Jan 6, 2018 at 9:48 AM, Bhavin Thaker <[email protected]> wrote:

Hi Marco,

Here are the years in which the GPU architectures were introduced:

- Tesla: 2008
- Fermi: 2010
- Kepler: 2012
- Maxwell: 2014
- Pascal: 2016
- Volta: 2017

I see no need to support the 7+ year old Fermi architecture for fast-moving Apache MXNet.

Bhavin Thaker.

On Sat, Jan 6, 2018 at 9:36 AM, Marco de Abreu <[email protected]> wrote:

Just to provide some data. Dropping CUDA 8 support would deprecate the Fermi architecture (<https://en.wikipedia.org/wiki/Fermi_(microarchitecture)>), effectively affecting the following devices:

Compute capability 2.0 (GF100, GF110): GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M, Quadro 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac, Quadro Plex 7000, Quadro 5010M, Quadro 5000M, Tesla C2075, Tesla C2050/C2070, Tesla M2050/M2070/M2075/M2090

Compute capability 2.1 (GF104, GF106, GF108, GF114, GF116, GF117, GF119): GeForce GTX 560 Ti, GeForce GTX 550 Ti, GeForce GTX 460, GeForce GTS 450, GeForce GTS 450*, GeForce GT 640 (GDDR3), GeForce GT 630, GeForce GT 620, GeForce GT 610, GeForce GT 520, GeForce GT 440, GeForce GT 440*, GeForce GT 430, GeForce GT 430*, GeForce GT 420*, GeForce GTX 675M, GeForce GTX 670M, GeForce GT 635M, GeForce GT 630M, GeForce GT 625M, GeForce GT 720M, GeForce GT 620M, GeForce 710M, GeForce 610M, GeForce 820M, GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M, GeForce GT 520MX, GeForce GT 520M, GeForce GTX 485M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M, GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce 710M, GeForce 410M, Quadro 2000, Quadro 2000D, Quadro 600, Quadro 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 310, NVS 315, NVS 5400M, NVS 5200M, NVS 4200M

-Marco
On Sat, Jan 6, 2018 at 6:31 PM, kellen sunderland <[email protected]> wrote:

I like that proposal, Bhavin. I'm also interested to see what the other community members think.

On Sat, Jan 6, 2018 at 6:27 PM, Bhavin Thaker <[email protected]> wrote:

Hi Kellen,

Here is my opinion and stand on this:

I see no need to test on CUDA 8 in the Apache MXNet CI, especially when CUDA 9 is backward compatible with earlier Nvidia hardware generations. There is a time and resource cost to maintaining the various combinations in the CI, so I am NOT in favor of running CUDA 8 in CI unless there is a technical reason/requirement for it. This approach encourages users to move to the latest CUDA version and thus keeps the open-source community's maintenance cost low by standardizing on the common CUDA 9 configuration.

For example: if a user opens a GitHub issue/problem with Apache MXNet and CUDA 8, I would ask the user to test it with CUDA 9. If the problem happens only on CUDA 8, then a volunteer in the community may work on it. If the problem happens on CUDA 9 as well, then, in my humble opinion, the problem must be fixed by the community. In short, I propose that the MXNet CI run tests only with the latest CUDA 9 version and NOT CUDA 8.

I am eager to hear alternate viewpoints/corrections from folks other than Kellen and me.

Bhavin Thaker.

On Sat, Jan 6, 2018 at 8:24 AM, kellen sunderland <[email protected]> wrote:

Thanks for the thoughts, Bhavin. Supporting only the latest release would also be an option, and it would be easier from a support point of view.

"2) I think your question probably is what should be tested by the Apache MXNet CI and NOT what is supported by Apache MXNet, correct?"

I view these two things as closely related, if not equivalent. If we don't run at least basic tests on old versions of CUDA, I think there will be issues that slip through. That being said, we can rely on users to report these issues, and chances are we'll be able to provide backwards-compatible patches. At a minimum, I'd recommend we run tests on all supported CUDA versions before a release.

-Kellen
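Both Marco's question about which CUDA versions users actually run and Bhavin's suggestion to ask issue reporters to retest on CUDA 9 depend on knowing what is installed on the user's machine. The following is a small, hypothetical helper (not part of MXNet or its issue template) that gathers that information from the standard NVIDIA command-line tools, assuming nvcc and nvidia-smi are on PATH.

```python
"""Collect the local CUDA toolkit release and NVIDIA driver version for bug reports."""
import re
import subprocess

def _run(cmd):
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return ""

def cuda_toolkit_release():
    # nvcc prints e.g. "Cuda compilation tools, release 9.0, V9.0.176"
    match = re.search(r"release\s+([\d.]+)", _run(["nvcc", "--version"]))
    return match.group(1) if match else "unknown"

def driver_version():
    # nvidia-smi prints e.g. "Driver Version: 384.111" in its header
    match = re.search(r"Driver Version:\s*([\d.]+)", _run(["nvidia-smi"]))
    return match.group(1) if match else "unknown"

if __name__ == "__main__":
    print("CUDA toolkit release:", cuda_toolkit_release())
    print("NVIDIA driver version:", driver_version())
```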
On Sat, Jan 6, 2018 at 5:05 PM, Bhavin Thaker <[email protected]> wrote:

Hi Kellen,

1) Does Apache MXNet (Incubating) have a support matrix? I think the answer is no, because I don't know where one is documented. One of the mentors told me earlier that the community uses and modifies the open-source project as per their individual requirements or those of the community. As far as I know, there is no single entity that is responsible for supporting something in MXNet; corrections to my understanding are welcome.

2) I think your question probably is what should be tested by the Apache MXNet CI and NOT what is supported by Apache MXNet, correct?

If yes, I propose testing only the latest CUDA 9 and the respective latest cuDNN version in the MXNet CI, since CUDA 9 is backward compatible with earlier Nvidia hardware generations.

I would like to hear reasons why this would not work.

I have commented on the GitHub issue as well: https://github.com/apache/incubator-mxnet/issues/8805

Bhavin Thaker.

On Sat, Jan 6, 2018 at 3:30 AM, kellen sunderland <[email protected]> wrote:

Hello all, I'd like to propose that we nail down exactly which versions of CUDA we're supporting. We can then ensure that we've got good test coverage for those specific versions in CI. At the moment it's ambiguous what our current policy is, i.e. when do we drop support for old versions? As a result, we could cut a release promising to support a certain version of CUDA, then retroactively drop support after we find an issue.

I'd like to propose that we officially support the N and N-1 versions of CUDA, where N is the most recent major version release. In addition, we can do our best to support the libraries that are available for download for those versions. Supporting these CUDA versions would also dictate which hardware we support in terms of compute capability (of course, resource constraints would also play some role in our ability to support some hardware).

As an example, this would mean that currently we'd officially support CUDA 9.* and 8. This would imply that we support cuDNN 5.1 through 7, as those libraries are available for CUDA 8 and 9. It would also mean we support compute capabilities 3.0-7.x (Kepler, Maxwell, Pascal, Volta), taking the more restrictive hardware requirements of CUDA 9 into account.

What do you all think? Would this be a reasonable support strategy? Are these the versions you'd like to see covered in CI?

-Kellen

A relevant issue: https://github.com/apache/incubator-mxnet/issues/8805
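To make the proposed support window easier to scan, here is a minimal sketch that encodes it as data: CUDA N and N-1, cuDNN 5.1 through 7, and compute capabilities 3.0-7.x (Kepler through Volta). This is illustrative only, not an existing MXNet policy file; the concrete numbers simply mirror the example in the email (CUDA 8/9 as of early 2018) and would need updating whenever a new CUDA major version ships.

```python
"""Illustrative encoding of the proposed CUDA support window (N and N-1)."""

SUPPORTED_CUDA = ("8.0", "9.0")            # N-1 and N
SUPPORTED_CUDNN = ("5.1", "6.0", "7.0")    # releases available for CUDA 8 and 9
COMPUTE_CAPABILITY_RANGE = (3.0, 7.9)      # Kepler (3.0) through Volta (7.x)

def is_supported(cuda_version, device_compute_capability):
    """True if both the toolkit and the device fall inside the proposed window."""
    low, high = COMPUTE_CAPABILITY_RANGE
    return (cuda_version in SUPPORTED_CUDA
            and low <= device_compute_capability <= high)

if __name__ == "__main__":
    print(is_supported("9.0", 6.1))  # True: Pascal card on CUDA 9
    print(is_supported("8.0", 2.1))  # False: Fermi falls below the 3.0 floor
    print(is_supported("7.5", 5.2))  # False: CUDA 7.5 is outside N / N-1
```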
