The assumption is the AMP requirement is something that has a steep learning curve. Developers may get confused by the name, but the question the developer has to essentially answer is (and this can be added in the error): 1. If the operator can run in FP16 and FP32 modes, put it in FP16_FP32_FUNCS. 2. If you want to always run the op in FP16 because of the benefits, put it in FP16. 3. Not sure, Don't know or dont care: FP32 list.
Any operator developer who wrote InferType logic for their operator won't take much time to choose one of these options. Now it may be possible that it took a long time before developer realizes that he has to do this, thats why I suggested we need to reduce the time it takes for him to realize that something was missed. Anirudh On Tue, May 28, 2019 at 4:57 PM Sheng Zha <zhash...@apache.org> wrote: > This is driving people away exactly because they don't know this is what's > asked of them, and why they are asked of this AMP requirement in the first > place. To someone who's already familiar with the context, this is little > to be worried about. It's now the process that requires everyone to become > familiarized with the AMP requirement that becomes costly. More > importantly, this is not a question of whether it's too much, but whether > it should be there in the first place. If it's merely a question of cost I > imagine you'd have no trouble stepping up and support all of the future > operators for AMP :) > > -sz > > On 2019/05/28 23:49:52, Marco de Abreu <marco.g.ab...@gmail.com> wrote: > > I'm having trouble in how far adding the name of an operator in a single > > file is too much to expect from somebody and how this is driving people > > away. > > > > If somebody adds a tutorial, they also have to add the tutorial to a > > specific file. As far as I can tell, this has not resulted in people not > > wanting to write tutorials anymore or it being considered as such a big > > burden. > > > > So far, I'm really not following why adding a single line to a single > file > > is considered such a big deal. Considering how long this guide [1] > already > > is, what's the harm in adding this as an additional instruction? (AMP is > > not mentioned there yet, it would be great if that could be follow up) > > > > [1] > > > https://mxnet.incubator.apache.org/versions/master/faq/add_op_in_backend.html > > > > -Marco > > > > On Wed, May 29, 2019 at 1:42 AM Sheng Zha <zhash...@apache.org> wrote: > > > > > AMP is in contrib so there's no guarantee that the API is final. > Adopting > > > the test as-is is harmful because operator authors should not be > required > > > to invest in an experimental feature that they are not aware of. > > > > > > I'm all for openness and welcoming, but think about whether you'd like > to > > > turn away developers who just want to write a CPU-only operator. The > more > > > you impose on the developers the less likely they will make the > > > contribution through. > > > > > > Having an unfamiliar operator in AMP as a warning could let everyone > know > > > what the support state is whenever that feature is used. For those who > care > > > about this, they would see the warning and add the support to get the > speed > > > benefit of not casting to fp32. In this case, rather than imposing it > to > > > developers who don't know about AMP, the one who actually uses AMP and > > > cares about this feature would drive the work forward. > > > > > > -sz > > > > > > On 2019/05/28 23:25:43, Marco de Abreu <marco.g.ab...@gmail.com> > wrote: > > > > While AMP might be an experimental feature, I rather would like to > put > > > the > > > > focus on the maturity of its interfaces. If the interfaces and the > > > actions > > > > developers have to do aren't finalized yet, I'd agree with disabling > the > > > > test. But if the API is final and easy to use, I don't see why > adopting > > > > early on would be harmful. But from what I can see, the output of the > > > test > > > > is very meaningful and explicit, easily understandable and offers the > > > > developer a clear list of action items that they can follow. > > > > > > > > If people actually start commenting "CI test failure seems unrelated > to > > > my > > > > change, please proceed and merge", we should advise them to please > open > > > the > > > > result tab, which will directly show the clear list of action items. > > > > Committers should support these contributors who are not that > familiar > > > with > > > > our CI system and explain to them how they can retrieve the reasons > for > > > the > > > > failure. Simply ignoring the fact that the change is not compatible > with > > > > AMP doesn't seem to be the right way. > > > > > > > > I think nobody is opposed to having AMP in MXNet. As part of > accepting > > > the > > > > feature, we also added AMP to the established methods including the > > > coding > > > > constraints and other checks that come with it. Lets be open and > welcome > > > to > > > > this new feature and come back if the turnaround time is actually too > > > high. > > > > > > > > -Marco > > > > > > > > > > > > On Wed, May 29, 2019 at 1:13 AM Sheng Zha <zhash...@apache.org> > wrote: > > > > > > > > > The support for AMP should not be a burden of authors of new > operators. > > > > > The lint analogy doesn't apply because lint is for established and > > > accepted > > > > > coding standard at MXNet and AMP is not. AMP is an experimental > feature > > > > > right now and it doesn't make sense to require contributors to > invest > > > in > > > > > it. Keeping this as error will inevitably cause comments like "CI > test > > > > > failure seems unrelated to my change, please proceed and merge". > > > > > > > > > > -sz > > > > > > > > > > On 2019/05/28 22:51:30, Marco de Abreu <marco.g.ab...@gmail.com> > > > wrote: > > > > > > Hi, > > > > > > > > > > > > I'm generally in favour of these kind of tests since they make > > > developers > > > > > > aware of changes they have to make which they would usually not > be > > > aware > > > > > > of. We have a similar test for tutorials, for example. Whenever > > > somebody > > > > > > adds a tutorial, there's a validation that assures that all > > > contraints in > > > > > > our testing environment are met and that they are properly tied > into > > > the > > > > > > system. This AMP test fits into the same category in my opinion > and > > > we > > > > > > never heard bad feedback about these kind of checks. > > > > > > > > > > > > What seems to be bothering people is the fact that the feedback > time > > > is > > > > > too > > > > > > high. Thus, I'd like to propose to move the test into the > sanity-test > > > > > stage > > > > > > instead of doing it as part of the unit tests which take quite a > bit > > > of > > > > > > time until they're actually executed. The sanity checks run > > > immediately > > > > > and > > > > > > give a response within about 1 minute. > > > > > > > > > > > > While I understand that this might increase the amount of work a > > > > > developer > > > > > > has to do if they develop a new operator, I think that this is > the > > > right > > > > > > thing to do. Developers won't know of every single feature other > > > people > > > > > > worked on and thus might simply miss adding the support for it. > This > > > kind > > > > > > of test on the other hand makes them aware of it. If they'd like > to > > > opt > > > > > > out, it's one single line they would have to change and then > they're > > > > > > totally fine. On the other hand, this might motivate them to add > the > > > > > > support since the kernel would be the last piece and everything > else > > > > > would > > > > > > already be implemented. > > > > > > > > > > > > Considering how often a PR gets declined because of linting > errors, > > > I'd > > > > > say > > > > > > that these kind of errors are WAY more frequent that AMP telling > > > somebody > > > > > > to add their operator to a list. Considering that this would only > > > have to > > > > > > be done once per operator, that's work of about one minute. Add > that > > > to > > > > > the > > > > > > waiting time of the sanity check and you're left with about five > > > "wasted" > > > > > > minutes. > > > > > > > > > > > > I'm opposed towards adding a warning or treating them as float32 > by > > > > > default > > > > > > since the operator author wouldn't notice. What will happen is > that > > > > > people > > > > > > won't know about AMP and simply forget about low precision in > general > > > > > until > > > > > > they're actively reminded. This check will remind them actively > and > > > thus > > > > > > bring more attention to the feature. I know that the feature is > still > > > > > > experimental, but we have just started with the 1.6 branch and > thus > > > > > there's > > > > > > enough time to make the experimental features production ready. > > > Adding > > > > > this > > > > > > test early on will allow others to add the support for AMP > during the > > > > > early > > > > > > stage of the 1.6 branch instead of asking them in the last few > weeks > > > > > before > > > > > > a release. The result would only be that stuff is rushed or > > > forgotten. > > > > > > > > > > > > To sum it up: I think this test is good and it should be kept as > > > error, > > > > > but > > > > > > it should be moved to sanity checks. > > > > > > > > > > > > -Marco > > > > > > > > > > > > On Wed, May 29, 2019 at 12:21 AM Sheng Zha <zhash...@apache.org> > > > wrote: > > > > > > > > > > > > > Thanks for initiating the discussion. > > > > > > > > > > > > > > The premise for adding the test was to make sure that AMP > feature > > > is > > > > > "not > > > > > > > broken", but that's IMO not the right view. AMP is not > supposed to > > > > > support > > > > > > > a new operator it hasn't seen before in the first place. > There's no > > > > > way for > > > > > > > it to know whether the fp32 cast should happen or not. So AMP > > > feature > > > > > > > cannot provide the guarantee that it works for all future > > > operators. > > > > > Thus, > > > > > > > adding new operators to AMP list should be considered new > feature > > > > > instead > > > > > > > of fixing existing feature. > > > > > > > > > > > > > > The AMP test that breaks upon the addition of new operator is > thus > > > > > > > equivalent to forcing developers of the new operator to add > the new > > > > > support > > > > > > > for AMP. This feels wrong. Especially given that AMP is an > > > experimental > > > > > > > feature in contrib namespace (i.e. no semver guarantee), this > > > practice > > > > > > > should be stopped immediately. We cannot force new developers > to > > > invest > > > > > > > into experimental feature this way. > > > > > > > > > > > > > > I'd suggest the following changes: > > > > > > > - for new operators that aren't registered in AMP, cast to > float32 > > > by > > > > > > > default and print one-time warning. People using AMP who want > to > > > avoid > > > > > > > casting can register it in the AMP's list. > > > > > > > - change the test to print warning about the operators that > are not > > > > > listed > > > > > > > so that it's easy to track the problem. > > > > > > > > > > > > > > -sz > > > > > > > > > > > > > > On 2019/05/28 21:32:42, Przemys��aw Tr��dak < > ptre...@apache.org> > > > > > wrote: > > > > > > > > Dear Community, > > > > > > > > > > > > > > > > One of the recently merged features of the 1.5 release, AMP > > > > > (Automatic > > > > > > > Mixed Precision) support (PR [1], design doc [5]), introduced a > > > > > requirement > > > > > > > that every new operator added to MXNet would need to be > present in > > > 1 > > > > > of the > > > > > > > lists (in [2]). To make sure that this requirement is not > broken > > > when > > > > > > > somebody adds a new operator and does not know about AMP's > > > existence, a > > > > > > > test was added to CI ([3]). > > > > > > > > > > > > > > > > A few people reached out to me (the original author of the > > > feature) > > > > > > > saying this test increases a burden on a developer of new > > > operators and > > > > > > > should not be an actual error, but just warning (PR for that > change > > > > > [4]). > > > > > > > That is why I would like to present a motivation for it and > discuss > > > > > with > > > > > > > the wider audience why I feel it was necessary. > > > > > > > > > > > > > > > > First, for people who do not know the details of what AMP is > - > > > it is > > > > > a > > > > > > > solution that tries to automatically apply best practices of > > > training > > > > > in > > > > > > > lower precision (FP16) to user's FP32 model in order to fully > > > utilize > > > > > > > capabilities of modern GPUs (and potentially other hardware in > the > > > > > future). > > > > > > > It does so by casting to lower precision inputs to operators > > > > > benefitting > > > > > > > from it, while casting to full precision inputs of operators > that > > > are > > > > > > > unsafe to run in lower precision or just do not support it. > > > > > > > > > > > > > > > > The first iteration of AMP kept 2 main lists of operators - > > > operators > > > > > > > that are beneficial and safe to do in fp16 and operators that > need > > > to > > > > > be > > > > > > > cast to FP32. The problem (raised in review of the PR [6], > [8]) is > > > how > > > > > to > > > > > > > make sure that the feature works as intended and is not > > > inadvertently > > > > > > > broken by somebody adding a new operator. The failure scenario > > > here is > > > > > > > adding a new operator that does not support FP16 and so should > be > > > cast > > > > > to > > > > > > > FP32, but AMP does not know about its existence and so does > not do > > > the > > > > > > > casting. The solution proposed in the review was to implicitly > > > treat > > > > > all of > > > > > > > the unknown operators as FP32-only and keep the list of > operators > > > that > > > > > work > > > > > > > fine in both FP16 and FP32. This solution however does not > really > > > work, > > > > > > > because there are multiple operators (most notably optimizers) > > > where > > > > > > > introducing additional casting of the input to FP32 would > break the > > > > > > > operator. > > > > > > > > > > > > > > > > That is why after discussion with a few members of the > > > community, I > > > > > > > decided to proceed with all lists being explicit and > introducing > > > the > > > > > test > > > > > > > that would fail when somebody added an operator without > > > classifying it > > > > > into > > > > > > > 1 of the categories, and explain clearly how to do it [7]. It > is > > > not > > > > > ideal > > > > > > > solution, as it introduces some burden on the developers who > are > > > not > > > > > aware > > > > > > > about AMP, however in the typical case of adding at most a few > > > > > operators to > > > > > > > MXNet the inconvenience is I think pretty minor while important > > > for the > > > > > > > feature correctness going forward. > > > > > > > > > > > > > > > > I would like to gather Community feedback and ideas how to > handle > > > > > this > > > > > > > situation. > > > > > > > > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/14173 > > > > > > > > [2] > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py > > > > > > > > [3] > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py > > > > > > > > [4] https://github.com/apache/incubator-mxnet/pull/15085 > > > > > > > > [5] > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing > > > > > > > > [6] > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019 > > > > > > > > [7] > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80 > > > > > > > > [8] > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >