Hey Jun, could we please quantify the amount of time and effort that is required to follow the actions to add an operator to the FP32_FUNCS? To me it sounds like we are making this a bigger deal than it actually is.
-Marco On Wed, May 29, 2019 at 1:20 AM Jun Wu <[email protected]> wrote: > Thanks for initiating the discussion on dev. > > I understand the dilemma from designing AMP for making the feature usable > and functional as well as for not breaking other developer experience. > However, IMO, this is not about WHEN we should let other developers know > they have made a mistake by not adding newly developed ops to the AMP list, > but rather about whether we should do it in this way and thinking about its > long-term impact on development. > > Asking other developers to delve into the feature they are not necessarily > aware of has imposed a too strong burden and will hinder others' > development work. FP16 is a feature nice-to-have in model training, but has > not become a standard so that every developer should be aware of. As we are > growing the community by encouraging more junior developers to contribute, > such error not caused by their work would actually hold them back from > moving forward. > > IMO, AMP is not supposed cover the operators it has not seen and should > treat them as those in FP32_FUNCS to be safe. We should come up with a > maintenance plan of promoting newly added operators that can benefit from > FP16 computing into the appropriate lists, instead of shifting the burden > away to many other developers at the moment. > > On Tue, May 28, 2019 at 4:13 PM Sheng Zha <[email protected]> wrote: > > > The support for AMP should not be a burden of authors of new operators. > > The lint analogy doesn't apply because lint is for established and > accepted > > coding standard at MXNet and AMP is not. AMP is an experimental feature > > right now and it doesn't make sense to require contributors to invest in > > it. Keeping this as error will inevitably cause comments like "CI test > > failure seems unrelated to my change, please proceed and merge". > > > > -sz > > > > On 2019/05/28 22:51:30, Marco de Abreu <[email protected]> wrote: > > > Hi, > > > > > > I'm generally in favour of these kind of tests since they make > developers > > > aware of changes they have to make which they would usually not be > aware > > > of. We have a similar test for tutorials, for example. Whenever > somebody > > > adds a tutorial, there's a validation that assures that all contraints > in > > > our testing environment are met and that they are properly tied into > the > > > system. This AMP test fits into the same category in my opinion and we > > > never heard bad feedback about these kind of checks. > > > > > > What seems to be bothering people is the fact that the feedback time is > > too > > > high. Thus, I'd like to propose to move the test into the sanity-test > > stage > > > instead of doing it as part of the unit tests which take quite a bit of > > > time until they're actually executed. The sanity checks run immediately > > and > > > give a response within about 1 minute. > > > > > > While I understand that this might increase the amount of work a > > developer > > > has to do if they develop a new operator, I think that this is the > right > > > thing to do. Developers won't know of every single feature other people > > > worked on and thus might simply miss adding the support for it. This > kind > > > of test on the other hand makes them aware of it. If they'd like to opt > > > out, it's one single line they would have to change and then they're > > > totally fine. On the other hand, this might motivate them to add the > > > support since the kernel would be the last piece and everything else > > would > > > already be implemented. > > > > > > Considering how often a PR gets declined because of linting errors, I'd > > say > > > that these kind of errors are WAY more frequent that AMP telling > somebody > > > to add their operator to a list. Considering that this would only have > to > > > be done once per operator, that's work of about one minute. Add that to > > the > > > waiting time of the sanity check and you're left with about five > "wasted" > > > minutes. > > > > > > I'm opposed towards adding a warning or treating them as float32 by > > default > > > since the operator author wouldn't notice. What will happen is that > > people > > > won't know about AMP and simply forget about low precision in general > > until > > > they're actively reminded. This check will remind them actively and > thus > > > bring more attention to the feature. I know that the feature is still > > > experimental, but we have just started with the 1.6 branch and thus > > there's > > > enough time to make the experimental features production ready. Adding > > this > > > test early on will allow others to add the support for AMP during the > > early > > > stage of the 1.6 branch instead of asking them in the last few weeks > > before > > > a release. The result would only be that stuff is rushed or forgotten. > > > > > > To sum it up: I think this test is good and it should be kept as error, > > but > > > it should be moved to sanity checks. > > > > > > -Marco > > > > > > On Wed, May 29, 2019 at 12:21 AM Sheng Zha <[email protected]> > wrote: > > > > > > > Thanks for initiating the discussion. > > > > > > > > The premise for adding the test was to make sure that AMP feature is > > "not > > > > broken", but that's IMO not the right view. AMP is not supposed to > > support > > > > a new operator it hasn't seen before in the first place. There's no > > way for > > > > it to know whether the fp32 cast should happen or not. So AMP feature > > > > cannot provide the guarantee that it works for all future operators. > > Thus, > > > > adding new operators to AMP list should be considered new feature > > instead > > > > of fixing existing feature. > > > > > > > > The AMP test that breaks upon the addition of new operator is thus > > > > equivalent to forcing developers of the new operator to add the new > > support > > > > for AMP. This feels wrong. Especially given that AMP is an > experimental > > > > feature in contrib namespace (i.e. no semver guarantee), this > practice > > > > should be stopped immediately. We cannot force new developers to > invest > > > > into experimental feature this way. > > > > > > > > I'd suggest the following changes: > > > > - for new operators that aren't registered in AMP, cast to float32 by > > > > default and print one-time warning. People using AMP who want to > avoid > > > > casting can register it in the AMP's list. > > > > - change the test to print warning about the operators that are not > > listed > > > > so that it's easy to track the problem. > > > > > > > > -sz > > > > > > > > On 2019/05/28 21:32:42, Przemys��aw Tr��dak <[email protected]> > > wrote: > > > > > Dear Community, > > > > > > > > > > One of the recently merged features of the 1.5 release, AMP > > (Automatic > > > > Mixed Precision) support (PR [1], design doc [5]), introduced a > > requirement > > > > that every new operator added to MXNet would need to be present in 1 > > of the > > > > lists (in [2]). To make sure that this requirement is not broken when > > > > somebody adds a new operator and does not know about AMP's > existence, a > > > > test was added to CI ([3]). > > > > > > > > > > A few people reached out to me (the original author of the feature) > > > > saying this test increases a burden on a developer of new operators > and > > > > should not be an actual error, but just warning (PR for that change > > [4]). > > > > That is why I would like to present a motivation for it and discuss > > with > > > > the wider audience why I feel it was necessary. > > > > > > > > > > First, for people who do not know the details of what AMP is - it > is > > a > > > > solution that tries to automatically apply best practices of training > > in > > > > lower precision (FP16) to user's FP32 model in order to fully utilize > > > > capabilities of modern GPUs (and potentially other hardware in the > > future). > > > > It does so by casting to lower precision inputs to operators > > benefitting > > > > from it, while casting to full precision inputs of operators that are > > > > unsafe to run in lower precision or just do not support it. > > > > > > > > > > The first iteration of AMP kept 2 main lists of operators - > operators > > > > that are beneficial and safe to do in fp16 and operators that need to > > be > > > > cast to FP32. The problem (raised in review of the PR [6], [8]) is > how > > to > > > > make sure that the feature works as intended and is not inadvertently > > > > broken by somebody adding a new operator. The failure scenario here > is > > > > adding a new operator that does not support FP16 and so should be > cast > > to > > > > FP32, but AMP does not know about its existence and so does not do > the > > > > casting. The solution proposed in the review was to implicitly treat > > all of > > > > the unknown operators as FP32-only and keep the list of operators > that > > work > > > > fine in both FP16 and FP32. This solution however does not really > work, > > > > because there are multiple operators (most notably optimizers) where > > > > introducing additional casting of the input to FP32 would break the > > > > operator. > > > > > > > > > > That is why after discussion with a few members of the community, I > > > > decided to proceed with all lists being explicit and introducing the > > test > > > > that would fail when somebody added an operator without classifying > it > > into > > > > 1 of the categories, and explain clearly how to do it [7]. It is not > > ideal > > > > solution, as it introduces some burden on the developers who are not > > aware > > > > about AMP, however in the typical case of adding at most a few > > operators to > > > > MXNet the inconvenience is I think pretty minor while important for > the > > > > feature correctness going forward. > > > > > > > > > > I would like to gather Community feedback and ideas how to handle > > this > > > > situation. > > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/14173 > > > > > [2] > > > > > > > https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py > > > > > [3] > > > > > > > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py > > > > > [4] https://github.com/apache/incubator-mxnet/pull/15085 > > > > > [5] > > > > > > > https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing > > > > > [6] > > > > > > > https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019 > > > > > [7] > > > > > > > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80 > > > > > [8] > > > > > > > https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341 > > > > > > > > > > > > > > >
