> The second misunderstanding is that the ask from the test is to somehow "add 
> support" for AMP in the operator. That is definitely not true and adding a 
> single line in either FP32_FUNCS (cast me always to FP32) or FP16_FP32_FUNCS 
> (do not do anything with me because I'm not relevant for this usecase) is 
> perfectly fine.

I respectfully disagree with this statement. In this case, the "support to add" 
is exactly the work of deciding how the new operator should be treated. That 
decision was not there before, so it is new. And requiring other developers to 
add any such line is not fine at all.
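
For concreteness, the "single line" in question is just the operator's name 
appended to one of the lists in python/mxnet/contrib/amp/lists/symbol.py ([2] 
below); a rough sketch, with a hypothetical operator called my_new_op:

    # python/mxnet/contrib/amp/lists/symbol.py (sketch of the change, not the full file)
    FP32_FUNCS = [
        # ... existing operator names ...
        'my_new_op',   # hypothetical new operator: always cast its inputs to FP32
    ]
    # (or append it to FP16_FP32_FUNCS instead, if no cast is needed)

I am not questioning how small that edit is, only who is being asked to make it.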

> The test ensures that the list is updated at the same time as a new operator 
> is introduced.

It would indeed have that effect. The problem is that the test requires the 
wrong person to add such support. A developer who adds a new operator should be 
under no obligation to do anything with AMP.

Having it as a warning both at runtime and in the test has the effect of 
prompting people who actually use and care about the performance of AMP to take 
those steps and decide where a new operator fits. And like you said, "It is up to 
people like me actively working on better FP16 support to go through the FP32 
list to introduce support and then move such operator to another list." This is 
also exactly how it should work with AMP.
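
It is also worth noting that AMP is opt-in to begin with. Going by the PR ([1] 
below), a user has to enable it explicitly, roughly like this (treat the exact 
calls as my reading of the PR, not a spec):

    from mxnet.contrib import amp

    amp.init()   # opt-in: patch the operators according to the AMP lists
    # ... build the network and Gluon Trainer as usual, then e.g.:
    # amp.init_trainer(trainer)   # enable dynamic loss scaling

A runtime warning raised when AMP hits an unclassified operator would therefore 
be seen exactly by the people who use the feature and can classify it, not by an 
operator developer who never calls amp.init().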

-sz

On 2019/05/29 18:25:46, Przemysław Trędak <[email protected]> wrote: 
> Thank you all for responding.
> 
> Let me address a few misunderstandings about AMP in the discussion so far:
>  - I think the main misunderstanding in the discussion so far is that it 
> would be somehow possible to implicitly cast to FP32 all previously unseen 
> operators. I would much prefer such a solution myself; however, this is 
> actually (unfortunately) not really possible. The technical reason is that the 
> only way to get the list of MXNet operators is the MXListAllOps call, which 
> gives back all operators without any way of differentiating them, including, 
> for example, calls to optimizers. If somebody adds a new optimizer and AMP 
> inserts an FP32 cast before its input, the optimizer would actually update the 
> FP32 copy of the tensor instead of the tensor itself. This is not a 
> performance problem (which, I agree, would be perfectly fine to solve later) 
> - it is a correctness problem. That is why the error message in the test 
> explicitly mentions this case as one where the operator should be put in the 
> list that does not make any casts.
>  - The second misunderstanding is that the ask from the test is to somehow 
> "add support" for AMP in the operator. That is definitely not true and adding 
> a single line in either FP32_FUNCS (cast me always to FP32) or 
> FP16_FP32_FUNCS (do not do anything with me because I'm not relevant for this 
> usecase) is perfectly fine. It is up to people like me actively working on 
> better FP16 support to go through the FP32 list to introduce support and then 
> move such operator to another list.
>  - The third (although minor) misunderstanding is that this feature is 
> completely GPU-specific. It is not, and actually the review comment that 
> prompted me to introduce the design as it is currently implemented comes from 
> an MKLDNN developer :-).
> 
> About the proposal to move it from a Python list into an operator property - I 
> am in favor of it and, as Anirudh mentioned, this should make it much easier 
> when just adding aliases to the operators. But again, this unfortunately could 
> not be an optional parameter, since an automatic "put me in the FP32 list" is 
> not possible.
> About the proposal to change the error into a warning and then rely on people 
> interested in AMP to add operators to the AMP lists - I do not like this 
> proposal because of the user experience it would introduce. Basically, what 
> would happen is that a user who downloads a nightly build of MXNet (because 
> they needed a feature introduced after the last official release) would likely 
> find that AMP is broken in that build, because the build happened to fall 
> between updates to the AMP lists. The test ensures that the list is updated at 
> the same time as a new operator is introduced. This is really a question of 
> how we treat nightly releases - should the user expect all the features to 
> generally work in them or not? If the Community thinks that this is an 
> acceptable outcome, then I am fine with the proposal and with regularly 
> updating those lists myself.
> 
> Thank you
> Przemek
> 
> On 2019/05/28 21:32:42, Przemysław Trędak <[email protected]> wrote: 
> > Dear Community,
> > 
> > One of the recently merged features of the 1.5 release, AMP (Automatic 
> > Mixed Precision) support (PR [1], design doc [5]), introduced a requirement 
> > that every new operator added to MXNet would need to be present in one of the 
> > lists (in [2]). To make sure that this requirement is not broken when 
> > somebody adds a new operator and does not know about AMP's existence, a 
> > test was added to CI ([3]).
> > 
> > A few people reached out to me (the original author of the feature) saying 
> > this test increases the burden on developers of new operators and should not 
> > be an actual error, but just a warning (PR for that change [4]). That is why 
> > I would like to present a motivation for it and discuss with the wider 
> > audience why I feel it was necessary.
> > 
> > First, for people who do not know the details of what AMP is - it is a 
> > solution that tries to automatically apply best practices of training in 
> > lower precision (FP16) to a user's FP32 model in order to fully utilize the 
> > capabilities of modern GPUs (and potentially other hardware in the future). 
> > It does so by casting the inputs of operators that benefit from lower 
> > precision to FP16, while casting the inputs of operators that are unsafe to 
> > run in lower precision (or simply do not support it) to full precision.
> > 
> > The first iteration of AMP kept two main lists of operators - operators that 
> > are beneficial and safe to run in FP16 and operators that need to be cast to 
> > FP32. The problem (raised in the review of the PR [6], [8]) is how to make sure 
> > that the feature works as intended and is not inadvertently broken by 
> > somebody adding a new operator. The failure scenario here is adding a new 
> > operator that does not support FP16 and so should be cast to FP32, but AMP 
> > does not know about its existence and so does not do the casting. The 
> > solution proposed in the review was to implicitly treat all of the unknown 
> > operators as FP32-only and keep the list of operators that work fine in 
> > both FP16 and FP32. This solution however does not really work, because 
> > there are multiple operators (most notably optimizers) where introducing 
> > additional casting of the input to FP32 would break the operator.
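> > 
> > To make the failure mode concrete, here is a rough sketch (plain numpy 
> > standing in for the graph pass, with a hypothetical in-place SGD update - 
> > not the actual AMP code):
> > 
> >     import numpy as np
> > 
> >     def sgd_update(weight, grad, lr):
> >         # an optimizer operator mutates the array it is given, in place
> >         weight -= lr * grad
> > 
> >     param = np.ones(4, dtype=np.float16)    # the model's actual parameter
> >     param_cast = param.astype(np.float32)   # the cast AMP would insert (a copy)
> > 
> >     sgd_update(param_cast, grad=np.full(4, 0.5, np.float32), lr=1.0)
> >     print(param)        # unchanged - the real parameter was never updated
> >     print(param_cast)   # only the temporary FP32 copy was updated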
> > 
> > That is why, after discussion with a few members of the community, I decided 
> > to proceed with all lists being explicit and to introduce a test that fails 
> > when somebody adds an operator without classifying it into one of the 
> > categories, and that explains clearly how to do so [7]. It is not an ideal 
> > solution, as it introduces some burden on developers who are not aware of 
> > AMP; however, in the typical case of adding at most a few operators to 
> > MXNet, the inconvenience is, I think, pretty minor, while being important 
> > for the correctness of the feature going forward.
> > 
> > I would like to gather Community feedback and ideas on how to handle this 
> > situation.
> > 
> > [1] https://github.com/apache/incubator-mxnet/pull/14173
> > [2] 
> > https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py
> > [3] 
> > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py
> > [4] https://github.com/apache/incubator-mxnet/pull/15085
> > [5] 
> > https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing
> > [6] 
> > https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019
> > [7] 
> > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80
> > [8] 
> > https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341
> > 
> 
