Dear Community,

One of the recently merged features of the 1.5 release, AMP (Automatic Mixed 
Precision) support (PR [1], design doc [5]), introduced a requirement that 
every new operator added to MXNet would need to be present in 1 of the lists 
(in [2]). To make sure that this requirement is not broken when somebody adds a 
new operator and does not know about AMP's existence, a test was added to CI 
([3]).

A few people reached out to me (the original author of the feature) saying this 
test increases a burden on a developer of new operators and should not be an 
actual error, but just warning (PR for that change [4]). That is why I would 
like to present a motivation for it and discuss with the wider audience why I 
feel it was necessary.

First, for people who do not know the details of what AMP is - it is a solution 
that tries to automatically apply best practices of training in lower precision 
(FP16) to user's FP32 model in order to fully utilize capabilities of modern 
GPUs (and potentially other hardware in the future). It does so by casting to 
lower precision inputs to operators benefitting from it, while casting to full 
precision inputs of operators that are unsafe to run in lower precision or just 
do not support it.

The first iteration of AMP kept 2 main lists of operators - operators that are 
beneficial and safe to do in fp16 and operators that need to be cast to FP32. 
The problem (raised in review of the PR [6], [8]) is how to make sure that the 
feature works as intended and is not inadvertently broken by somebody adding a 
new operator. The failure scenario here is adding a new operator that does not 
support FP16 and so should be cast to FP32, but AMP does not know about its 
existence and so does not do the casting. The solution proposed in the review 
was to implicitly treat all of the unknown operators as FP32-only and keep the 
list of operators that work fine in both FP16 and FP32. This solution however 
does not really work, because there are multiple operators (most notably 
optimizers) where introducing additional casting of the input to FP32 would 
break the operator.

That is why after discussion with a few members of the community, I decided to 
proceed with all lists being explicit and introducing the test that would fail 
when somebody added an operator without classifying it into 1 of the 
categories, and explain clearly how to do it [7]. It is not ideal solution, as 
it introduces some burden on the developers who are not aware about AMP, 
however in the typical case of adding at most a few operators to MXNet the 
inconvenience is I think pretty minor while important for the feature 
correctness going forward.

I would like to gather Community feedback and ideas how to handle this 
situation.

[1] https://github.com/apache/incubator-mxnet/pull/14173
[2] 
https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py
[3] 
https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py
[4] https://github.com/apache/incubator-mxnet/pull/15085
[5] 
https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing
[6] https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019
[7] 
https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80
[8] 
https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341

Reply via email to