Thanks Przemek for the additional explanation, but I’m still confused about 
this part. I don’t follow the explanation of how AMP interacts with the 
optimizers here. 

>> The technical reason for that is that the only place from which one can 
>> get MXNet operators is the MXListAllOps call, which gives back all 
>> operators without any way of differentiating them, including, for example, 
>> the optimizer update operators. If somebody added a new optimizer and AMP 
>> inserted an FP32 cast before its input, the optimizer would actually 
>> update the FP32 copy of the tensor instead of the tensor itself. This is 
>> not a performance problem (which, I agree, would be perfectly fine to 
>> solve later) - it is a correctness problem.

>> That is why the error message in the test explicitly mentions this case as 
>> one where the operator should be put in the list for which no casts are 
>> inserted.
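
For concreteness, here is how I currently picture the failure mode, using 
the NDArray API rather than the symbolic pass that AMP actually rewrites 
(my own sketch, so please correct me if I have it wrong):

    import mxnet as mx

    # A weight the optimizer is supposed to update in place.
    weight = mx.nd.ones((3,), dtype='float16')
    grad = mx.nd.ones((3,), dtype='float16')

    # Normal behaviour: sgd_update writes the result back into `weight`.
    mx.nd.sgd_update(weight, grad, lr=0.1, out=weight)

    # If AMP blindly wrapped the optimizer's inputs in FP32 casts, each
    # cast would produce a *copy*, the update would land on that copy,
    # and the original `weight` would silently stay stale.
    weight_fp32 = mx.nd.amp_cast(weight, dtype='float32')
    grad_fp32 = mx.nd.amp_cast(grad, dtype='float32')
    mx.nd.sgd_update(weight_fp32, grad_fp32, lr=0.1, out=weight_fp32)

Is that the gist of the correctness problem?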


If this is the third list that Anirudh mentions:

>> 3. Not sure, don't know, or don't care: FP32 list.


and this list simply contains all operators that have not been categorized, why 
do we need the list at all? 

Thanks,
Sam


> On May 29, 2019, at 11:58 AM, Sheng Zha <[email protected]> wrote:
> 
>> The second misunderstanding is that the ask from the test is to somehow 
>> "add support" for AMP in the operator. That is definitely not true, and 
>> adding a single line to either FP32_FUNCS (cast me always to FP32) or 
>> FP16_FP32_FUNCS (do not do anything with me because I'm not relevant for 
>> this use case) is perfectly fine.
> 
> I respectfully disagree with this statement. In this case, the "support to 
> add" is exactly to differentiate how the new operator should be treated. It 
> was not there before, so it's new. And requiring other developers to add any 
> line is totally not fine.
> 
>> The test ensures that the list is updated at the same time as a new 
>> operator is introduced.
> 
> It would indeed have that effect. The problem is that the test requires the 
> wrong person to add such support. A developer who adds a new operator should 
> be under no obligation to do anything with AMP.
> 
> Having it as a warning both at runtime and in the test has the effect of 
> prompting people who actually use and care about the performance gains 
> from AMP to take the steps and decide where a new operator fits. And like 
> you said, "It is up to people like me actively working on better FP16 
> support to go through the FP32 list to introduce support and then move 
> such operators to another list." This is also exactly how it should work 
> with AMP.
> 
> -sz
> 
> On 2019/05/29 18:25:46, Przemysław Trędak <[email protected]> wrote: 
>> Thank you all for responding.
>> 
>> Let me address a few misunderstandings about AMP in the discussion so far:
>> - I think the main misunderstanding is that it would be somehow possible 
>> to implicitly cast all previously unseen operators to FP32. I would much 
>> prefer such a solution myself; however, this is actually (unfortunately) 
>> not really possible. The technical reason for that is that the only place 
>> from which one can get MXNet operators is the MXListAllOps call, which 
>> gives back all operators without any way of differentiating them, 
>> including, for example, the optimizer update operators. If somebody added 
>> a new optimizer and AMP inserted an FP32 cast before its input, the 
>> optimizer would actually update the FP32 copy of the tensor instead of 
>> the tensor itself. This is not a performance problem (which, I agree, 
>> would be perfectly fine to solve later) - it is a correctness problem. 
>> That is why the error message in the test explicitly mentions this case 
>> as one where the operator should be put in the list for which no casts 
>> are inserted.
>> - The second misunderstanding is that the ask from the test is to somehow 
>> "add support" for AMP in the operator. That is definitely not true, and 
>> adding a single line to either FP32_FUNCS (cast me always to FP32) or 
>> FP16_FP32_FUNCS (do not do anything with me because I'm not relevant for 
>> this use case) is perfectly fine - see the sketch right after this list. 
>> It is up to people like me actively working on better FP16 support to go 
>> through the FP32 list to introduce support and then move such operators 
>> to another list.
>> - The third (although minor) misunderstanding is that this feature is 
>> completely GPU-specific. It is not - in fact, the review comment that 
>> prompted me to introduce the design as currently implemented came from an 
>> MKLDNN developer :-).
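>> 
>> To make the second point concrete, the whole "support" asked of an 
>> operator author is a single entry in one of the lists in 
>> python/mxnet/contrib/amp/lists/symbol.py ('my_new_op' below is a made-up 
>> name for illustration):
>> 
>>     FP32_FUNCS = [
>>         # ... existing entries ...
>>         'my_new_op',  # always cast my inputs to FP32
>>     ]
>> 
>>     # or, if the operator is simply irrelevant for mixed precision:
>>     FP16_FP32_FUNCS = [
>>         # ... existing entries ...
>>         'my_new_op',  # leave my inputs alone
>>     ]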
>> 
>> About the proposal to move this from the Python lists into an operator 
>> property - I am in favor of it and, as Anirudh mentioned, it should make 
>> things much easier when just adding aliases to operators. But again, this 
>> could not be an optional parameter, unfortunately, since an automatic 
>> "put me in the FP32 list" default is not possible.
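>> 
>> Purely as a sketch of that direction (nothing like this exists today, and 
>> all the names here are made up): if each operator carried an AMP category 
>> property at registration time, the Python lists could be derived instead 
>> of hand-maintained, e.g.
>> 
>>     # Hypothetical per-operator categories, stood in for by a dict.
>>     AMP_CATEGORY = {'Convolution': 'FP16', 'exp': 'FP32',
>>                     'sgd_update': 'FP16_FP32'}
>> 
>>     FP32_FUNCS = [op for op, cat in AMP_CATEGORY.items()
>>                   if cat == 'FP32']
>> 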
>> About the proposal to change the error into a warning and then rely on 
>> people interested in AMP to add operators to the AMP lists - I do not 
>> like this proposal because of the user experience it would introduce. 
>> Basically, what would happen is that a user who downloads a nightly build 
>> of MXNet (because they needed a feature introduced after the last 
>> official release) would likely find that AMP is broken in that build, 
>> because the build happened to fall between updates to the AMP lists. The 
>> test ensures that the list is updated at the same time as a new operator 
>> is introduced. This is really a question of how we treat nightly releases 
>> - should the user expect all the features to generally work in them or 
>> not? If the Community thinks that this is an acceptable outcome, then I 
>> am fine with the proposal and with regularly updating those lists myself.
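>> 
>> For reference, what the test checks is conceptually very simple - a rough 
>> sketch of the real check in tests/python/unittest/test_amp.py (which also 
>> handles the conditional and widest-dtype lists) is:
>> 
>>     import ctypes
>>     from mxnet.base import _LIB, check_call, py_str
>>     from mxnet.contrib.amp.lists import symbol as amp_lists
>> 
>>     # Ask the C API for every registered operator name.
>>     plist = ctypes.POINTER(ctypes.c_char_p)()
>>     size = ctypes.c_uint()
>>     check_call(_LIB.MXListAllOpNames(ctypes.byref(size),
>>                                      ctypes.byref(plist)))
>>     all_ops = {py_str(plist[i]) for i in range(size.value)}
>> 
>>     covered = (set(amp_lists.FP16_FUNCS) | set(amp_lists.FP32_FUNCS)
>>                | set(amp_lists.FP16_FP32_FUNCS))
>>     # The test fails with instructions if any operator is uncovered.
>>     uncovered = all_ops - covered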
>> 
>> Thank you
>> Przemek
>> 
>> On 2019/05/28 21:32:42, Przemysław Trędak <[email protected]> wrote: 
>>> Dear Community,
>>> 
>>> One of the recently merged features of the 1.5 release, AMP (Automatic 
>>> Mixed Precision) support (PR [1], design doc [5]), introduced a 
>>> requirement that every new operator added to MXNet needs to be present 
>>> in one of the lists (in [2]). To make sure that this requirement is not 
>>> broken when somebody adds a new operator and does not know about AMP's 
>>> existence, a test was added to CI ([3]).
>>> 
>>> A few people reached out to me (the original author of the feature) 
>>> saying that this test increases the burden on developers of new 
>>> operators and should not be an actual error, but just a warning (PR for 
>>> that change [4]). That is why I would like to present the motivation for 
>>> it and discuss with the wider audience why I feel it was necessary.
>>> 
>>> First, for people who do not know the details of what AMP is - it is a 
>>> solution that tries to automatically apply the best practices of 
>>> training in lower precision (FP16) to a user's FP32 model, in order to 
>>> fully utilize the capabilities of modern GPUs (and potentially other 
>>> hardware in the future). It does so by casting the inputs of operators 
>>> that benefit from lower precision down to FP16, while casting the inputs 
>>> of operators that are unsafe to run in lower precision, or simply do not 
>>> support it, back to full precision.
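>>> 
>>> As a minimal illustration of the kind of rewrite this means (the graph 
>>> here is made up; amp_cast is the casting operator the feature inserts):
>>> 
>>>     import mxnet as mx
>>> 
>>>     data = mx.sym.Variable('data')
>>>     # An operator that benefits from FP16 gets its input cast down...
>>>     conv = mx.sym.Convolution(mx.sym.amp_cast(data, dtype='float16'),
>>>                               num_filter=8, kernel=(3, 3), name='conv')
>>>     # ...while a numerically sensitive one is cast back up to FP32.
>>>     out = mx.sym.exp(mx.sym.amp_cast(conv, dtype='float32'), name='out')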
>>> 
>>> The first iteration of AMP kept 2 main lists of operators - operators 
>>> that are beneficial and safe to run in FP16, and operators that need to 
>>> be cast to FP32. The problem (raised in the review of the PR [6], [8]) 
>>> is how to make sure that the feature works as intended and is not 
>>> inadvertently broken by somebody adding a new operator. The failure 
>>> scenario here is adding a new operator that does not support FP16 and so 
>>> should be cast to FP32, but AMP does not know about its existence and so 
>>> does not do the casting. The solution proposed in the review was to 
>>> implicitly treat all of the unknown operators as FP32-only and keep only 
>>> the list of operators that work fine in both FP16 and FP32. This 
>>> solution, however, does not really work, because there are multiple 
>>> operators (most notably the optimizer updates) where introducing an 
>>> additional cast of the input to FP32 would break the operator.
>>> 
>>> That is why, after discussion with a few members of the community, I 
>>> decided to proceed with all lists being explicit and to introduce a test 
>>> that fails when somebody adds an operator without classifying it into 
>>> one of the categories, explaining clearly how to do so [7]. It is not an 
>>> ideal solution, as it introduces some burden on developers who are not 
>>> aware of AMP; however, in the typical case of adding at most a few 
>>> operators to MXNet, the inconvenience is, I think, pretty minor, while 
>>> being important for the feature's correctness going forward.
>>> 
>>> I would like to gather Community feedback and ideas on how to handle 
>>> this situation.
>>> 
>>> [1] https://github.com/apache/incubator-mxnet/pull/14173
>>> [2] 
>>> https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py
>>> [3] 
>>> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py
>>> [4] https://github.com/apache/incubator-mxnet/pull/15085
>>> [5] 
>>> https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing
>>> [6] 
>>> https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019
>>> [7] 
>>> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80
>>> [8] 
>>> https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341
>>> 
>> 
