Re: Making new operators and AMP lists

Jun Wu Tue, 28 May 2019 20:18:44 -0700

Hi Marco,

As it has been stated by others, the concern is not about how much cost it
results in, but about whether it should be there from the beginning. Just
imagine how you would explain to a MKLDNN developer who just added an
operator for CPU computing that he/she must put the operator name in AMP
list, which is used for training jobs on GPUs. The requirement imposed on
every developer responsible for adding operators in the future deviates
from the philosophy of making MXNet easy to use and develop. From this
perspective, this is a big deal.


Design-wise, I agree that making the requirement as an operator attribute
in the backend makes sense. In addition to that, we need to pay attention
to the following two conditions:
1. The attribute should be optional as FInplace, because for the most
common cases, people would just use FP32 version without worrying about
overflow/underflow. It's not as essential as attributes FInferShape,
FInferType, and FCompute that must exist to run the most common cases.
2. The attribute should be device/kernel-specific as FCompute, since CPU
implementation may be totally different than GPU.

Thanks,
Jun

On Tue, May 28, 2019 at 5:08 PM Anirudh Subramanian <[email protected]>
wrote:

> The assumption is the AMP requirement is something that has a steep
> learning curve. Developers may get confused by the name, but the question
> the developer has to essentially answer is (and this can be added in the
> error):
> 1. If the operator can run in FP16 and FP32 modes, put it in
> FP16_FP32_FUNCS.
> 2. If you want to always run the op in FP16 because of the benefits, put it
> in FP16.
> 3. Not sure, Don't know or dont care: FP32 list.
>
> Any operator developer who wrote InferType logic for their operator won't
> take much time to choose one of these options.
> Now it may be possible that it took a long time before developer realizes
> that he has to do this,
> thats why I suggested we need to reduce the time it takes for him to
> realize that something was missed.
>
> Anirudh
>
> On Tue, May 28, 2019 at 4:57 PM Sheng Zha <[email protected]> wrote:
>
> > This is driving people away exactly because they don't know this is
> what's
> > asked of them, and why they are asked of this AMP requirement in the
> first
> > place. To someone who's already familiar with the context, this is little
> > to be worried about. It's now the process that requires everyone to
> become
> > familiarized with the AMP requirement that becomes costly. More
> > importantly, this is not a question of whether it's too much, but whether
> > it should be there in the first place. If it's merely a question of cost
> I
> > imagine you'd have no trouble stepping up and support all of the future
> > operators for AMP :)
> >
> > -sz
> >
> > On 2019/05/28 23:49:52, Marco de Abreu <[email protected]> wrote:
> > > I'm having trouble in how far adding the name of an operator in a
> single
> > > file is too much to expect from somebody and how this is driving people
> > > away.
> > >
> > > If somebody adds a tutorial, they also have to add the tutorial to a
> > > specific file. As far as I can tell, this has not resulted in people
> not
> > > wanting to write tutorials anymore or it being considered as such a big
> > > burden.
> > >
> > > So far, I'm really not following why adding a single line to a single
> > file
> > > is considered such a big deal. Considering how long this guide [1]
> > already
> > > is, what's the harm in adding this as an additional instruction? (AMP
> is
> > > not mentioned there yet, it would be great if that could be follow up)
> > >
> > > [1]
> > >
> >
> https://mxnet.incubator.apache.org/versions/master/faq/add_op_in_backend.html
> > >
> > > -Marco
> > >
> > > On Wed, May 29, 2019 at 1:42 AM Sheng Zha <[email protected]> wrote:
> > >
> > > > AMP is in contrib so there's no guarantee that the API is final.
> > Adopting
> > > > the test as-is is harmful because operator authors should not be
> > required
> > > > to invest in an experimental feature that they are not aware of.
> > > >
> > > > I'm all for openness and welcoming, but think about whether you'd
> like
> > to
> > > > turn away developers who just want to write a CPU-only operator. The
> > more
> > > > you impose on the developers the less likely they will make the
> > > > contribution through.
> > > >
> > > > Having an unfamiliar operator in AMP as a warning could let everyone
> > know
> > > > what the support state is whenever that feature is used. For those
> who
> > care
> > > > about this, they would see the warning and add the support to get the
> > speed
> > > > benefit of not casting to fp32. In this case, rather than imposing it
> > to
> > > > developers who don't know about AMP, the one who actually uses AMP
> and
> > > > cares about this feature would drive the work forward.
> > > >
> > > > -sz
> > > >
> > > > On 2019/05/28 23:25:43, Marco de Abreu <[email protected]>
> > wrote:
> > > > > While AMP might be an experimental feature, I rather would like to
> > put
> > > > the
> > > > > focus on the maturity of its interfaces. If the interfaces and the
> > > > actions
> > > > > developers have to do aren't finalized yet, I'd agree with
> disabling
> > the
> > > > > test. But if the API is final and easy to use, I don't see why
> > adopting
> > > > > early on would be harmful. But from what I can see, the output of
> the
> > > > test
> > > > > is very meaningful and explicit, easily understandable and offers
> the
> > > > > developer a clear list of action items that they can follow.
> > > > >
> > > > > If people actually start commenting "CI test failure seems
> unrelated
> > to
> > > > my
> > > > > change, please proceed and merge", we should advise them to please
> > open
> > > > the
> > > > > result tab, which will directly show the clear list of action
> items.
> > > > > Committers should support these contributors who are not that
> > familiar
> > > > with
> > > > > our CI system and explain to them how they can retrieve the reasons
> > for
> > > > the
> > > > > failure. Simply ignoring the fact that the change is not compatible
> > with
> > > > > AMP doesn't seem to be the right way.
> > > > >
> > > > > I think nobody is opposed to having AMP in MXNet. As part of
> > accepting
> > > > the
> > > > > feature, we also added AMP to the established methods including the
> > > > coding
> > > > > constraints and other checks that come with it. Lets be open and
> > welcome
> > > > to
> > > > > this new feature and come back if the turnaround time is actually
> too
> > > > high.
> > > > >
> > > > > -Marco
> > > > >
> > > > >
> > > > > On Wed, May 29, 2019 at 1:13 AM Sheng Zha <[email protected]>
> > wrote:
> > > > >
> > > > > > The support for AMP should not be a burden of authors of new
> > operators.
> > > > > > The lint analogy doesn't apply because lint is for established
> and
> > > > accepted
> > > > > > coding standard at MXNet and AMP is not. AMP is an experimental
> > feature
> > > > > > right now and it doesn't make sense to require contributors to
> > invest
> > > > in
> > > > > > it. Keeping this as error will inevitably cause comments like "CI
> > test
> > > > > > failure seems unrelated to my change, please proceed and merge".
> > > > > >
> > > > > > -sz
> > > > > >
> > > > > > On 2019/05/28 22:51:30, Marco de Abreu <[email protected]>
> > > > wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I'm generally in favour of these kind of tests since they make
> > > > developers
> > > > > > > aware of changes they have to make which they would usually not
> > be
> > > > aware
> > > > > > > of. We have a similar test for tutorials, for example. Whenever
> > > > somebody
> > > > > > > adds a tutorial, there's a validation that assures that all
> > > > contraints in
> > > > > > > our testing environment are met and that they are properly tied
> > into
> > > > the
> > > > > > > system. This AMP test fits into the same category in my opinion
> > and
> > > > we
> > > > > > > never heard bad feedback about these kind of checks.
> > > > > > >
> > > > > > > What seems to be bothering people is the fact that the feedback
> > time
> > > > is
> > > > > > too
> > > > > > > high. Thus, I'd like to propose to move the test into the
> > sanity-test
> > > > > > stage
> > > > > > > instead of doing it as part of the unit tests which take quite
> a
> > bit
> > > > of
> > > > > > > time until they're actually executed. The sanity checks run
> > > > immediately
> > > > > > and
> > > > > > > give a response within about 1 minute.
> > > > > > >
> > > > > > > While I understand that this might increase the amount of work
> a
> > > > > > developer
> > > > > > > has to do if they develop a new operator, I think that this is
> > the
> > > > right
> > > > > > > thing to do. Developers won't know of every single feature
> other
> > > > people
> > > > > > > worked on and thus might simply miss adding the support for it.
> > This
> > > > kind
> > > > > > > of test on the other hand makes them aware of it. If they'd
> like
> > to
> > > > opt
> > > > > > > out, it's one single line they would have to change and then
> > they're
> > > > > > > totally fine. On the other hand, this might motivate them to
> add
> > the
> > > > > > > support since the kernel would be the last piece and everything
> > else
> > > > > > would
> > > > > > > already be implemented.
> > > > > > >
> > > > > > > Considering how often a PR gets declined because of linting
> > errors,
> > > > I'd
> > > > > > say
> > > > > > > that these kind of errors are WAY more frequent that AMP
> telling
> > > > somebody
> > > > > > > to add their operator to a list. Considering that this would
> only
> > > > have to
> > > > > > > be done once per operator, that's work of about one minute. Add
> > that
> > > > to
> > > > > > the
> > > > > > > waiting time of the sanity check and you're left with about
> five
> > > > "wasted"
> > > > > > > minutes.
> > > > > > >
> > > > > > > I'm opposed towards adding a warning or treating them as
> float32
> > by
> > > > > > default
> > > > > > > since the operator author wouldn't notice. What will happen is
> > that
> > > > > > people
> > > > > > > won't know about AMP and simply forget about low precision in
> > general
> > > > > > until
> > > > > > > they're actively reminded. This check will remind them actively
> > and
> > > > thus
> > > > > > > bring more attention to the feature. I know that the feature is
> > still
> > > > > > > experimental, but we have just started with the 1.6 branch and
> > thus
> > > > > > there's
> > > > > > > enough time to make the experimental features production ready.
> > > > Adding
> > > > > > this
> > > > > > > test early on will allow others to add the support for AMP
> > during the
> > > > > > early
> > > > > > > stage of the 1.6 branch instead of asking them in the last few
> > weeks
> > > > > > before
> > > > > > > a release. The result would only be that stuff is rushed or
> > > > forgotten.
> > > > > > >
> > > > > > > To sum it up: I think this test is good and it should be kept
> as
> > > > error,
> > > > > > but
> > > > > > > it should be moved to sanity checks.
> > > > > > >
> > > > > > > -Marco
> > > > > > >
> > > > > > > On Wed, May 29, 2019 at 12:21 AM Sheng Zha <
> [email protected]>
> > > > wrote:
> > > > > > >
> > > > > > > > Thanks for initiating the discussion.
> > > > > > > >
> > > > > > > > The premise for adding the test was to make sure that AMP
> > feature
> > > > is
> > > > > > "not
> > > > > > > > broken", but that's IMO not the right view. AMP is not
> > supposed to
> > > > > > support
> > > > > > > > a new operator it hasn't seen before in the first place.
> > There's no
> > > > > > way for
> > > > > > > > it to know whether the fp32 cast should happen or not. So AMP
> > > > feature
> > > > > > > > cannot provide the guarantee that it works for all future
> > > > operators.
> > > > > > Thus,
> > > > > > > > adding new operators to AMP list should be considered new
> > feature
> > > > > > instead
> > > > > > > > of fixing existing feature.
> > > > > > > >
> > > > > > > > The AMP test that breaks upon the addition of new operator is
> > thus
> > > > > > > > equivalent to forcing developers of the new operator to add
> > the new
> > > > > > support
> > > > > > > > for AMP. This feels wrong. Especially given that AMP is an
> > > > experimental
> > > > > > > > feature in contrib namespace (i.e. no semver guarantee), this
> > > > practice
> > > > > > > > should be stopped immediately. We cannot force new developers
> > to
> > > > invest
> > > > > > > > into experimental feature this way.
> > > > > > > >
> > > > > > > > I'd suggest the following changes:
> > > > > > > > - for new operators that aren't registered in AMP, cast to
> > float32
> > > > by
> > > > > > > > default and print one-time warning. People using AMP who want
> > to
> > > > avoid
> > > > > > > > casting can register it in the AMP's list.
> > > > > > > > - change the test to print warning about the operators that
> > are not
> > > > > > listed
> > > > > > > > so that it's easy to track the problem.
> > > > > > > >
> > > > > > > > -sz
> > > > > > > >
> > > > > > > > On 2019/05/28 21:32:42, Przemys��aw Tr��dak <
> > [email protected]>
> > > > > > wrote:
> > > > > > > > > Dear Community,
> > > > > > > > >
> > > > > > > > > One of the recently merged features of the 1.5 release, AMP
> > > > > > (Automatic
> > > > > > > > Mixed Precision) support (PR [1], design doc [5]),
> introduced a
> > > > > > requirement
> > > > > > > > that every new operator added to MXNet would need to be
> > present in
> > > > 1
> > > > > > of the
> > > > > > > > lists (in [2]). To make sure that this requirement is not
> > broken
> > > > when
> > > > > > > > somebody adds a new operator and does not know about AMP's
> > > > existence, a
> > > > > > > > test was added to CI ([3]).
> > > > > > > > >
> > > > > > > > > A few people reached out to me (the original author of the
> > > > feature)
> > > > > > > > saying this test increases a burden on a developer of new
> > > > operators and
> > > > > > > > should not be an actual error, but just warning (PR for that
> > change
> > > > > > [4]).
> > > > > > > > That is why I would like to present a motivation for it and
> > discuss
> > > > > > with
> > > > > > > > the wider audience why I feel it was necessary.
> > > > > > > > >
> > > > > > > > > First, for people who do not know the details of what AMP
> is
> > -
> > > > it is
> > > > > > a
> > > > > > > > solution that tries to automatically apply best practices of
> > > > training
> > > > > > in
> > > > > > > > lower precision (FP16) to user's FP32 model in order to fully
> > > > utilize
> > > > > > > > capabilities of modern GPUs (and potentially other hardware
> in
> > the
> > > > > > future).
> > > > > > > > It does so by casting to lower precision inputs to operators
> > > > > > benefitting
> > > > > > > > from it, while casting to full precision inputs of operators
> > that
> > > > are
> > > > > > > > unsafe to run in lower precision or just do not support it.
> > > > > > > > >
> > > > > > > > > The first iteration of AMP kept 2 main lists of operators -
> > > > operators
> > > > > > > > that are beneficial and safe to do in fp16 and operators that
> > need
> > > > to
> > > > > > be
> > > > > > > > cast to FP32. The problem (raised in review of the PR [6],
> > [8]) is
> > > > how
> > > > > > to
> > > > > > > > make sure that the feature works as intended and is not
> > > > inadvertently
> > > > > > > > broken by somebody adding a new operator. The failure
> scenario
> > > > here is
> > > > > > > > adding a new operator that does not support FP16 and so
> should
> > be
> > > > cast
> > > > > > to
> > > > > > > > FP32, but AMP does not know about its existence and so does
> > not do
> > > > the
> > > > > > > > casting. The solution proposed in the review was to
> implicitly
> > > > treat
> > > > > > all of
> > > > > > > > the unknown operators as FP32-only and keep the list of
> > operators
> > > > that
> > > > > > work
> > > > > > > > fine in both FP16 and FP32. This solution however does not
> > really
> > > > work,
> > > > > > > > because there are multiple operators (most notably
> optimizers)
> > > > where
> > > > > > > > introducing additional casting of the input to FP32 would
> > break the
> > > > > > > > operator.
> > > > > > > > >
> > > > > > > > > That is why after discussion with a few members of the
> > > > community, I
> > > > > > > > decided to proceed with all lists being explicit and
> > introducing
> > > > the
> > > > > > test
> > > > > > > > that would fail when somebody added an operator without
> > > > classifying it
> > > > > > into
> > > > > > > > 1 of the categories, and explain clearly how to do it [7]. It
> > is
> > > > not
> > > > > > ideal
> > > > > > > > solution, as it introduces some burden on the developers who
> > are
> > > > not
> > > > > > aware
> > > > > > > > about AMP, however in the typical case of adding at most a
> few
> > > > > > operators to
> > > > > > > > MXNet the inconvenience is I think pretty minor while
> important
> > > > for the
> > > > > > > > feature correctness going forward.
> > > > > > > > >
> > > > > > > > > I would like to gather Community feedback and ideas how to
> > handle
> > > > > > this
> > > > > > > > situation.
> > > > > > > > >
> > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/14173
> > > > > > > > > [2]
> > > > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py
> > > > > > > > > [3]
> > > > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py
> > > > > > > > > [4] https://github.com/apache/incubator-mxnet/pull/15085
> > > > > > > > > [5]
> > > > > > > >
> > > > > >
> > > >
> >
> https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing
> > > > > > > > > [6]
> > > > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019
> > > > > > > > > [7]
> > > > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80
> > > > > > > > > [8]
> > > > > > > >
> > > > > >
> > > >
> >
> https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Making new operators and AMP lists

Reply via email to