Re: [VOTE] Release MXNet version 1.3.0.RC0

Thomas DELTEIL Tue, 04 Sep 2018 18:03:02 -0700

-0
(non-binding)

If I may add some nuancing plus a personal data point as one of the users
commenting in the bug report in question:


- Performance vs. Basic functionality => I don't think high performance
use-cases and basic functionality are two obviously opposed concepts and
see no contradiction in Hagay's and Sandeep's statements.
Float16 support is feature of MXNet that provides more than twice the
performance of Float32 on supported platforms, hence the high performance
use-case. The bug is that the basic functionality of reloading a saved
float16 models is currently broken.

- This bug vs Other bugs => Contrary the vast majority of the 140 open bugs
that are mentioned above, I would put to Sandeep's credit that this one bug
has a PR open that provides a fix for it. This would make it a better
candidate to get included in this release than a bug that has no fix ready
for it.

- Personal datapoint: I recently did some experimentation with float16 [1]
and actually coincidentally just published a video on optimizing
performance for Gluon. Float16 conversion is one of the most, if not the
most effective way to get performance out of MXNet [2]. I believe there is
a lot of value in publicizing more its use and hence making sure at least
the basic support for normal use-cases is present.

Of course this needs to be balanced with the overhead of preparing a new
release candidate once the fixed is reviewed and merged, which seems to be
a lengthy and complex process in its own right, and the delay with
providing the other features present in 1.3 for users that are not running
off the nightly builds.

All the best,

Thomas

[1] https://github.com/ThomasDelteil/PerformanceTricksMXNetGluon
[2]
https://www.youtube.com/watch?v=Cqo7FPftNyo&t=0s&list=PLkEvNnRk8uVk6U515Pj-jHQUxFC4eDi3m

Le mar. 4 sept. 2018 à 17:11, Sheng Zha <szha....@gmail.com> a écrit :

> Sandeep,
>
> Thanks for explaining your veto. We have open bugs that impacted a lot more
> than just 3 customers, just by referring to the number of commenters on the
> issue [1].
>
> You said that this is for "high performance use cases", which contradicts
> with Hagay's assement that this is "basic functionality broken". Given that
> this is for advanced use cases of using half-precision training, why is it
> so much more important than any other open bug reports, that for this
> specific bug fix, we have to delay the access of regular users to the new
> MXNet 1.3 release by at least another week?
>
> Honestly, I'm concerned that your vote is biased by Amazon involvement,
> given that you quoted Amazon Rekognition.
>
> -sz
>
> [1]
>
> https://github.com/apache/incubator-mxnet/issues?q=is%3Aissue+is%3Aopen+label%3ABug+sort%3Acomments-desc
>
> On Tue, Sep 4, 2018 at 4:51 PM sandeep krishnamurthy <
> sandeep.krishn...@gmail.com> wrote:
>
> > My initial vote of “-0” was due to lack of info from a user who had said,
> > he overcame this issue for FP16 model.
> >
> >
> > However, suggested workaround [1] for the issue is not straight forward
> and
> > generally usable for all users. Also, issue is not simple and isolated to
> > be listed in the Release Notes as known issue with a workaround.
> >
> >
> > Changing my vote to: "-1 (binding)" owing to the user impact [3]
> >
> >
> >
> > @Sheng:
> >
> > 1. Agreed, bug existed from long time. However, FP16 and such
> optimizations
> > were added later on. Followed by users [2] using this feature for high
> > performance use cases. It is not ok to measure severity of the bug based
> on
> > its past existence, rather we can see who is impacted now and is it a
> small
> > subset with a simple workaround or large user impacting issue.
> >
> > 2. Agreed bug was reported 7/21. However, I became aware of this issue on
> > 08/29 and submitted the fix on 08/30. Also, I did bring this to the
> notice
> > of community, you and 1.3 release manager (Roshani) on the RC0 proposal
> > thread. Also, I would focus on the issue and user impact than who
> > identified and who is fixing the issue.
> >
> >
> > Based on my discussion with 2 users, I think it is a important feature
> for
> > them to see in Apache MXNet v1.3.0.
> >
> >
> >
> > Best,
> >
> > Sandeep
> >
> >
> > [1] Workaround used by the user.
> >
> >
> > net_fp16 = mx.gluon.SymbolBlock.imports('resnet34_fp16-symbol.json',
> > ['data'])
> >
> > params_fp16 = mx.nd.load('resnet34_fp16-0000.params')
> >
> >
> > for k, v in params_fp16.items():
> >
> >     new_key = k.split(':')[1]
> >
> >     net_fp16.collect_params()[new_key].cast(v.dtype)
> >
> >
> > net_fp16.collect_params().load('resnet34_fp16-0000.params', ctx)
> >
> >
> > [2] Amazon Rekognition
> >
> >
> > [3] User story: Train a model -> Cast it to FP16 -> Save the model ->
> Load
> > back the model does not work. They have to cast every parameter with a
> > workaround mentioned above [1].
> >
> > On Tue, Sep 4, 2018 at 4:14 PM Hagay Lupesko <lupe...@gmail.com> wrote:
> >
> > > Hi Sheng,
> > >
> > > Addressing your questions:
> > >
> > > - "why this specific bug is more important than all the other known
> bugs,
> > > that this becomes a release blocker"
> > > I do not consider it to be more or less important than other fixes. It
> > can
> > > be fixed and included in the release alongside the rest of the release
> > > content, right?
> > > From the description of the issue it seems important since it is
> blocking
> > > users from loading models that were previously trained and saved. There
> > is
> > > nothing stopping the community from including this fix into 1.3.0,
> > > alongside the rest of the features and fixes.
> > >
> > > - "The bug exists since SymbolBlock was introduced a year ago and has
> > > survived at least three releases, so this is not a regression."
> > > I do not think I said it is a regression. However, the fact a bug
> existed
> > > before, does not mean it is OK to release it rather than fix it.
> > >
> > > - "Timeline-wise, this bug was reported on 7/21, but was not reported
> as
> > > release-blocker in the release discussion thread until 8/31 [1].
> Neither
> > > its reporting as release-blocker nor its fix made it for the 8/3 code
> > > freeze."
> > > You are right, would have been better to have this identified and fixed
> > > earlier and included before code freeze.
> > >
> > > - "The PR is still not ready yet as it doesn't have approval."
> > > I think it is waiting for your review.
> > >
> > > - "it would be great if you could provide some additional reasoning
> > besides
> > > "X mentions the issue" or "fix was done by X""
> > > I have. Repeating what I wrote in my previous email for clarity: Basic
> > > functionality broken: loading a model (albeit one that that was saved
> as
> > > non FP32)
> > >
> > > So, yes - this issue seems to have been out there for a while, somehow
> > went
> > > under the radar... but I think the key question is whether this blocks
> a
> > > basic functionality in MXNet. I believe so, hence my -1 vote.
> > >
> > > Hagay
> > >
> > > On Tue, Sep 4, 2018 at 1:19 PM Sheng Zha <szha....@gmail.com> wrote:
> > >
> > > > Hi Hagay and Sandeep,
> > > >
> > > > Could you help us understand why this specific bug is more important
> > than
> > > > all the other known bugs, that this becomes a release blocker?
> > > >
> > > > Some facts to consider:
> > > > - The bug exists since SymbolBlock was introduced a year ago and has
> > > > survived at least three releases, so this is not a regression.
> > > > - Timeline-wise, this bug was reported on 7/21, but was not reported
> as
> > > > release-blocker in the release discussion thread until 8/31 [1].
> > Neither
> > > > its reporting as release-blocker nor its fix made it for the 8/3 code
> > > > freeze.
> > > > - The PR is still not ready yet as it doesn't have approval.
> > > >
> > > > Hagay, it would be great if you could provide some additional
> reasoning
> > > > besides "X mentions the issue" or "fix was done by X". Thanks.
> > > >
> > > > -sz
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/d1ed611f98c20d5d85c294b0c07c8bdebca13a209cf66a3872c9123e@%3Cdev.mxnet.apache.org%3E
> > > >
> > > > On Tue, Sep 4, 2018 at 12:39 PM Hagay Lupesko <lupe...@gmail.com>
> > wrote:
> > > >
> > > > > Sandeep mentions the issue of an error when user tries to load
> model
> > > > params
> > > > > trained/saved as FP16.
> > > > > https://github.com/apache/incubator-mxnet/issues/11849
> > > > > The fix was done by Sandeep:
> > > > > https://github.com/apache/incubator-mxnet/pull/12412 and is ready
> to
> > > be
> > > > > cherry picked into the release branch.
> > > > >
> > > > > This seems like a release blocker to me:
> > > > > - Basic functionality broken: loading a model (albeit one that that
> > was
> > > > > saved as non FP32)
> > > > > - Reported by 3 users (wgchang@, nicklhy@ and ThomasDelteil@)
> > > > >
> > > > > -1 (non binding)
> > > > >
> > > > > Hagay
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Sep 4, 2018 at 12:01 PM sandeep krishnamurthy <
> > > > > sandeep.krishn...@gmail.com> wrote:
> > > > >
> > > > > > "- 0"
> > > > > >
> > > > > > I believe the bug #11849
> > > > > > <https://github.com/apache/incubator-mxnet/issues/11849>, unable
> > to
> > > > > import
> > > > > > non-fp32 models into Gluon, fixed in this PR #12412
> > > > > > <https://github.com/apache/incubator-mxnet/pull/12412> is
> > important
> > > > for
> > > > > > the
> > > > > > users. I would rather pick this fix in this release than plan a
> > minor
> > > > > > release later.
> > > > > >
> > > > > > Best,
> > > > > > Sandeep
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Sep 3, 2018 at 2:34 PM Philip Cho <
> > > chohy...@cs.washington.edu>
> > > > > > wrote:
> > > > > >
> > > > > > > Actually, the command "git clone --recursive
> > > > > > > https://github.com/apache/incubator-mxnet -b 1.3.0.rc0" works
> > fine
> > > > > now,
> > > > > > > never mind.
> > > > > > >
> > > > > > > On Mon, Sep 3, 2018 at 1:45 PM Philip Cho <
> > > > chohy...@cs.washington.edu>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Unfortunately, MXNet was depending on a branch of TVM that is
> > now
> > > > > > > deleted.
> > > > > > > > We will have to merge #12448
> > > > > > > > <https://github.com/apache/incubator-mxnet/pull/12448>
> before
> > > the
> > > > > > > release.
> > > > > > > >
> > > > > > > > Background: See dmlc/tvm#1394 <
> > > > > https://github.com/dmlc/tvm/issues/1394
> > > > > > >.
> > > > > > > >
> > > > > > > > Philip.
> > > > > > > >
> > > > > > > > On Mon, Sep 3, 2018 at 7:26 AM Carin Meier <
> > carinme...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > >> Checked out the tag, built and tested the Clojure package.
> +1
> > > > > > > >>
> > > > > > > >> On Fri, Aug 31, 2018 at 10:59 PM Roshani Nagmote <
> > > > > > > >> roshaninagmo...@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > Hi all,
> > > > > > > >> >
> > > > > > > >> > I would like to propose a vote to release Apache MXNet
> > > > > (incubating)
> > > > > > > >> version
> > > > > > > >> > 1.3.0.RC0. Voting will start now (Friday, Aug 31st) and
> end
> > at
> > > > > 7:00
> > > > > > PM
> > > > > > > >> > PDT, Wednesday, Sept 5th.
> > > > > > > >> >
> > > > > > > >> > Link to release notes:
> > > > > > > >> > https://github.com/apache/incubator-mxnet/releases
> > > > > > > >> >
> > > > > > > >> > Link to release candidate 1.3.0.rc0:
> > > > > > > >> > *
> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.3.0.rc
> > > > > > > >> > <
> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.3.0.rc0
> > > > > > >0*
> > > > > > > >> >
> > > > > > > >> > View this page, click on "Build from Source", and use the
> > > source
> > > > > > code
> > > > > > > >> > obtained from 1.3.0.rc0 tag:
> > > > > > > >> > https://mxnet.incubator.apache.org/install/index.html
> > > > > > > >> >
> > > > > > > >> > Please remember to TEST first before voting accordingly:
> > > > > > > >> >
> > > > > > > >> > +1 = approve
> > > > > > > >> > +0 = no opinion
> > > > > > > >> > -1 = disapprove (provide reason)
> > > > > > > >> >
> > > > > > > >> > Thanks,
> > > > > > > >> > Roshani
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sandeep Krishnamurthy
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Sandeep Krishnamurthy
> >
>

Re: [VOTE] Release MXNet version 1.3.0.RC0

Reply via email to