Hi dev@,

I was testing GluonNLP with MXNet master, and found that BERT training
crashes a few hours after I launch the job. I can confirm that MXNet pip
package 20190412 works fine. I am bisecting changes in MXNet/GluonNLP to
check what causes the problem. I'll send an update as soon as I find the
root cause, or if I find any workaround.

Thanks,
Haibin

On Thu, May 23, 2019 at 2:12 AM Lin Yuan <apefor...@gmail.com> wrote:

> Hi Lai,
>
> One important PR is currently blocked by a flaky TensorRT test:
>
> https://github.com/apache/incubator-mxnet/pull/15041
>
> I have retriggered it several times. If it fails again, I may need the CI
> team's help to disable this test. It has been reported by multiple people:
> https://github.com/apache/incubator-mxnet/issues/14978
>
> Thanks,
>
> Lin
>
> On Wed, May 22, 2019 at 11:38 PM Zhao, Patric <patric.z...@intel.com>
> wrote:
>
> > Thanks, Lai.
> >
> > With great help from the community, all PRs listed in the roadmap are
> > done :)
> >
> > https://github.com/apache/incubator-mxnet/issues/14619#issuecomment-480110642
> >
> > Here is a status update on the list below:
> >
> >  - [1] PR#14713 is almost done and is waiting for internal validation results
> >  - [2] PR#14893 is merged
> >  - [3] PR#15031 is merged
> >  - [7] PR#15038 is a new PR to fix a bug in the C++ interface; it will be
> > merged soon after review.
> >
> > Feel free to let me know if there is anything our team can help with :)
> >
> > BR,
> >
> > --Patric
> >
> > > -----Original Message-----
> > > From: Lai Wei [mailto:roywei...@gmail.com]
> > > Sent: Thursday, May 23, 2019 6:05 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: [DISCUSS] 1.5.0 Release Plan
> > >
> > > Hi @dev,
> > >
> > > Thanks for working hard on the 1.5 release. Since there have been several
> > > release blockers (mostly fixed), we are extending the code freeze to
> > > Friday 05/22/2019. Right now we are tracking the following 5 open
> > > PRs [1][2][3][4][5] and 1 issue [6]. Please let us know if you need more
> > > time.
> > >
> > > I would like to encourage all downstream projects to test with the latest
> > > MXNet to avoid any incompatibility in the coming 1.5.0 release. If you
> > > have any issues that may block the release, please let us know.
> > > Thank you very much.
> > >
> > > [1] https://github.com/apache/incubator-mxnet/pull/14713
> > > [2] https://github.com/apache/incubator-mxnet/pull/14893
> > > [3] https://github.com/apache/incubator-mxnet/pull/15031
> > > [4] https://github.com/apache/incubator-mxnet/pull/15039
> > > [5] https://github.com/apache/incubator-mxnet/pull/15041
> > > [6] https://github.com/apache/incubator-mxnet/issues/15034
> > >
> > >
> > > Best Regards
> > >
> > > Lai
> > >
> > >
> > > On Wed, May 15, 2019 at 9:05 PM Junru Shao <junrushao1...@gmail.com>
> > > wrote:
> > >
> > > > Hi folks,
> > > >
> > > > I may have a release blocker for 1.5.0: the implementation of the
> > > > dynamic shape mechanism somehow conflicts with Gluon's deferred
> > > > initialization [1].
> > > >
> > > > [1] https://github.com/dmlc/gluon-nlp/issues/706
> > > >
> > > > On Wed, May 15, 2019 at 12:09 PM Anirudh Subramanian <
> > > > anirudh2...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Lai,
> > > > >
> > > > > From the discussion I had with Nvidia offline, they are aiming to push
> > > > > the required changes today.
> > > > > Since this is an important feature for the release, if it gets
> > > > > delayed and cannot be merged by 05/17/2019, the code freeze date
> > > > > may need to be changed.
> > > > >
> > > > > Anirudh
> > > > >
> > > > > On Wed, May 15, 2019 at 1:23 AM Lv, Tao A <tao.a...@intel.com>
> > wrote:
> > > > >
> > > > > > Hi dev,
> > > > > >
> > > > > > There are several GitHub issues [1][2][3][4] about the MXNet
> > > > > > Windows build experience. The team is working intensively
> > > > > > [5][6][7] to fix some problems with the MKL-DNN build on Windows.
> > > > > > We hope these fixes can make the code freeze and be included in
> > > > > > the 1.5.0 release.
> > > > > >
> > > > > > The PR against mshadow (#374) has already been merged, and MXNet PR
> > > > > > #14877 is under review. Many thanks to the CI team for helping
> > > > > > with the MKL installation request. PR #14952 is a documentation
> > > > > > change that reflects the build logic changes in PR #14877, so I
> > > > > > think these two PRs should be merged together.
> > > > > > Currently #14877 is experiencing a CI response problem.
> > > > > >
> > > > > > Please take some time to look at these two PRs. Your comments
> > > > > > and suggestions are highly appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > > -tao
> > > > > >
> > > > > > [1] https://github.com/apache/incubator-mxnet/issues/14670
> > > > > > [2] https://github.com/apache/incubator-mxnet/issues/14335
> > > > > > [3] https://github.com/apache/incubator-mxnet/issues/14203
> > > > > > [4] https://github.com/apache/incubator-mxnet/issues/14085
> > > > > > [5] https://github.com/apache/incubator-mxnet/pull/14877
> > > > > > [6] https://github.com/dmlc/mshadow/pull/374
> > > > > > [7] https://github.com/apache/incubator-mxnet/pull/14952
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Lai Wei [mailto:roywei...@gmail.com]
> > > > > > Sent: Wednesday, May 15, 2019 2:57 PM
> > > > > > To: dev@mxnet.incubator.apache.org
> > > > > > Subject: Re: [DISCUSS] 1.5.0 Release Plan
> > > > > >
> > > > > > Hi Anirudh,
> > > > > >
> > > > > > I see there was an offline discussion
> > > > > > <https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341>
> > > > > > and I have updated the AMP feature and your project on the release
> > > > > > tracker
> > > > > > <https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Plan+and+Status>.
> > > > > > Please let me know if you have any updates.
> > > > > >
> > > > > > Hi @dev,
> > > > > > This is a gentle reminder that the code freeze for the 1.5.0 release
> > > > > > is on 05/17/2019. Please let us know if you have any WIP pull
> > > > > > requests aiming for 1.5.0 that need attention.
> > > > > > Please understand we already have around 650 commits in master
> > > > > > that need to be released in time. We understand the TensorRT test
> > > > > > in CI is failing and are trying to fix it. Meanwhile, please update
> > > > > > the tracker if there is any change:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Plan+and+Status
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Lai
> > > > > >
> > > > > >
> > > > > > On Wed, May 8, 2019 at 11:58 AM Anirudh Subramanian <
> > > > > anirudh2...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Sheng,
> > > > > > >
> > > > > > > I had a discussion with NVIDIA folks offline today (@ptrendx et al.).
> > > > > > > I strongly feel that the AMP feature should be included as part
> > > > > > > of the release:
> > > > > > > https://github.com/apache/incubator-mxnet/pull/14173
> > > > > > > The PR is aimed for completion next week, but reviews and RFC
> > > > > > > discussions may take some time. I would like to request extending
> > > > > > > the release code freeze by 2 weeks.
> > > > > > > Also, I would like to include
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/MXNET/Conversion+from+FP32+to+Mixed+Precision+Models
> > > > > > >
> > > > > > > which depends on the AMP PR.
> > > > > > > I am also aiming to add a PR by this weekend or early next
> > > > > > > week, but reviews will run past May 17th.
> > > > > > >
> > > > > > > Anirudh
> > > > > > >
> > > > > > >
> > > > > > > On Mon, May 6, 2019 at 11:49 PM Sheng Zha <szha....@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > While the 1.4.1 vote on general@incubator is still ongoing, I’d
> > > > > > > > like to propose that we start preparing the 1.5.0 release.
> > > > > > > >
> > > > > > > > 1.5.0 will include changes that date back to last year, and
> > > > > > > > there have been a lot of new features and improvements in it,
> > > > > > > > so it will likely take us more time to prepare than 1.4.1. I
> > > > > > > > propose the following timeline:
> > > > > > > > - Cut release branch: release branch already cut. Will sync
> > > > > > > > with master branch on 5/15/2019 EOD.
> > > > > > > > - Code freeze: 5/17/2019. No more changes unless the release
> > > > > > > > branch is in a broken state.
> > > > > > > > - Tag and vote: 5/20/2019 onward.
> > > > > > > >
> > > > > > > > Lai Wei (roywei@) expressed to me offline that he’s willing to
> > > > > > > > help drive this release as release manager, and I’m happy to
> > > > > > > > help again as committer.
> > > > > > > >
> > > > > > > > If you have features in progress that you’d like to include in
> > > > > > > > 1.5.0:
> > > > > > > > - Add your feature to the scope:
> > > > > > > >   https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Plan+and+Status
> > > > > > > > - Indicate in this thread:
> > > > > > > >   - how confident you are about making it happen before the
> > > > > > > >     code freeze. If not confident, provide an estimate for a
> > > > > > > >     more manageable code freeze date so that people can discuss
> > > > > > > >     whether to extend the deadline or to skip one release for it.
> > > > > > > >   - whether your PR requires more attention to make it happen.
> > > > > > > >
> > > > > > > > Thanks for your attention. Comments and suggestions are also
> > > > > > > > welcome.
> > > > > > > >
> > > > > > > > -sz
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
>
