Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi Anirudh,

Thanks for jumping into this quickly; I followed up on the issue. That was meant for the sockeye developers/maintainers, to help set up nightly tests and raise issues early.

Thanks!

On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin wrote:

> In GluonNLP we test against the MXNet nightly build for each PR, and our CI
> has caught some MXNet-related issues.
> I recommend that other toolkits also add integration tests against the MXNet
> nightly build. It helps identify issues early.
>
> Best,
> Haibin
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
In GluonNLP we test against the MXNet nightly build for each PR, and our CI has caught some MXNet-related issues. I recommend that other toolkits also add integration tests against the MXNet nightly build. It helps identify issues early.

Best,
Haibin

On Thu, Jun 20, 2019 at 18:52 Zhao, Patric wrote:

> Thanks for raising the issue; we will take a look ASAP.
>
> The downstream test cases are not in the MXNet CI, so it's hard for MXNet
> developers to catch potential bugs or performance degradation.
>
> In the future, I suggest adding the major downstream test cases, such as
> those from sockeye, GluonNLP, GluonCV, DGL, and Gluon-TS, to the nightly tests.
> If that is still too heavy, maybe test them weekly or monthly :)
>
> Thanks,
>
> --Patric
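A downstream project could script the kind of nightly check recommended here roughly as follows. This is only a sketch under stated assumptions: it assumes the MXNet nightly build can be installed as a pre-release of the mxnet package through pip's --pre option, and that the project's tests run with python setup.py test (the command used for sockeye in this thread). Both function names are hypothetical.

```python
import subprocess
import sys


def nightly_check_commands(package="mxnet", test_cmd=("setup.py", "test")):
    """Return the commands for one nightly integration run (hypothetical helper)."""
    python = sys.executable
    return [
        # Assumption: nightly builds are published as pre-releases of `package`.
        [python, "-m", "pip", "install", "--upgrade", "--pre", package],
        # Test command taken from the thread: python setup.py test
        [python, *test_cmd],
    ]


def run_nightly_check(commands, runner=subprocess.run):
    """Run each command in order and return the first failing one, or None."""
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            return cmd  # the failing step, useful for an issue report
    return None
```

Returning the failing command rather than raising keeps it easy to attach the exact step to an upstream issue, which is the early reporting the thread asks for.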
RE: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Thanks for raising the issue; we will take a look ASAP.

The downstream test cases are not in the MXNet CI, so it's hard for MXNet developers to catch potential bugs or performance degradation.

In the future, I suggest adding the major downstream test cases, such as those from sockeye, GluonNLP, GluonCV, DGL, and Gluon-TS, to the nightly tests. If that is still too heavy, maybe test them weekly or monthly :)

Thanks,

--Patric

> -Original Message-
> From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> Sent: Friday, June 21, 2019 9:31 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
>
> Hi Lai,
>
> I have opened an issue:
> https://github.com/apache/incubator-mxnet/issues/15297
> I came to know about this issue only today, and I have not been monitoring
> sockeye.
> I jumped onto this issue to make sure it wasn't caused by the dlpack changes.
> Also, I don't think sockeye CI checks against master; it is using 1.4.1.
>
> Anirudh
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi Lai,

I have opened an issue: https://github.com/apache/incubator-mxnet/issues/15297

I came to know about this issue only today, and I have not been monitoring sockeye. I jumped onto this issue to make sure it wasn't caused by the dlpack changes. Also, I don't think sockeye CI checks against master; it is using 1.4.1.

Anirudh

On Thu, Jun 20, 2019 at 6:17 PM Lai Wei wrote:

> Hi,
>
> Could you share which test failed and what the crash is? How can it be
> reproduced?
>
> I was able to install sockeye, and all tests passed using
> python setup.py test
>
> I tested both the nightly pip package and 1.5.0.rc1.
>
> It would be great to create an issue with reproducible steps and move the
> discussion there.
>
> Also, I see the sockeye nightly build [1] has been failing for some time. If
> that is due to an MXNet change, please raise it early so we can track and
> solve it in time, rather than blocking the release during the vote.
>
> [1] https://travis-ci.org/awslabs/sockeye
>
> --
> Best Regards
>
> Lai
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi,

Could you share which test failed and what the crash is? How can it be reproduced?

I was able to install sockeye, and all tests passed using python setup.py test

I tested both the nightly pip package and 1.5.0.rc1.

It would be great to create an issue with reproducible steps and move the discussion there.

Also, I see the sockeye nightly build [1] has been failing for some time. If that is due to an MXNet change, please raise it early so we can track and solve it in time, rather than blocking the release during the vote.

[1] https://travis-ci.org/awslabs/sockeye

On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian wrote:

> I was able to reproduce a crash with the commit
> 09202f7f261954383aa387144524d38f83f18d06 but not with the commit
> a862270beb2d796c1ba311183f7f4a766a18ad6c.
>
> Anirudh

--
Best Regards

Lai
Re: OMP
As already proposed, I think the easiest way to get a common understanding is to start with a few Docker containers. Pedro, would it be possible for you to wrap your benchmarks into a few containers that reproduce the results you showed? That way, we can avoid possible misunderstandings and also pinpoint exactly where people disagree or have misunderstood each other.

-Marco

Pedro Larroy wrote on Thu, 20 June 2019, 21:47:

> I can confirm that we are linking with two versions of OMP. I'm gaining more clarity on this topic, but I still have questions. The facts I have so far are the following:
>
> * #1: We are linking with two versions of OMP, Intel's OMP and LLVM OpenMP, when building with MKL enabled.
> * #2: We have 3 different possible OMP versions: Intel OMP (comes with MKL), LLVM OpenMP (3rdparty/openmp), and libgomp (comes with gcc; this is the one used in the PR proposed by Anton).
>
> Questions:
>
> * #1 Is it OK to have two versions of OpenMP linked at the same time?
> * #2 Which implementation of OMP gives the best performance? (See the total training time in my measurements for a partial answer.)
> * #3 Should we have a build flag so we can choose the OMP version at runtime?
> * #4 Which compiler and build flags did Chris use to get the 10x slowdown?
> * #5 @Stas: is there a script to replicate your benchmarks easily? If so, could you provide a link? I think we would need to reproduce your benchmarks and verify which versions are being linked. It's possible that, while compiling with MKL, Intel's OMP was pulled in instead of GNU OpenMP.
> * #6 @Chris: how should we maintain the copy of LLVM's OpenMP? Should we update the subrepo regularly?
>
> My conclusions so far:
>
> * #1 We should avoid linking two versions of OMP if possible and allow users to choose one in the build, as we do for BLAS.
> * #2 For performance reasons, and for more control across different compiler versions, it does seem to make sense to keep the LLVM OpenMP version in 3rdparty for now. So unless more data is gathered, it makes sense not to remove it as of now.
> * #3 We should provide build options to choose which OpenMP library is used from the three options available, including libgomp.
> * #4 By refining the build we could also enable OpenMP on Mac without additional contortions (doesn't work as of today): https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> * #5 We should add the different OMP versions to our benchmarks and track their performance, so this data is available for prescribing the best build options and for binary releases.
>
> This is also an interesting related GitHub issue posted in the mkl-dnn repository: https://github.com/intel/mkl-dnn/issues/230
>
> I don't observe the order-of-magnitude divergence in samples/s reported by Chris on vanilla Ubuntu 18.04, but the full training does finish faster with the OMP from 3rdparty (LLVM OpenMP) than with libgomp.
>
> There are also differences in training time when using MKL and the , it's actually a bit slower; I don't know if it's related to OMP.
>
> gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
>
> Anton's branch: g...@github.com:lebeg/incubator-mxnet.git branch 'omp'
>
> (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd build/libmxnet.so |grep -i omp
> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x7fd99a51d000)
>
> time python train_mnist.py
>
> INFO:root:Epoch[18] Validation-accuracy=0.984176
> INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 samples/sec accuracy=1.00
> INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec accuracy=0.999531
> INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec accuracy=0.999687
> INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec accuracy=1.00
> INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec accuracy=0.999687
> INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 samples/sec accuracy=0.999687
> INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 samples/sec accuracy=0.999375
> INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 samples/sec accuracy=0.999844
> INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 samples/sec accuracy=0.999531
> INFO:root:Epoch[19] Train-accuracy=0.999717
> INFO:root:Epoch[19] Time cost=1.219
> INFO:root:Epoch[19] Validation-accuracy=0.983977
> 1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata 1146052maxresident)k
> 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps
>
> Master, MKL ON:
>
> (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd ../../build/libmxnet.so | grep -i omp
> libomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x7f05ba38f000)
> libiomp5.so => /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so (0x7f05b09f4000)
>
> INFO:root:Epoch[18]
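To make benchmark runs from different containers easier to compare, the per-batch throughput lines in the training log could be reduced to a few numbers per run. The following is a minimal sketch, assuming the log lines keep the "Speed: N samples/sec" shape shown in this thread; the function names are made up for illustration.

```python
import re
from statistics import mean

# Matches the throughput field in lines such as
# "INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 samples/sec accuracy=1.00"
SPEED_RE = re.compile(r"Speed:\s*(\d+(?:\.\d+)?)\s*samples/sec")


def batch_speeds(log_text):
    """Extract every samples/sec value from a training log."""
    return [float(value) for value in SPEED_RE.findall(log_text)]


def summarize_speeds(log_text):
    """Return count, mean, min and max throughput, or None if nothing matched."""
    speeds = batch_speeds(log_text)
    if not speeds:
        return None
    return {
        "batches": len(speeds),
        "mean": mean(speeds),
        "min": min(speeds),
        "max": max(speeds),
    }


sample_log = (
    "INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 samples/sec accuracy=1.00\n"
    "INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec accuracy=0.999531\n"
)
summary = summarize_speeds(sample_log)
```

Per-batch throughput is only part of the picture; the thread also compares total wall-clock time from the time command, so a real comparison would record both.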
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
I was able to reproduce a crash with the commit 09202f7f261954383aa387144524d38f83f18d06 but not with the commit a862270beb2d796c1ba311183f7f4a766a18ad6c.

Anirudh

On Thu, Jun 20, 2019 at 3:53 PM Lai Wei wrote:

> Hi Przemyslaw,
>
> Is there an issue with more details to track the problem?
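Narrowing a regression down to one commit, as done by hand here, is a binary search over the commit history. The sketch below shows that search in isolation. It assumes the commits are ordered oldest to newest and that every commit after the first bad one is also bad; the commit names and the is_bad predicate are placeholders, and in practice each check would mean building MXNet at that commit and running the sockeye tests against it. git bisect automates the same procedure.

```python
def first_bad_commit(commits, is_bad):
    """Return the earliest commit for which is_bad() is true, or None.

    `commits` must be ordered oldest to newest, and once a commit is bad
    all later commits are assumed to be bad as well.
    """
    if not commits or not is_bad(commits[-1]):
        return None  # the newest commit is fine, nothing to find
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid + 1  # the first bad commit is after mid
    return commits[lo]


# Placeholder history: the crash is introduced in "c3".
history = ["c0", "c1", "c2", "c3", "c4"]
culprit = first_bad_commit(history, lambda c: c in {"c3", "c4"})
```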
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Hi Przemyslaw,

Is there an issue with more details to track the problem?

On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak wrote:

> -1
>
> There is a crash in the sockeye unit tests (python setup.py test), observed
> starting with the nightly 1.5 build from 6/13 and still occurring in 1.5rc1. I
> don't yet have the exact commit that is responsible for it, but it is
> either a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or
> 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).

--
Best Regards

Lai
Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
-1

There is a crash in the sockeye unit tests (python setup.py test), observed starting with the nightly 1.5 build from 6/13 and still occurring in 1.5rc1. I don't yet have the exact commit that is responsible for it, but it is either a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack related) or 09202f7f261954383aa387144524d38f83f18d06 (cached op optimization).

On 2019/06/20 06:36:22, Lai Wei wrote:

> Dear MXNet community,
>
> This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0.
> Voting on dev@ will start June 19, 23:59:59 (PST) and close on June 22,
> 23:59:59.
>
> 1) Link to release notes:
> https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
>
> 2) Link to release candidate:
> https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1
>
> 3) Link to source and signatures on apache dist server:
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/
>
> Please remember to TEST first before voting accordingly:
>
> +1 = approve
> +0 = no opinion
> -1 = disapprove (provide reason)
> --
> Best Regards
>
> Lai
Re: OMP
I can confirm that we are linking with two versions of OMP. I'm gaining more clarity on this topic, but I still have questions. The facts I have so far are the following:

* #1: We are linking with two versions of OMP, Intel's OMP and LLVM OpenMP, when building with MKL enabled.
* #2: We have 3 different possible OMP versions: Intel OMP (comes with MKL), LLVM OpenMP (3rdparty/openmp), and libgomp (comes with gcc; this is the one used in the PR proposed by Anton).

Questions:

* #1 Is it OK to have two versions of OpenMP linked at the same time?
* #2 Which implementation of OMP gives the best performance? (See the total training time in my measurements for a partial answer.)
* #3 Should we have a build flag so we can choose the OMP version at runtime?
* #4 Which compiler and build flags did Chris use to get the 10x slowdown?
* #5 @Stas: is there a script to replicate your benchmarks easily? If so, could you provide a link? I think we would need to reproduce your benchmarks and verify which versions are being linked. It's possible that, while compiling with MKL, Intel's OMP was pulled in instead of GNU OpenMP.
* #6 @Chris: how should we maintain the copy of LLVM's OpenMP? Should we update the subrepo regularly?

My conclusions so far:

* #1 We should avoid linking two versions of OMP if possible and allow users to choose one in the build, as we do for BLAS.
* #2 For performance reasons, and for more control across different compiler versions, it does seem to make sense to keep the LLVM OpenMP version in 3rdparty for now. So unless more data is gathered, it makes sense not to remove it as of now.
* #3 We should provide build options to choose which OpenMP library is used from the three options available, including libgomp.
* #4 By refining the build we could also enable OpenMP on Mac without additional contortions (doesn't work as of today): https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
* #5 We should add the different OMP versions to our benchmarks and track their performance, so this data is available for prescribing the best build options and for binary releases.

This is also an interesting related GitHub issue posted in the mkl-dnn repository: https://github.com/intel/mkl-dnn/issues/230

I don't observe the order-of-magnitude divergence in samples/s reported by Chris on vanilla Ubuntu 18.04, but the full training does finish faster with the OMP from 3rdparty (LLVM OpenMP) than with libgomp.

There are also differences in training time when using MKL and the , it's actually a bit slower; I don't know if it's related to OMP.

gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)

Anton's branch: g...@github.com:lebeg/incubator-mxnet.git branch 'omp'

(py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd build/libmxnet.so |grep -i omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x7fd99a51d000)

time python train_mnist.py

INFO:root:Epoch[18] Validation-accuracy=0.984176
INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 samples/sec accuracy=1.00
INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec accuracy=0.999531
INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec accuracy=1.00
INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 samples/sec accuracy=0.999375
INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 samples/sec accuracy=0.999844
INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 samples/sec accuracy=0.999531
INFO:root:Epoch[19] Train-accuracy=0.999717
INFO:root:Epoch[19] Time cost=1.219
INFO:root:Epoch[19] Validation-accuracy=0.983977
1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata 1146052maxresident)k
0inputs+0outputs (0major+3496364minor)pagefaults 0swaps

Master, MKL ON:

(py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd ../../build/libmxnet.so | grep -i omp
libomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x7f05ba38f000)
libiomp5.so => /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so (0x7f05b09f4000)

INFO:root:Epoch[18] Validation-accuracy=0.982484
INFO:root:Epoch[19] Batch [0-100] Speed: 36651.63 samples/sec accuracy=0.999691
INFO:root:Epoch[19] Batch [100-200] Speed: 45093.98 samples/sec accuracy=0.999844
INFO:root:Epoch[19] Batch [200-300] Speed: 45146.84 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400] Speed: 45119.90 samples/sec accuracy=0.999687
INFO:root:Epoch[19] Batch [400-500] Speed: 44998.96 samples/sec accuracy=0.999531
INFO:root:Epoch[19] Batch [500-600] Speed: 45072.25 samples/sec accuracy=0.999844
INFO:root:Epoch[19] Batch [600-700] Speed: 44969.79 samples/sec
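The ldd ... | grep -i omp check used in this message could be scripted so that a build is flagged whenever more than one OpenMP runtime is linked. The sketch below works on the text output of ldd and recognizes only the three runtimes named in this thread (libgomp, libomp, libiomp5); the function name and the display labels are made up for illustration.

```python
import re

# Library base names from the thread, mapped to a label for display.
OPENMP_RUNTIMES = {
    "libgomp": "GNU libgomp",
    "libomp": "LLVM OpenMP",
    "libiomp5": "Intel OpenMP",
}

# Captures the base name of a shared library at the start of an ldd line,
# e.g. "libgomp" from "libgomp.so.1 => /usr/lib/... (0x...)".
LIB_RE = re.compile(r"\s*(lib[\w+-]+?)\.so")


def linked_openmp_runtimes(ldd_output):
    """Return the sorted labels of the OpenMP runtimes found in ldd output."""
    found = set()
    for line in ldd_output.splitlines():
        match = LIB_RE.match(line)
        if match and match.group(1) in OPENMP_RUNTIMES:
            found.add(OPENMP_RUNTIMES[match.group(1)])
    return sorted(found)


# ldd output copied from the "Master, MKL ON" build above.
mkl_build = (
    "\tlibomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x7f05ba38f000)\n"
    "\tlibiomp5.so => /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so (0x7f05b09f4000)\n"
)
runtimes = linked_openmp_runtimes(mkl_build)
has_conflict = len(runtimes) > 1
```

In a CI job the input would come from running ldd on libmxnet.so, and a non-empty conflict could fail the build or simply be logged next to the benchmark numbers.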
[VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
Dear MXNet community,

This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0. Voting on dev@ will start June 19, 23:59:59 (PST) and close on June 22, 23:59:59.

1) Link to release notes:
https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes

2) Link to release candidate:
https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc1

3) Link to source and signatures on apache dist server:
https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc1/

Please remember to TEST first before voting accordingly:

+1 = approve
+0 = no opinion
-1 = disapprove (provide reason)

--
Best Regards

Lai
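For illustration, the three vote values could be tallied as sketched below. The pass rule in the code (at least three binding +1 votes and more binding +1 than binding -1 votes) is my reading of the general Apache release policy, not something stated in this thread, and the thread does not say which votes are binding. The votes in the example are hypothetical.

```python
VALID_VOTES = ("+1", "+0", "-1")


def tally_votes(votes):
    """Count votes and apply an assumed majority-approval rule.

    `votes` is an iterable of (value, binding) pairs, where value is one of
    "+1", "+0", "-1" and binding is a bool.
    """
    counts = {value: 0 for value in VALID_VOTES}
    binding = {value: 0 for value in VALID_VOTES}
    for value, is_binding in votes:
        if value not in counts:
            raise ValueError(f"unknown vote value: {value!r}")
        counts[value] += 1
        if is_binding:
            binding[value] += 1
    # Assumed rule: three or more binding +1, and more binding +1 than -1.
    passed = binding["+1"] >= 3 and binding["+1"] > binding["-1"]
    return counts, passed


# Hypothetical ballot, not the actual votes cast on this release.
example = [("+1", True), ("+1", True), ("+1", True), ("-1", False), ("+0", False)]
example_counts, example_passed = tally_votes(example)
```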