Re: Profiler Broken?

2020-05-28 Thread Pedro Larroy
Yes, the profiler does seem to be broken, or at least has some concurrency
issues; I have seen corrupted profile results.
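
For reference, a minimal sketch of the in-code profiler usage being discussed
(this assumes the standard mxnet.profiler API; the file name and the dummy
workload are placeholders, and MXNET_PROFILER_AUTOSTART=1 is the
environment-variable route):

    import mxnet as mx
    from mxnet import profiler

    # Configure where the trace goes and what to record.
    profiler.set_config(profile_all=True, aggregate_stats=True,
                        filename='profile_output.json')
    profiler.set_state('run')        # start collecting events

    x = mx.nd.random.uniform(shape=(1024, 1024))
    y = mx.nd.dot(x, x)
    y.wait_to_read()                 # make sure the async work actually ran

    profiler.set_state('stop')       # stop collecting
    profiler.dump()                  # write the JSON that chrome://tracing loads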

On Thu, May 28, 2020 at 12:30 PM Naveen Swamy  wrote:

> I am attempting to profile one of our models, I used the profiler.state to
> run/stop in code and also used the environment variables to autostart the
> profiler. It creates a 600MB json file, however when I view in chrome
> tracing it comes out to be blank screen (loading seems to be fine, didn't
> get any errors)
>
> Wondering if anyone has recently tried or if aware of profiler being
> broken?
>
> ENVIRON: Ubuntu 18.04
> MXNet : mxnet-cu101mkl
> Deep Learning AMI (Ubuntu 18.04) Version 29.0 (ami-043f9aeaf108ebc37)
>
> Thanks, Naveen
>


Re: Workflow proposal

2020-03-17 Thread Pedro Larroy
The idea is that it would be rolled back automatically to the previous
successful nightly. PRs would then be rebased and would address that nightly
test failure. This also ties in with the manual trigger for CI, which could
likewise be used to re-run the nightly or benchmark jobs.
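
To make the mechanics concrete, here is a rough sketch of what such an
automated nightly promotion job could look like (entirely hypothetical: the
branch names, the run_nightly_tests.sh script and the push step are
assumptions, not an existing pipeline):

    #!/usr/bin/env python3
    """Hypothetical nightly promotion job: fast-forward master to the tested
    dev commit when the nightly suite passes, otherwise leave master at the
    last successful nightly."""
    import subprocess
    import sys

    def git(*args):
        # Run a git command and fail loudly if it errors.
        return subprocess.run(["git", *args], check=True)

    def nightly_passed():
        # Placeholder for triggering/polling the nightly and benchmark jobs.
        return subprocess.run(["./run_nightly_tests.sh"]).returncode == 0

    if __name__ == "__main__":
        git("fetch", "origin", "dev", "master")
        if nightly_passed():
            # Promote: push the commit that passed the nightly to master.
            git("push", "origin", "origin/dev:master")
        else:
            # No promotion; master stays at the last successful nightly and
            # PR authors rebase on top of it.
            sys.exit("nightly failed: master not promoted")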

On Mon, Mar 16, 2020 at 11:53 AM Marco de Abreu 
wrote:

> Considering how unstable our PR as well as our nightly jobs have been so
> far, is that an assumption we can rightfully make? Also, who'd be
> responsible for fixing that branch in case a PR actually breaks a nightly
> test?
>
> -Marco
>
> On Mon, Mar 16, 2020 at 7:41 PM Pedro Larroy  >
> wrote:
>
> > The original idea is that the promotion to the other branch is automated
> by
> > nightly CI, so it shouldn't have those problems that are mentioned, so
> > there shouldn't be any manual merging on that branch.
> >
> > On Wed, Mar 11, 2020 at 7:43 PM Chris Olivier 
> > wrote:
> >
> > > My $0.02
> > >
> > > We had this model dual-branch when I was at GE and it was problematic.
> > > Among other things, the two branches would tend to diverge to a large
> > > degree and you ended up just cherry-picking in stuff here and there,
> > which
> > > caused even more problems, as well as the model allows the secondary
> > branch
> > > to get pretty buggy -- human nature being what it is -- to the point
> > where
> > > it's difficult to merge it into master without freezing them both and
> > > stabilizing, merging into master, then stabilizing again (small things
> > > almost certainly went into master in the meantime -- hotfixes, critical
> > > features, etc, while everything was on hold stabilizing the secondary
> > > branch).  It just double the work in the end, is what I experienced.
> > >
> > > On Wed, Mar 11, 2020 at 5:47 PM Yuan Tang 
> > wrote:
> > >
> > > > Second to not introduce a dev branch. We should try to improve our
> > > release
> > > > process instead and avoid another branch that may introduce confusion
> > > > around the source of truth.
> > > >
> > > > On Wed, Mar 11, 2020 at 8:39 PM Tianqi Chen <
> tqc...@cs.washington.edu>
> > > > wrote:
> > > >
> > > > > While the idea of staging seems to be reasonable.
> > > > > Most OSS projects choose not to do so because having a complicated
> > > > staging
> > > > > will likely confuse the contributors, and increase the change of
> > > > > divergence(between dev and master).
> > > > >
> > > > > Given that we have a release model, so in some sense the release
> > itself
> > > > > serves as a staging pt.
> > > > > A good approach would simply setup the nightly if necessary strive
> to
> > > fix
> > > > > regressions and make sure the formal release addresses the issues.
> > > > >
> > > > > TQ
> > > > >
> > > > > On Wed, Mar 11, 2020 at 5:32 PM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > I talk to some people about this and they thought it would be a
> > good
> > > > > idea,
> > > > > > so sharing it here:
> > > > > >
> > > > > > I would propose to use a staging or "dev" branch into which
> > nightly &
> > > > > > performance tests are done periodically and then this branch is
> > > merged
> > > > to
> > > > > > master. The goal of this workflow would be to avoid having
> > > regressions
> > > > > and
> > > > > > nightly failures creeping into master. PRs would get merged into
> > dev
> > > > and
> > > > > > dev promoted periodically / nightly into master.
> > > > > >
> > > > > > The names can be swapped as well, between dev and master, so PRS
> > get
> > > > > merged
> > > > > > into master and it doesn't change the workflow, and staging is
> the
> > > > branch
> > > > > > where nightly changes are merged to.
> > > > > >
> > > > > > Have this been considered?
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Yuan Tang
> > > > https://terrytangyuan.github.io/about/ <
> > http://twitter.com/TerryTangYuan
> > > >
> > > > <https://terrytangyuan.github.io/about/>
> > > >
> > >
> >
>


Re: Workflow proposal

2020-03-16 Thread Pedro Larroy
The original idea is that promotion to the other branch is automated by
nightly CI, so it shouldn't have the problems mentioned: there would be no
manual merging on that branch.

On Wed, Mar 11, 2020 at 7:43 PM Chris Olivier  wrote:

> My $0.02
>
> We had this model dual-branch when I was at GE and it was problematic.
> Among other things, the two branches would tend to diverge to a large
> degree and you ended up just cherry-picking in stuff here and there, which
> caused even more problems, as well as the model allows the secondary branch
> to get pretty buggy -- human nature being what it is -- to the point where
> it's difficult to merge it into master without freezing them both and
> stabilizing, merging into master, then stabilizing again (small things
> almost certainly went into master in the meantime -- hotfixes, critical
> features, etc, while everything was on hold stabilizing the secondary
> branch).  It just double the work in the end, is what I experienced.
>
> On Wed, Mar 11, 2020 at 5:47 PM Yuan Tang  wrote:
>
> > Second to not introduce a dev branch. We should try to improve our
> release
> > process instead and avoid another branch that may introduce confusion
> > around the source of truth.
> >
> > On Wed, Mar 11, 2020 at 8:39 PM Tianqi Chen 
> > wrote:
> >
> > > While the idea of staging seems to be reasonable.
> > > Most OSS projects choose not to do so because having a complicated
> > staging
> > > will likely confuse the contributors, and increase the change of
> > > divergence(between dev and master).
> > >
> > > Given that we have a release model, so in some sense the release itself
> > > serves as a staging pt.
> > > A good approach would simply setup the nightly if necessary strive to
> fix
> > > regressions and make sure the formal release addresses the issues.
> > >
> > > TQ
> > >
> > > On Wed, Mar 11, 2020 at 5:32 PM Pedro Larroy <
> > pedro.larroy.li...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > I talk to some people about this and they thought it would be a good
> > > idea,
> > > > so sharing it here:
> > > >
> > > > I would propose to use a staging or "dev" branch into which nightly &
> > > > performance tests are done periodically and then this branch is
> merged
> > to
> > > > master. The goal of this workflow would be to avoid having
> regressions
> > > and
> > > > nightly failures creeping into master. PRs would get merged into dev
> > and
> > > > dev promoted periodically / nightly into master.
> > > >
> > > > The names can be swapped as well, between dev and master, so PRS get
> > > merged
> > > > into master and it doesn't change the workflow, and staging is the
> > branch
> > > > where nightly changes are merged to.
> > > >
> > > > Have this been considered?
> > > >
> > > > Pedro.
> > > >
> > >
> >
> >
> > --
> > Yuan Tang
> > https://terrytangyuan.github.io/about/ <http://twitter.com/TerryTangYuan
> >
> > <https://terrytangyuan.github.io/about/>
> >
>


Workflow proposal

2020-03-11 Thread Pedro Larroy
Hi

I talked to some people about this and they thought it would be a good idea,
so I'm sharing it here:

I would propose using a staging or "dev" branch on which nightly and
performance tests are run periodically, and which is then merged to master.
The goal of this workflow would be to avoid regressions and nightly failures
creeping into master. PRs would get merged into dev, and dev would be
promoted periodically (nightly) into master.

The names can also be swapped between dev and master, so PRs would still get
merged into master and the day-to-day workflow wouldn't change, with staging
being the branch into which the nightly-validated changes are merged.

Has this been considered?

Pedro.


Re: New AMIs for CI

2020-02-21 Thread Pedro Larroy
CI is back to normal. We haven't updated Windows AMIs due to issues with
GPU unit tests.

You might need to retrigger your PRs.

Thanks for your patience.

On Wed, Feb 19, 2020 at 5:54 PM Pedro Larroy 
wrote:

> I reverted the CI rollout due to the following issues:
>
> https://github.com/apache/incubator-mxnet/issues/17633
>
> https://github.com/apache/incubator-mxnet/issues/17635
>
> I would need help from the community to fix them as we can't even compile
> in debug mode in windows as the above, and also due to older cmake being
> used in vs2017.
>
> For updating to vs2019 we would need to update cuda.
>
> Pedro.
>
>
>
> On Tue, Feb 18, 2020 at 5:31 PM Pedro Larroy 
> wrote:
>
>> Hi
>>
>> Tomorrow I will be updating the CI environment with new AMIs, and
>> deploying updated autoscaling logic with fixes, expect some disruptions in
>> CI runs.
>>
>> The Linux AMIs will be updated to Ubuntu 18.04 with updated GPU drivers,
>> this won't affect Linux container builds.
>>
>> The new Windows AMI comes with a reproducible environment, VS2017, Visual
>> C++ updated from VC14 to VC15.
>>
>> CMake 3.16.2, Perl and LLVM which are required for MXNet and TVM. Cuda is
>> still 9.2, but now it's easier to update as the installation is automated.
>>
>>  Once the environment is updated, my PR needs to be merged to bring back
>> windows compilation in working order:
>>
>> https://github.com/apache/incubator-mxnet/pull/17206
>>
>> Thanks to Leonard and Joe for helping with various issues.
>>
>> Pedro.
>>
>


Re: New AMIs for CI

2020-02-19 Thread Pedro Larroy
I reverted the CI rollout due to the following issues:

https://github.com/apache/incubator-mxnet/issues/17633

https://github.com/apache/incubator-mxnet/issues/17635

I would need help from the community to fix these, as we can't even compile
in debug mode on Windows (see the issues above), and also because of the
older CMake bundled with VS2017.

Updating to VS2019 would require updating CUDA as well.

Pedro.



On Tue, Feb 18, 2020 at 5:31 PM Pedro Larroy 
wrote:

> Hi
>
> Tomorrow I will be updating the CI environment with new AMIs, and
> deploying updated autoscaling logic with fixes, expect some disruptions in
> CI runs.
>
> The Linux AMIs will be updated to Ubuntu 18.04 with updated GPU drivers,
> this won't affect Linux container builds.
>
> The new Windows AMI comes with a reproducible environment, VS2017, Visual
> C++ updated from VC14 to VC15.
>
> CMake 3.16.2, Perl and LLVM which are required for MXNet and TVM. Cuda is
> still 9.2, but now it's easier to update as the installation is automated.
>
>  Once the environment is updated, my PR needs to be merged to bring back
> windows compilation in working order:
>
> https://github.com/apache/incubator-mxnet/pull/17206
>
> Thanks to Leonard and Joe for helping with various issues.
>
> Pedro.
>


New AMIs for CI

2020-02-18 Thread Pedro Larroy
Hi

Tomorrow I will be updating the CI environment with new AMIs and deploying
updated autoscaling logic with fixes; expect some disruptions in CI runs.

The Linux AMIs will be updated to Ubuntu 18.04 with updated GPU drivers,
this won't affect Linux container builds.

The new Windows AMI comes with a reproducible environment: VS2017, with
Visual C++ updated from VC14 to VC15.

It also includes CMake 3.16.2, Perl and LLVM, which are required for MXNet
and TVM. CUDA is still 9.2, but it is now easier to update since the
installation is automated.

Once the environment is updated, my PR needs to be merged to bring Windows
compilation back into working order:

https://github.com/apache/incubator-mxnet/pull/17206

Thanks to Leonard and Joe for helping with various issues.

Pedro.


Re: Cuda 10.2 Wheels

2020-02-17 Thread Pedro Larroy
I would suggest updating the pip page description or the website with a link
to the new distribution channel. Right now it's ungoogleable how to find the
pre-release wheels; Google directs users to pip. If I find it confusing, I
can't imagine how a random user feels.

On Tue, Feb 11, 2020 at 7:25 PM Sheng Zha  wrote:

> Thanks for bringing this up. That table is misleading and is not an
> acceptable solution for a static reference of the latest pre-releases (more
> in [1]). I’m currently working on the replacement that provides similar
> experiences as pytorch nightly builds page.
>
> -sz
>
> [1]
> https://github.com/apache/incubator-mxnet/issues/17537#issuecomment-584683578
>
>
> > On Feb 11, 2020, at 10:06 PM, Lv, Tao A  wrote:
> >
> > Hi Sheng,
> >
> > It seems the top latest build table is not well updated. I see there are
> 2020-2-12 builds for different variants but the latest build are still
> 2020-2-10 - the build date is not reflected in the link but can be got
> through `pip list`.
> >
> > Thanks,
> > -tao
> >
> > -Original Message-
> > From: Sheng Zha 
> > Sent: Tuesday, February 11, 2020 11:37 PM
> > To: d...@mxnet.apache.org
> > Subject: Re: Cuda 10.2 Wheels
> >
> > The static page is now accessible from
> https://repo.mxnet.io/dist/index.html. Note that the previous links may
> have been moved as part of reorganizing the file store namespaces. Please
> refer to the latest page.
> >
> > -sz
> >
> >> On 2020/02/06 23:21:21, Alfredo Luque 
> wrote:
> >> Looks like it updated since I last posted. Thanks!
> >>
> >> On February 6, 2020 at 3:20:34 PM, Pedro Larroy (
> >> pedro.larroy.li...@gmail.com) wrote:
> >>
> >> Hi Alfredo.
> >>
> >> Isn't "mxnet_cu102mkl-1.6.0
> >> <
> >>
> https://repo.mxnet.io/dist/mxnet_cu102mkl-1.6.0-py2.py3-none-manylinux1_x86_64.whl
> >"
> >>
> >> what you are looking for? I see it on the second link you posted.
> >>
> >> Pedro
> >>
> >> On Tue, Feb 4, 2020 at 3:29 PM Alfredo Luque
> >>  wrote:
> >>
> >>> Hi folks,
> >>>
> >>> Are there any blockers on releasing CUDA 10.2 compatible wheels?
> >>> Based on this readme <
> >>>
> >> https://github.com/apache/incubator-mxnet/blob/master/tools/pip/doc/CU
> >> 102_ADDITIONAL.md
> >>>>
> >>> the
> >>> packages should be available on PyPi already but they don’t appear
> >>> to
> >> exist
> >>> yet.
> >>>
> >>> On the other thread, someone posted this static page
> >>> <https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/index.html>
> >>> that
> >> has
> >>> nightly builds hosted on S3 but it appears CUDA 10.2 wheels aren’t
> >>> on there.
> >>>
> >>> —
> >>> Alfredo Luque
> >>> Software Engineer
> >>> Machine Learning Infrastructure
> >>> Airbnb
> >>> San Francisco, CA
> >>>
> >>
> >> —
> >> Alfredo Luque
> >> Software Engineer
> >> Machine Learning Infrastructure
> >> Airbnb
> >> San Francisco, CA
> >>
>


Re: Join request for MXNet Swift support

2020-02-10 Thread Pedro Larroy
Welcome Rahul! Excited to have you join us.

I was wondering how fast and effective it is, and what options exist, to call
from Python into Swift, and from Swift into C, to execute the dataflow graph
or call into operators. There was a thread a while back about
microbenchmarking calls into the C++ engine from Python using different
methods. I'm not sure if you have done any experiments in that direction.

Pedro.
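
For context, the microbenchmarks I'm referring to were along these lines:
time how long a trivial call into the engine takes from Python (a rough
sketch; the operation and iteration count are arbitrary, and this measures
call overhead plus the tiny kernel itself):

    import time
    import mxnet as mx

    x = mx.nd.ones((1,))
    N = 10000

    start = time.perf_counter()
    for _ in range(N):
        y = x + 1       # each iteration crosses the Python -> C API -> engine boundary
    mx.nd.waitall()     # flush the async engine before stopping the clock
    elapsed = time.perf_counter() - start
    print("average per-call cost: %.1f us" % (elapsed / N * 1e6))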

On Mon, Feb 10, 2020 at 3:57 AM Tao Lv  wrote:

> Hi Rahul,
>
> Invite is sent to rahulbhal...@protonmail.com. Welcome to the community
> and
> looking forward to your contribution.
>
> -tao
>
> On Mon, Feb 10, 2020 at 1:10 PM Rahul  .invalid>
> wrote:
>
> > Hello,
> >
> > As per the conversation with [Pedro Larroy](https://twitter.com/plarroy)
> > on [Twitter thread](
> https://twitter.com/plarroy/status/1226408543621771264)
> > I would like to join this Slack channel for contributing to MXNet in
> Swift.
> >
> > Regards
> > Rahul Bhalley
> > [ORCID](https://orcid.org/-0002-4574-0390)
>


Re: Cuda 10.2 Wheels

2020-02-06 Thread Pedro Larroy
Hi Alfredo.

Isn't "mxnet_cu102mkl-1.6.0
"
what you are looking for? I see it on the second link you posted.

Pedro
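
For anyone picking up that wheel, a quick post-install sanity check could
look like this (a sketch: the pip command is shown as a comment, and it
assumes the mxnet.runtime feature API present in 1.6 builds):

    # Illustrative install command for the pre-release wheel:
    #   pip install https://repo.mxnet.io/dist/mxnet_cu102mkl-1.6.0-py2.py3-none-manylinux1_x86_64.whl
    import mxnet as mx
    from mxnet.runtime import Features

    print(mx.__version__)          # should report 1.6.0
    feats = Features()
    # Confirm the variant you think you installed is actually enabled.
    print(feats.is_enabled('CUDA'), feats.is_enabled('MKLDNN'))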

On Tue, Feb 4, 2020 at 3:29 PM Alfredo Luque
 wrote:

> Hi folks,
>
> Are there any blockers on releasing CUDA 10.2 compatible wheels? Based on
> this
> readme
> <
> https://github.com/apache/incubator-mxnet/blob/master/tools/pip/doc/CU102_ADDITIONAL.md
> >
> the
> packages should be available on PyPi already but they don’t appear to exist
> yet.
>
> On the other thread, someone posted this static page
>  that has
> nightly builds hosted on S3 but it appears CUDA 10.2 wheels aren’t on
> there.
>
> —
> Alfredo Luque
> Software Engineer
> Machine Learning Infrastructure
> Airbnb
> San Francisco, CA
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2

2020-02-04 Thread Pedro Larroy
Hi Przemek

I'm fine if we add it to the release notes and try to fix it for the next
release. Changing my vote to +1

Pedro.

On Mon, Feb 3, 2020 at 7:42 PM Pedro Larroy 
wrote:

>
> -1
>
> Unit tests passed in CPU build.
>
> I observe crashes related to openmp using cpp unit tests:
>
> https://github.com/apache/incubator-mxnet/issues/17043
>
> Pedro.
>
> On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat 
> wrote:
>
>> +1
>> Successfully built MXNet 1.6.0rc2 on Linux
>> Tested for OpPerf utility
>> For CPU -
>> https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
>>
>> Works well!
>>
>>
>>
>> On Mon, 3 Feb 2020 at 15:43, Lin Yuan  wrote:
>>
>> > +1
>> >
>> > Tested Horovod with mnist example. My compiler flags are below:
>> >
>> > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔
>> CPU_SSE2, ✔
>> > CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖
>> CPU_AVX2, ✔
>> > OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖
>> BLAS_MKL, ✖
>> > BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔
>> > DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖ DEBUG, ✖
>> > TVM_OP]
>> >
>> > Lin
>> >
>> > On Sat, Feb 1, 2020 at 9:55 PM Tao Lv  wrote:
>> >
>> > > +1
>> > >
>> > > I tested below items:
>> > > 1. download artifacts from Apache dist repo;
>> > > 2. the signature looks good;
>> > > 3. build from source code with MKL-DNN and MKL on centos;
>> > > 4. run fp32 and int8 inference of ResNet50 under
>> /example/quantization/.
>> > >
>> > > thanks,
>> > > -tao
>> > >
>> > > On Sun, Feb 2, 2020 at 11:00 AM Tao Lv  wrote:
>> > >
>> > > > I see. I was looking at this page:
>> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
>> > > >
>> > > > On Sun, Feb 2, 2020 at 4:54 AM Przemysław Trędak <
>> ptre...@apache.org>
>> > > > wrote:
>> > > >
>> > > >> Hi Tao,
>> > > >>
>> > > >> Could you tell me where did you look for it and did not find it? I
>> > just
>> > > >> checked and both
>> > > >> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
>> and
>> > > >> draft of the release on GitHub have them.
>> > > >>
>> > > >> Thank you
>> > > >> Przemek
>> > > >>
>> > > >> On 2020/02/01 14:23:11, Tao Lv  wrote:
>> > > >> > It seems the src tar and signature are missing from the tag.
>> > > >> >
>> > > >> > On Fri, Jan 31, 2020 at 11:09 AM Przemysław Trędak <
>> > > ptre...@apache.org>
>> > > >> > wrote:
>> > > >> >
>> > > >> > > Dear MXNet community,
>> > > >> > >
>> > > >> > > This is the vote to release Apache MXNet (incubating) version
>> > 1.6.0.
>> > > >> > > Voting starts today and will close on Monday 2/3/2020 23:59
>> PST.
>> > > >> > >
>> > > >> > > Link to release notes:
>> > > >> > >
>> > > https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
>> > > >> > >
>> > > >> > > Link to release candidate:
>> > > >> > >
>> https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
>> > > >> > >
>> > > >> > > Link to source and signatures on apache dist server:
>> > > >> > >
>> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
>> > > >> > >
>> > > >> > > The differences comparing to previous release candidate
>> 1.6.0.rc1:
>> > > >> > >  * Fixes for license issues (#17361, #17375, #17370, #17460)
>> > > >> > >  * Bugfix for saving LSTM layer parameter (#17288)
>> > > >> > >  * Bugfix for downloading the model from model zoo from
>> multiple
>> > > >> processes
>> > > >> > > (#17372)
>> > > >> > >  * Fixed a symbol.py in AMP for GluonNLP (#17408)
>> > > >> > >
>> > > >> > >
>> > > >> > > Please remember to TEST first before voting accordingly:
>> > > >> > > +1 = approve
>> > > >> > > +0 = no opinion
>> > > >> > > -1 = disapprove (provide reason)
>> > > >> > >
>> > > >> > >
>> > > >> > > Best regards,
>> > > >> > > Przemyslaw Tredak
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > >
>> >
>>
>>
>> --
>> *Chaitanya Prakash Bapat*
>> *+1 (973) 953-6299*
>>
>> [image: https://www.linkedin.com//in/chaibapat25]
>> <https://github.com/ChaiBapchya>[image:
>> https://www.facebook.com/chaibapat]
>> <https://www.facebook.com/chaibapchya>[image:
>> https://twitter.com/ChaiBapchya] <https://twitter.com/ChaiBapchya>[image:
>> https://www.linkedin.com//in/chaibapat25]
>> <https://www.linkedin.com//in/chaibapchya/>
>>
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2

2020-02-04 Thread Pedro Larroy
@Chris: If you read the issue that I linked above, you can see that I was
using gdb. Maybe you can have a look into the issue if you have an idea for a
fix. The backtrace points to a segfault in the omp library. The root cause
could be elsewhere and merely surface as undefined behaviour here, but given
that this does not happen with libgomp, and that other engineers believe
mixing OpenMP implementations at runtime can cause UB, it's reasonable to
believe there's a good chance it is related. I personally don't have time to
investigate this further, as I don't think introducing this dependency is
worth the trouble it is causing, when the one provided by the platform works
well enough.

0x743b284a in __kmp_fork_call () from
/home/piotr/mxnet/build/3rdparty/openmp/runtime/src/libomp.so
(gdb) bt


@Lin: I personally wouldn't be comfortable releasing a version that
segfaults; I don't think that meets the quality bar. But this is up to the
community to decide, and I'm only reporting what I observe.

Releasing with indications of this kind of problem causes issues later in
downstream projects and running services.
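
For anyone trying to reproduce or triage this, a small sketch that checks
which OpenMP runtime(s) the built libmxnet.so actually pulls in (the library
path is an assumption and depends on your build tree):

    import subprocess

    # Assumed location of the library; adjust to your build tree.
    LIB = "build/libmxnet.so"

    # List dynamic dependencies and keep anything that looks like an OpenMP runtime.
    out = subprocess.run(["ldd", LIB], capture_output=True, text=True,
                         check=True).stdout
    omp_deps = [line.strip() for line in out.splitlines() if "omp" in line]
    print("\n".join(omp_deps))
    # Seeing both libomp.so (LLVM) and libgomp.so (GCC) here would point at the
    # mixed-runtime situation discussed in this thread.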

On Tue, Feb 4, 2020 at 11:07 AM Chris Olivier  wrote:

> When "fixing", please "fix" through actual root-cause analysis (use gdb,
> for instance) and not simply by guesswork and cutting out things which
> probably aren't actually at fault (blaming an OMP library that's in
> worldwide distribution int he billions should be treated with great
> skepticism).
>
> On Tue, Feb 4, 2020 at 10:44 AM Lin Yuan  wrote:
>
> > Pedro,
> >
> > While I agree with you we need to fix this usability issue, I don't think
> > this is a release blocker as Przemek mentioned above. Could we fix this
> in
> > the next minor release?
> >
> > Thanks,
> >
> > Lin
> >
> > On Tue, Feb 4, 2020 at 10:38 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > Right. Would it be possible to have the CMake build also use libgomp
> for
> > > consistency with the releases until these issues are resolved?
> > > This can affect anyone compiling the distribution with CMake and also
> > > happens randomly in CI, worsening the contributor experience due to CI
> > > failures.
> > >
> > > On Tue, Feb 4, 2020 at 9:33 AM Przemysław Trędak 
> > > wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > From the issue that you linked it seems that you are using the LLVM
> > > > OpenMP, whereas I believe the actual release uses libgomp (at least
> > > that's
> > > > what seems to be the conclusion from this issue:
> > > > https://github.com/apache/incubator-mxnet/issues/16891)?
> > > >
> > > > Przemek
> > > >
> > > > On 2020/02/04 03:42:30, Pedro Larroy 
> > > > wrote:
> > > > > -1
> > > > >
> > > > > Unit tests passed in CPU build.
> > > > >
> > > > > I observe crashes related to openmp using cpp unit tests:
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/issues/17043
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat <
> chai.ba...@gmail.com
> > >
> > > > wrote:
> > > > >
> > > > > > +1
> > > > > > Successfully built MXNet 1.6.0rc2 on Linux
> > > > > > Tested for OpPerf utility
> > > > > > For CPU -
> > > > > >
> > https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
> > > > > >
> > > > > > Works well!
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, 3 Feb 2020 at 15:43, Lin Yuan 
> wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Tested Horovod with mnist example. My compiler flags are below:
> > > > > > >
> > > > > > > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔
> > > > CPU_SSE2,
> > > > > > ✔
> > > > > > > CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖
> > > > > > CPU_AVX2, ✔
> > > > > > > OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖
> > > > > > BLAS_MKL, ✖
> > > > > > > BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ 

Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2

2020-02-04 Thread Pedro Larroy
Right. Would it be possible to have the CMake build also use libgomp for
consistency with the releases until these issues are resolved?
This can affect anyone compiling the distribution with CMake and also
happens randomly in CI, worsening the contributor experience due to CI
failures.

On Tue, Feb 4, 2020 at 9:33 AM Przemysław Trędak  wrote:

> Hi Pedro,
>
> From the issue that you linked it seems that you are using the LLVM
> OpenMP, whereas I believe the actual release uses libgomp (at least that's
> what seems to be the conclusion from this issue:
> https://github.com/apache/incubator-mxnet/issues/16891)?
>
> Przemek
>
> On 2020/02/04 03:42:30, Pedro Larroy 
> wrote:
> > -1
> >
> > Unit tests passed in CPU build.
> >
> > I observe crashes related to openmp using cpp unit tests:
> >
> > https://github.com/apache/incubator-mxnet/issues/17043
> >
> > Pedro.
> >
> > On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat 
> wrote:
> >
> > > +1
> > > Successfully built MXNet 1.6.0rc2 on Linux
> > > Tested for OpPerf utility
> > > For CPU -
> > > https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
> > >
> > > Works well!
> > >
> > >
> > >
> > > On Mon, 3 Feb 2020 at 15:43, Lin Yuan  wrote:
> > >
> > > > +1
> > > >
> > > > Tested Horovod with mnist example. My compiler flags are below:
> > > >
> > > > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔
> CPU_SSE2,
> > > ✔
> > > > CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖
> > > CPU_AVX2, ✔
> > > > OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖
> > > BLAS_MKL, ✖
> > > > BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔
> > > > DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖
> DEBUG, ✖
> > > > TVM_OP]
> > > >
> > > > Lin
> > > >
> > > > On Sat, Feb 1, 2020 at 9:55 PM Tao Lv  wrote:
> > > >
> > > > > +1
> > > > >
> > > > > I tested below items:
> > > > > 1. download artifacts from Apache dist repo;
> > > > > 2. the signature looks good;
> > > > > 3. build from source code with MKL-DNN and MKL on centos;
> > > > > 4. run fp32 and int8 inference of ResNet50 under
> > > /example/quantization/.
> > > > >
> > > > > thanks,
> > > > > -tao
> > > > >
> > > > > On Sun, Feb 2, 2020 at 11:00 AM Tao Lv  wrote:
> > > > >
> > > > > > I see. I was looking at this page:
> > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > > > >
> > > > > > On Sun, Feb 2, 2020 at 4:54 AM Przemysław Trędak <
> ptre...@apache.org
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Hi Tao,
> > > > > >>
> > > > > >> Could you tell me where did you look for it and did not find
> it? I
> > > > just
> > > > > >> checked and both
> > > > > >>
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> > > and
> > > > > >> draft of the release on GitHub have them.
> > > > > >>
> > > > > >> Thank you
> > > > > >> Przemek
> > > > > >>
> > > > > >> On 2020/02/01 14:23:11, Tao Lv  wrote:
> > > > > >> > It seems the src tar and signature are missing from the tag.
> > > > > >> >
> > > > > >> > On Fri, Jan 31, 2020 at 11:09 AM Przemysław Trędak <
> > > > > ptre...@apache.org>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Dear MXNet community,
> > > > > >> > >
> > > > > >> > > This is the vote to release Apache MXNet (incubating)
> version
> > > > 1.6.0.
> > > > > >> > > Voting starts today and will close on Monday 2/3/2020 23:59
> PST.
> > > > > >> > >
> > > > > >> > > Link to release notes:
> > > > > >> > >
> > > > >
> https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
> > >

Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2

2020-02-03 Thread Pedro Larroy
-1

Unit tests passed in CPU build.

I observe crashes related to openmp using cpp unit tests:

https://github.com/apache/incubator-mxnet/issues/17043

Pedro.

On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat  wrote:

> +1
> Successfully built MXNet 1.6.0rc2 on Linux
> Tested for OpPerf utility
> For CPU -
> https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
>
> Works well!
>
>
>
> On Mon, 3 Feb 2020 at 15:43, Lin Yuan  wrote:
>
> > +1
> >
> > Tested Horovod with mnist example. My compiler flags are below:
> >
> > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2,
> ✔
> > CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖
> CPU_AVX2, ✔
> > OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖
> BLAS_MKL, ✖
> > BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔
> > DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖ DEBUG, ✖
> > TVM_OP]
> >
> > Lin
> >
> > On Sat, Feb 1, 2020 at 9:55 PM Tao Lv  wrote:
> >
> > > +1
> > >
> > > I tested below items:
> > > 1. download artifacts from Apache dist repo;
> > > 2. the signature looks good;
> > > 3. build from source code with MKL-DNN and MKL on centos;
> > > 4. run fp32 and int8 inference of ResNet50 under
> /example/quantization/.
> > >
> > > thanks,
> > > -tao
> > >
> > > On Sun, Feb 2, 2020 at 11:00 AM Tao Lv  wrote:
> > >
> > > > I see. I was looking at this page:
> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > >
> > > > On Sun, Feb 2, 2020 at 4:54 AM Przemysław Trędak  >
> > > > wrote:
> > > >
> > > >> Hi Tao,
> > > >>
> > > >> Could you tell me where did you look for it and did not find it? I
> > just
> > > >> checked and both
> > > >> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> and
> > > >> draft of the release on GitHub have them.
> > > >>
> > > >> Thank you
> > > >> Przemek
> > > >>
> > > >> On 2020/02/01 14:23:11, Tao Lv  wrote:
> > > >> > It seems the src tar and signature are missing from the tag.
> > > >> >
> > > >> > On Fri, Jan 31, 2020 at 11:09 AM Przemysław Trędak <
> > > ptre...@apache.org>
> > > >> > wrote:
> > > >> >
> > > >> > > Dear MXNet community,
> > > >> > >
> > > >> > > This is the vote to release Apache MXNet (incubating) version
> > 1.6.0.
> > > >> > > Voting starts today and will close on Monday 2/3/2020 23:59 PST.
> > > >> > >
> > > >> > > Link to release notes:
> > > >> > >
> > > https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
> > > >> > >
> > > >> > > Link to release candidate:
> > > >> > >
> https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > >> > >
> > > >> > > Link to source and signatures on apache dist server:
> > > >> > >
> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> > > >> > >
> > > >> > > The differences comparing to previous release candidate
> 1.6.0.rc1:
> > > >> > >  * Fixes for license issues (#17361, #17375, #17370, #17460)
> > > >> > >  * Bugfix for saving LSTM layer parameter (#17288)
> > > >> > >  * Bugfix for downloading the model from model zoo from multiple
> > > >> processes
> > > >> > > (#17372)
> > > >> > >  * Fixed a symbol.py in AMP for GluonNLP (#17408)
> > > >> > >
> > > >> > >
> > > >> > > Please remember to TEST first before voting accordingly:
> > > >> > > +1 = approve
> > > >> > > +0 = no opinion
> > > >> > > -1 = disapprove (provide reason)
> > > >> > >
> > > >> > >
> > > >> > > Best regards,
> > > >> > > Przemyslaw Tredak
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
>
> --
> *Chaitanya Prakash Bapat*
> *+1 (973) 953-6299*
>
> [image: https://www.linkedin.com//in/chaibapat25]
> [image: https://www.facebook.com/chaibapat
> ]
> [image:
> https://twitter.com/ChaiBapchya] [image:
> https://www.linkedin.com//in/chaibapat25]
> 
>


[ANNOUNCE] Python2 is no longer supported after MXNet 1.6 release

2020-02-03 Thread Pedro Larroy
Hi all

As per the merge of https://github.com/apache/incubator-mxnet/pull/15990, and
as agreed with the community, we will no longer support Python 2 in upcoming
releases of MXNet.

Special thanks to Leonard for facilitating this.

Pedro.
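
For downstream packaging, the practical effect is along these lines (a sketch
of the kind of guard and metadata involved; the exact mechanism in MXNet's
own setup files may differ):

    # setup.py-style sketch: refuse to install on Python 2 and advertise the floor.
    import sys
    from setuptools import setup

    if sys.version_info < (3, 5):
        sys.exit("This release requires Python >= 3.5; Python 2 is no longer supported.")

    setup(
        name="example-package",      # illustrative name, not the real MXNet setup.py
        version="0.0.1",
        python_requires=">=3.5",     # pip-level guard against Python 2 installs
    )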


Re: MXNet 1.6 as last release with Python 2 support?

2020-01-23 Thread Pedro Larroy
This is not a good user experience. I have heard of impacts on some users and
projects.

Thanks.

On Tue, Jan 21, 2020 at 10:44 PM Skalicky, Sam 
wrote:

> Also, it has been reported that pip wheel installation with latest pip
> version 20.0.1 breaks installation of MXNet pip wheels which have py2.py3
> in the wheel name. This breaks all existing released versions. Work around
> is to install the older version of pip "pip install pip==19.3.1”.
>
> Sam
>
> > On Jan 21, 2020, at 4:35 PM, Chung, Alex 
> wrote:
> >
> > +1
> >
> > Sincerely,
> >
> > Alex Chung
> > Senior Product Manager | AWS AI
> >
> > 
> > From: shiwen hu 
> > Sent: Tuesday, January 21, 2020 4:26 PM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: MXNet 1.6 as last release with Python 2 support?
> >
> > +1
> >
> >> Lai Wei  wrote on Sat, Jan 18, 2020 at 2:51 AM:
> >
> >> +1
> >>
> >>
> >> Best Regards
> >>
> >> Lai
> >>
> >>
> >> On Fri, Jan 17, 2020 at 10:39 AM Lin Yuan  wrote:
> >>
> >>> +1
> >>>
> >>> On Fri, Jan 17, 2020 at 10:04 AM Xingjian SHI 
> >>> wrote:
> >>>
>  +1. We should move to support Python>=3.5 only.
> 
>  Get Outlook for iOS
>  
>  From: Lausen, Leonard 
>  Sent: Friday, January 17, 2020 10:02:30 AM
>  To: d...@mxnet.apache.org 
>  Subject: Re: MXNet 1.6 as last release with Python 2 support?
> 
>  If the lazy consensus passes, I believe the minimum Python version
>  supported
>  would be Python 3.5.
> 
>  Python 3.5 because it seems to be the minimum Python 3 version tested
> >> by
>  our CI,
>  specifically in the jobs running on Ubuntu 16.04.
> 
>  Best regards
>  Leonard
> 
>  On Fri, 2020-01-17 at 17:36 +, Lausen, Leonard wrote:
> > Dear MXNet community,
> >
> > as effective January 1, 2020, no new bug reports, fixes, or changes
> >>> will
>  be
> > made
> > to Python 2, and as MXNet 1.6 will be released after January 1,
> >> 2020, I
> > suggest
> > to announce in the MXNet 1.6 release notes that MXNet 1.6 is the last
>  release
> > supporting Python 2.
> >
> > We have previously reached consensus on announcing that Python 2 is
>  dropped in
> > the next major release (ie. MXNet 2), however, given the delay in 1.6
>  release,
> > the plan to release 1.7 in the future and that Python 2 is dead
> >>> already I
> > think
> > we can revisit this assumption.
> >
> > Advantages are
> > - Time savings for developers, as Python 3 standard library contains
> >>> more
> >  features than Python 2, and it is more efficient to target only 1
>  language
> >  (Python 3) instead of 2 languages (Python 2 & 3)
> > - Simplification and cost savings for CI
> >
> > I thus suggest 72h lazy consensus for announcing dropping of Python 2
> >>> as
> > described above. If you disagree, please veto (send "-1") and we can
>  continue
> > supporting Python 2 in all 1.x releases as per previous consensus.
> >> Note
>  that
> > at
> > the time of previous consensus, no 1.7 release was planned.
> >
> > Best regards
> > Leonard
> 
> >>>
> >>
>
>


Re: Stop redistributing source code of 3rdparty dependencies to avoid licensing issues

2020-01-19 Thread Pedro Larroy
-1

I think it is brittle to ship source code that needs network connectivity to
build. The network is always in flux: source archives that have to download
too many dependencies at build time will end up broken over time. I would
expect the source to build with a reasonable set of well-known system
dependencies.


On Friday, January 17, 2020, Marco de Abreu  wrote:
> I agree with Tianqi. We may change our build system, but this won't free
us
> from the necessity to validate the licenses of our dependencies.
>
> The question at this point is whether we are allowed to differentiate
> between our main-source and hold it to the strict standards while treating
> the third party folder as dependency, where we only have to verify that
the
> projects are licensed with an Apache compatible license.
>
> At the moment, the project already treats them different: our license
> checks exclude third party. I think this is where the disparity is coming
> from. I'd recommend we discuss with Apache how we can handle this
> situation: package third party code for user convenience while limiting
> responsibility.
>
> In the end, we still have to ensure that everything is licensed properly,
> so maybe we should try to align both processes to match the real world
> instead of changing the real world to match the process.
>
> -Marco
>
> Tianqi Chen  schrieb am Fr., 17. Jan. 2020,
20:44:
>
>> I don't have an opinion, but would like to list pros and cons of doing
so.
>>
>> The pro of doing so is that it indeed simplifies the release process, as
>> these additional dependencies becomes category-B level dependencies as in
>> https://www.apache.org/legal/resolved.html
>>
>> The con of doing so is that it brings additional burden to the users of
the
>> software to check the license of these dependencies, in some sense,
>> including these information in the
>> license actually gives an extra level of transparency.
>>
>> The copyright message in some of the dependencies are a bit unfortunate,
>> one potential way to run the check is to write a python script to go
>> through the files and detect the line Copyright and cross match and add
>> them.
>>
>> Note that good models to follow are
>> - hadoop: https://github.com/apache/hadoop/tree/trunk/licenses
>> - flink: https://github.com/apache/flink
>>
>> Each of the repo have a licenses folder that contains licenses, and
things
>> points to them.
>>
>> I am not a lawyer, but the case for ps-lite seems can be resolved as long
>> as we can confirm these files follows Apache-2.0, as
>> https://www.apache.org/licenses/LICENSE-2.0 only requires us to
>> redistribute
>> the license and anything in the NOTICE, but we do not have the obligation
>> to list all the copyright messages in the source content.
>>
>> TQ
>>
>> On Fri, Jan 17, 2020 at 11:10 AM Yuan Tang 
>> wrote:
>>
>> > +1
>> >
>> > On Fri, Jan 17, 2020 at 1:59 PM Chris Olivier 
>> > wrote:
>> >
>> > > +1
>> > >
>> > > On Fri, Jan 17, 2020 at 10:19 AM Lausen, Leonard
>> > > > > >
>> > > wrote:
>> > >
>> > > > Dear MXNet community,
>> > > >
>> > > > as per recent mail on gene...@incubator.apache.org [1] there are a
>> > > number
>> > > > of
>> > > > licensing issues in MXNet 1.6rc1. Based on anecdotal evidence I
>> believe
>> > > > there
>> > > > has been no release so far without any licensing issues, which is a
>> > > > blocker to
>> > > > MXNet graduating from it's incubating status. One contributing
factor
>> > is
>> > > > that we
>> > > > bundle 3rdparty source code in our releases [2].
>> > > >
>> > > > One key factor is that 3rdparty projects don't always enforce
>> licensing
>> > > > best
>> > > > practice in the way we do. For example, 3rdparty/ps-lite doesn't
>> > enforce
>> > > > license
>> > > > headers in the source files and there has been confusion about the
>> > > license
>> > > > of
>> > > > recent contributions by ByteDance (See [1]).
>> > > >
>> > > > To avoid such licensing issues in MXNet releases a simple solution
is
>> > to
>> > > > stop
>> > > > distributing the 3rdparty code in our source releases. Instead, we
>> can
>> > > > adapt our
>> > > > buildsystem to download 3rdparty code as part of the build
>> > configuration
>> > > > process. CMake makes this very easy with the FetchContent module
[3].
>> > > >
>> > > > For development purpose involving changes to the 3rdparty source or
>> > build
>> > > > systems that can't access the internet, there are easy means for
>> > > > specifying the
>> > > > location of local sources (instead of downloading), via the
>> > > > FETCHCONTENT_SOURCE_DIR_ variable [4].
>> > > >
>> > > > Would there be any concerns about such approach? Obviously it can
>> only
>> > be
>> > > > fully
>> > > > implemented as soon as the CMake build system is feature complete
and
>> > the
>> > > > Makefile build can be dropped. (Note that the Makefile build is
being
>> > > > deprecated
>> > > > and removed as part of MXNet 2 roadmap [5])
>> > > >
>> > > > Best regards
>> > > > Leonard
>> > > >

Re: CD with windows need a special jenkins slave machine like restricted-utility

2020-01-13 Thread Pedro Larroy
Thanks, it's working after updating to a 64-bit compiler.
https://github.com/apache/incubator-mxnet/pull/17206

On Mon, Jan 13, 2020 at 4:55 PM Pedro Larroy 
wrote:

> Isn't this something that gets selected through vcvars?
>
> On Fri, Jan 10, 2020 at 6:46 PM shiwen hu  wrote:
>
>> use x64 host msvc. cmake -T host=x64
>>
>> Pedro Larroy  wrote on Fri, Jan 10, 2020 at 7:28 AM:
>>
>> > Is there a solution for this error in VS2017?
>> >
>> > c:\users\administrator\mxnet\src\operator\mxnet_op.h(943) : fatal error
>> > C1002: compiler is out of heap space in pass 2
>> >
>> >
>> >
>> > On Tue, Jan 7, 2020 at 5:11 PM shiwen hu  wrote:
>> >
>> > > >
>> > > > I personally encountered the problem that 2015 can't compile in high
>> > > > version cuda. But I can't remember the details. We can continue to
>> use
>> > > 2015
>> > > > until we encounter problems.
>> > > >
>> > >
>> >
>>
>


Re: CD with windows need a special jenkins slave machine like restricted-utility

2020-01-13 Thread Pedro Larroy
Isn't this something that gets selected through vcvars?

On Fri, Jan 10, 2020 at 6:46 PM shiwen hu  wrote:

> use x64 host msvc. cmake -T host=x64
>
> > Pedro Larroy  wrote on Fri, Jan 10, 2020 at 7:28 AM:
>
> > Is there a solution for this error in VS2017?
> >
> > c:\users\administrator\mxnet\src\operator\mxnet_op.h(943) : fatal error
> > C1002: compiler is out of heap space in pass 2
> >
> >
> >
> > On Tue, Jan 7, 2020 at 5:11 PM shiwen hu  wrote:
> >
> > > >
> > > > I personally encountered the problem that 2015 can't compile in high
> > > > version cuda. But I can't remember the details. We can continue to
> use
> > > 2015
> > > > until we encounter problems.
> > > >
> > >
> >
>


Re: CD with windows need a special jenkins slave machine like restricted-utility

2020-01-09 Thread Pedro Larroy
Is there a solution for this error in VS2017?

c:\users\administrator\mxnet\src\operator\mxnet_op.h(943) : fatal error
C1002: compiler is out of heap space in pass 2



On Tue, Jan 7, 2020 at 5:11 PM shiwen hu  wrote:

> >
> > I personally encountered the problem that 2015 can't compile in high
> > version cuda. But I can't remember the details. We can continue to use
> 2015
> > until we encounter problems.
> >
>


Re: Stopping nightly releases to Pypi

2020-01-08 Thread Pedro Larroy
Thanks for your detailed responses.

Having CodeBuild execute the recipe that lives in the Apache repository gives
the same effect and control that you would have with a service such as Travis
CI, and the builds are fully reproducible. So it's under the full control of
Apache in the same way that any other hosted build solution is: any
modification to the recipe would be picked up on the next commit, and there
would be no CodeBuild configuration outside the Apache MXNet repository,
since the pipeline and its config would live in the git repo.

And as you rightly pointed out, the Jenkins master is a weak point with
respect to the restricted slaves. This was strongly criticized during the
system review, and there is precedent for security flaws in the master.
Insisting on mixing CI and CD is not a good recommendation, for the reasons
explained above.

Pedro.

On Wed, Jan 8, 2020 at 2:41 PM Marco de Abreu 
wrote:

> Correct, I'm not bothered by the s3 bucket but by way how it gets
> published. It's not in Jenkins, so it's outside of the projects control.
>
> The security design due to the restricted nodes makes sure that no third
> party can gain access to these machines. They use separate caches, separate
> volumes, different instance profiles etc - I personally would consider the
> restricted slaves safe. If you're telling me that restricted slaves have
> been compromised with a crypto Miner, I'd be happy to discuss that matter
> and assist.
>
> Another attack vector is the Jenkins master, correct. If somebody
> infiltrates the Jenkins master, they can use that to jump onto the
> restricted slaves. They might modify the created artifacts, but once the
> system gets cleaned up, we're good to go again (You might rather want to
> consider a virus scan on the machines and created artifacts).
>
> But now let's say Jenkins master gets comprised. In that case, the
> artifacts are not the issue but the credentials. Jenkins contains committer
> credentials, which would allow to inject malware into our repository. Don't
> forget that a committer can add commits to other PRs, manually fake the CI
> status and then squash the PR to basically hide most of the traces. Unless
> someone reviews every single commit on master, we're basically out of luck.
>
> So yeah, that attack vector through the Jenkins master is valid, but
> considering that there are bigger risks involved in the system and the
> slaves themselves are pretty well protected, I'd not consider CD a severe
> issue in relation to the overall risk score of our system.
>
> So in order to make sure that we're well protected, I'd recommend to spend
> a bit of time on adapting the Jenkins pipeline to upload to s3 and then use
> all the remaining time to actually harden the Jenkins master and make sure
> that everything is constantly kept up to date. Security-wise, I'd consider
> that a way better investment than developing a new CD.
>
> -Marco
>
> Pedro Larroy  schrieb am Mi., 8. Jan. 2020,
> 22:49:
>
> > Marco, if you are fine publishing to an S3 bucket, what's your concern?
> > using a codebuild pipeline? The build logs could be push to the s3 bucket
> > if this is your concern.
> >
> > As I said before, having binary releases in the current CI doesn't stand
> a
> > chance to pass security review as it is today, it's not safe and is a bad
> > idea, alternatives are
> > 1 -Code Build (you don't support this because it's company owned, did I
> > understand this correctly?)
> > 2 - Apache owned Jenkins (can you help with this?)
> > 3 - Travis CI or similar, which in the end is similar to code build.
> > 4- Another Jenkins just for CD (who owns?)
> >
> > Pedro.
> >
> > On Wed, Jan 8, 2020 at 1:01 PM Marco de Abreu 
> > wrote:
> >
> > > The risk of the current CD via Jenkins is known and was accepted as
> part
> > of
> > > adopting Jenkins. The solution for the initial issue - no longer
> > publishing
> > > to pypi - is to add a step to the existing CD pipeline which publishes
> > the
> > > package to the s3 bucket instead of pypi.
> > >
> > > -Marco
> > >
> > > Pedro Larroy  schrieb am Mi., 8. Jan.
> > 2020,
> > > 21:55:
> > >
> > > > I understand your point. But you don't provide an alternative, and
> > > building
> > > > binary releases from the CI jenkins as it is today is a very bad idea
> > > since
> > > > it's an unsafe environment. I think it's fair to ask if you are
> vetoing
> > > > using codebuild for nightly releases you could provide an alternative
> > > > solution (for example Apache hosted Jen

Re: Stopping nightly releases to Pypi

2020-01-08 Thread Pedro Larroy
Marco, if you are fine publishing to an S3 bucket, what's your concern?
Using a CodeBuild pipeline? The build logs could be pushed to the S3 bucket
if that is the issue.

As I said before, building binary releases in the current CI doesn't stand a
chance of passing a security review as it is today; it's not safe and is a
bad idea. The alternatives are:
1 - CodeBuild (you don't support this because it's company owned, did I
understand that correctly?)
2 - An Apache-owned Jenkins (can you help with this?)
3 - Travis CI or similar, which in the end is similar to CodeBuild.
4 - Another Jenkins just for CD (who would own it?)

Pedro.

On Wed, Jan 8, 2020 at 1:01 PM Marco de Abreu 
wrote:

> The risk of the current CD via Jenkins is known and was accepted as part of
> adopting Jenkins. The solution for the initial issue - no longer publishing
> to pypi - is to add a step to the existing CD pipeline which publishes the
> package to the s3 bucket instead of pypi.
>
> -Marco
>
> Pedro Larroy  schrieb am Mi., 8. Jan. 2020,
> 21:55:
>
> > I understand your point. But you don't provide an alternative, and
> building
> > binary releases from the CI jenkins as it is today is a very bad idea
> since
> > it's an unsafe environment. I think it's fair to ask if you are vetoing
> > using codebuild for nightly releases you could provide an alternative
> > solution (for example Apache hosted Jenkins) or anything else. As you are
> > well aware non-committers can't communicate with Apache Infra or make
> > requests, so the onus is on you or other Apache person to provide a
> > solution that aligns with Apache values.
> >
> > So far I see Sam trying to help with codebuild managed binary releases
> and
> > this is taken as a tinfoil hat corporate conspiracy. It's a pity that you
> > claim to endorse Apache values but not support what's best for the
> project,
> > which is to have things clean and in working order. I don't think users
> > care where the binary releases are hosted.
> >
> > Pedro.
> >
> > On Sun, Jan 5, 2020 at 5:56 AM Marco de Abreu 
> > wrote:
> >
> > > Apache only cares about source releases as far as official releases are
> > > concerned. But Apache also cares about it's brand and image. You are
> > right
> > > that anybody can compile an Apache project and distribute it, but it's
> > > under the PMCs control what can be advertised as official. This
> includes
> > > the following examples:
> > >
> > > - The official MXNet pypi, dockerhub, maven, etc account
> > > - The MXNet website
> > > - anything advertising to be MXNet
> > >
> > > If you publish a binary release and call it "AwesomeSpaghettiBolognese"
> > > while it's MXNet under the hood, that's totally in line with the Apache
> > > license. But if you decide to publish an MXNet branded package, then
> > that's
> > > covered by the brand protection. I won't go into much more detail about
> > > legal reasons since that's not helping this discussion.
> > >
> > > I personally am vetoing a company-owned distribution channel to be
> > > advertised on the MXNet website or any official documentation. Also,
> I'd
> > > like to make sure that users do not mistake it for being a release that
> > is
> > > affiliated or endorsed by Apache MXNet.
> > >
> > > We are taking a step back here and it's a pity to see that some people
> > are
> > > still not endorsing the Apache values. This will be my last email
> > regarding
> > > that topic and I will only follow up with actions after the 15th of
> > January
> > > has been reached.
> > >
> > > Best regards
> > > Marco
> > >
> > >
> > > Pedro Larroy  schrieb am Sa., 4. Jan.
> > 2020,
> > > 02:38:
> > >
> > > > Hey Marco.
> > > >
> > > > As far as I have learned from other Apache mailing lists while
> lurking
> > is
> > > > that Apache only cares about making source releases, binaries are a
> > > > courtesy to users that some projects decide to do, but I'm not sure I
> > > > understand your concerns regarding the PMC and what exactly are you
> > > vetoing
> > > > here, since everyone can compile, build and package our project as
> per
> > > the
> > > > open source license. I would suggest to have a constructive approach
> > and
> > > > see how we can make this happen for the best of the project,
> specially
> > > > since somebody is volunteering to help with this and dedicate
>

Re: Stopping nightly releases to Pypi

2020-01-08 Thread Pedro Larroy
It's not about Jenkins the software; it's about the CI environment, which is
not secure. Last week there was crypto-mining activity in the dev
environment, and code could be injected into binary releases very easily.
There should be a separate instance for CD, so maybe you can facilitate that
with Apache as part of your suggestion.

On Wed, Jan 8, 2020 at 1:01 PM Marco de Abreu 
wrote:

> The risk of the current CD via Jenkins is known and was accepted as part of
> adopting Jenkins. The solution for the initial issue - no longer publishing
> to pypi - is to add a step to the existing CD pipeline which publishes the
> package to the s3 bucket instead of pypi.
>
> -Marco
>
> Pedro Larroy  schrieb am Mi., 8. Jan. 2020,
> 21:55:
>
> > I understand your point. But you don't provide an alternative, and
> building
> > binary releases from the CI jenkins as it is today is a very bad idea
> since
> > it's an unsafe environment. I think it's fair to ask if you are vetoing
> > using codebuild for nightly releases you could provide an alternative
> > solution (for example Apache hosted Jenkins) or anything else. As you are
> > well aware non-committers can't communicate with Apache Infra or make
> > requests, so the onus is on you or other Apache person to provide a
> > solution that aligns with Apache values.
> >
> > So far I see Sam trying to help with codebuild managed binary releases
> and
> > this is taken as a tinfoil hat corporate conspiracy. It's a pity that you
> > claim to endorse Apache values but not support what's best for the
> project,
> > which is to have things clean and in working order. I don't think users
> > care where the binary releases are hosted.
> >
> > Pedro.
> >
> > On Sun, Jan 5, 2020 at 5:56 AM Marco de Abreu 
> > wrote:
> >
> > > Apache only cares about source releases as far as official releases are
> > > concerned. But Apache also cares about it's brand and image. You are
> > right
> > > that anybody can compile an Apache project and distribute it, but it's
> > > under the PMCs control what can be advertised as official. This
> includes
> > > the following examples:
> > >
> > > - The official MXNet pypi, dockerhub, maven, etc account
> > > - The MXNet website
> > > - anything advertising to be MXNet
> > >
> > > If you publish a binary release and call it "AwesomeSpaghettiBolognese"
> > > while it's MXNet under the hood, that's totally in line with the Apache
> > > license. But if you decide to publish an MXNet branded package, then
> > that's
> > > covered by the brand protection. I won't go into much more detail about
> > > legal reasons since that's not helping this discussion.
> > >
> > > I personally am vetoing a company-owned distribution channel to be
> > > advertised on the MXNet website or any official documentation. Also,
> I'd
> > > like to make sure that users do not mistake it for being a release that
> > is
> > > affiliated or endorsed by Apache MXNet.
> > >
> > > We are taking a step back here and it's a pity to see that some people
> > are
> > > still not endorsing the Apache values. This will be my last email
> > regarding
> > > that topic and I will only follow up with actions after the 15th of
> > January
> > > has been reached.
> > >
> > > Best regards
> > > Marco
> > >
> > >
> > > Pedro Larroy  schrieb am Sa., 4. Jan.
> > 2020,
> > > 02:38:
> > >
> > > > Hey Marco.
> > > >
> > > > As far as I have learned from other Apache mailing lists while
> lurking
> > is
> > > > that Apache only cares about making source releases, binaries are a
> > > > courtesy to users that some projects decide to do, but I'm not sure I
> > > > understand your concerns regarding the PMC and what exactly are you
> > > vetoing
> > > > here, since everyone can compile, build and package our project as
> per
> > > the
> > > > open source license. I would suggest to have a constructive approach
> > and
> > > > see how we can make this happen for the best of the project,
> specially
> > > > since somebody is volunteering to help with this and dedicate
> valuable
> > > > compute resources and people's time.
> > > >
> > > > Regarding manual changes I don't see any need to have access to a
> code
> > > > build control plane for *anybody*, for several reasons, first is that
> > > > ma

Re: Stopping nightly releases to Pypi

2020-01-08 Thread Pedro Larroy
I understand your point, but you don't provide an alternative, and building
binary releases from the CI Jenkins as it is today is a very bad idea since
it's an unsafe environment. I think it's fair to ask that, if you are vetoing
the use of CodeBuild for nightly releases, you provide an alternative
solution (for example an Apache-hosted Jenkins) or anything else. As you are
well aware, non-committers can't communicate with Apache Infra or make
requests, so the onus is on you or another Apache person to provide a
solution that aligns with Apache values.

So far I see Sam trying to help with CodeBuild-managed binary releases, and
this is taken as a tinfoil-hat corporate conspiracy. It's a pity that you
claim to endorse Apache values but do not support what's best for the project,
which is to have things clean and in working order. I don't think users
care where the binary releases are hosted.

Pedro.

On Sun, Jan 5, 2020 at 5:56 AM Marco de Abreu 
wrote:

> Apache only cares about source releases as far as official releases are
> concerned. But Apache also cares about its brand and image. You are right
> that anybody can compile an Apache project and distribute it, but it's
> under the PMC's control what can be advertised as official. This includes
> the following examples:
>
> - The official MXNet pypi, dockerhub, maven, etc account
> - The MXNet website
> - anything advertising to be MXNet
>
> If you publish a binary release and call it "AwesomeSpaghettiBolognese"
> while it's MXNet under the hood, that's totally in line with the Apache
> license. But if you decide to publish an MXNet branded package, then that's
> covered by the brand protection. I won't go into much more detail about
> legal reasons since that's not helping this discussion.
>
> I personally am vetoing a company-owned distribution channel to be
> advertised on the MXNet website or any official documentation. Also, I'd
> like to make sure that users do not mistake it for being a release that is
> affiliated or endorsed by Apache MXNet.
>
> We are taking a step back here and it's a pity to see that some people are
> still not endorsing the Apache values. This will be my last email regarding
> that topic and I will only follow up with actions after the 15th of January
> has been reached.
>
> Best regards
> Marco
>
>
> Pedro Larroy  schrieb am Sa., 4. Jan. 2020,
> 02:38:
>
> > Hey Marco.
> >
> > As far as I have learned from other Apache mailing lists while lurking is
> > that Apache only cares about making source releases, binaries are a
> > courtesy to users that some projects decide to do, but I'm not sure I
> > understand your concerns regarding the PMC and what exactly are you
> vetoing
> > here, since everyone can compile, build and package our project as per
> the
> > open source license. I would suggest to have a constructive approach and
> > see how we can make this happen for the best of the project, specially
> > since somebody is volunteering to help with this and dedicate valuable
> > compute resources and people's time.
> >
> > Regarding manual changes I don't see any need to have access to a code
> > build control plane for *anybody*, for several reasons, first is that
> > manual access to production account is a discouraged practice and are
> best
> > managed through pipeline deployments, second is that Code build is a
> hosted
> > service which is basically just using a build description file to do the
> > work, there's no need to do any manual fiddling or triggering. If all the
> > CD and description files are in the apache repository you can use your
> own
> > account or compute resources to do your own build flavor if you so
> desire.
> >
> > Is your proposal to host this in Apache infrastructure?  Maybe I'm
> missing
> > something on this conversation
> >
> > Pedro.
> >
> >
> > On Fri, Jan 3, 2020 at 3:21 PM Marco de Abreu 
> > wrote:
> >
> > > Sam, while I understand that this solution was developed out of
> > necessity,
> > > my question is why a new system has been developed instead of fixing the
> > > existing one or adapting the solution. CodeBuild is a scheduler in the
> > same
> > > fashion as Jenkins is. It runs code. So you can adapt it to Jenkins
> > without
> > > much hassle.
> > >
> > > I'm not volunteering for this - why should I? The role of a PMC member
> is
> > > to steer the direction of the project. Just because a manager points
> > > towards a certain direction, it doesn't mean that they're going to do
> it.
> > >
> > > Apparently there was enough time at some point to develop a new
> solu

Re: CD with windows need a special jenkins slave machine like restricted-utility

2020-01-07 Thread Pedro Larroy
I'm putting in some effort on the side to improve the state of this:

If you want to help:

https://github.com/apache/incubator-mxnet/pull/17206

https://github.com/aiengines/ci/tree/master/windows

Which of the CUDA versions you listed does it need? I did some work on the
side to update VS and CMake to 3.16.2; you can test the scripts in the
windows folder above by running the three scripts there on a fresh Windows
instance. The older CMake version has a bug which introduces a newline in
the path and renders everything unusable; I installed 3.16.2, but it still
needs to be added to the path by the install script.

You can start a fresh GPU instance with this AMI:  aws ssm get-parameter
--name /aws/service/ami-windows-latest/Windows_Server-2019-English-Full-Base
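
For convenience, the same lookup can be scripted; below is a minimal boto3
sketch that resolves the AMI and starts an instance from it. The region,
instance type and the omitted key pair / security group settings are
assumptions for illustration only and are not part of the command above.

import boto3

REGION = "us-west-2"           # assumption
INSTANCE_TYPE = "g4dn.xlarge"  # assumption: any NVIDIA GPU instance type works

ssm = boto3.client("ssm", region_name=REGION)
ami_id = ssm.get_parameter(
    Name="/aws/service/ami-windows-latest/Windows_Server-2019-English-Full-Base"
)["Parameter"]["Value"]

ec2 = boto3.client("ec2", region_name=REGION)
reservation = ec2.run_instances(
    ImageId=ami_id, InstanceType=INSTANCE_TYPE, MinCount=1, MaxCount=1
)
print(ami_id, reservation["Instances"][0]["InstanceId"])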

Once this is working, we can update the AMI from CI. Also, this needs to be
adjusted for the new VS 2019:

https://github.com/apache/incubator-mxnet/blob/master/ci/build_windows.py#L42
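
As a purely hypothetical sketch of the kind of adjustment meant here (the
actual variable names and layout of ci/build_windows.py may differ), the VS
2019 environment would get its own vcvarsall.bat entry; the install path
shown is the default for the Community edition and is an assumption:

KNOWN_VCVARS = {
    # existing entry (path is illustrative)
    "VS 2015": r"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat",
    # new entry for VS 2019 (assumed default Community install location)
    "VS 2019": r"C:\Program Files (x86)\Microsoft Visual Studio\2019"
               r"\Community\VC\Auxiliary\Build\vcvarsall.bat",
}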

To update CUDA and the NVIDIA driver, these two bundles should be added to
the script:
https://github.com/aiengines/ci/blob/master/windows/windows_deps_headless_installer.py

https://windows-post-install.s3-us-west-2.amazonaws.com/cuda.zip

https://windows-post-install.s3-us-west-2.amazonaws.com/nv_driver_418.81.zip
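
A rough sketch, not the actual installer code, of how the two bundles above
could be fetched and unpacked from windows_deps_headless_installer.py; the
helper name, destination directory and silent-install step are assumptions:

import io
import urllib.request
import zipfile

BUNDLES = [
    "https://windows-post-install.s3-us-west-2.amazonaws.com/cuda.zip",
    "https://windows-post-install.s3-us-west-2.amazonaws.com/nv_driver_418.81.zip",
]

def download_and_extract(url, dest=r"C:\deps"):
    # Download the zip into memory and extract it to dest.
    with urllib.request.urlopen(url) as resp:
        zipfile.ZipFile(io.BytesIO(resp.read())).extractall(dest)

for url in BUNDLES:
    download_and_extract(url)
# The extracted CUDA / driver setup executables would then be run silently,
# e.g. with subprocess.run([...], check=True); the exact flags depend on the bundle.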

Send PRs if you want to collaborate.

Pedro.




On Tue, Jan 7, 2020 at 6:13 AM Lausen, Leonard 
wrote:

> Regarding visual studio 2019: It seems we currently support Visual Studio
> 2015?
> Is there anything that Visual Studio 2015 can't do? If so, code and
> documentation should also be updated based on the new minimum version.
>
> On Tue, 2020-01-07 at 14:19 +0800, shiwen hu wrote:
> > it needs Visual Studio 2019, CUDA 9.0 9.2 10.0 10.1 10.2,
> > CMake 3.16.2, jom, OpenCV, OpenBLAS.
> > What do I need to do? Who should I contact?
>


Re: Stopping nightly releases to Pypi

2020-01-03 Thread Pedro Larroy
Hey Marco.

As far as I have learned from lurking on other Apache mailing lists, Apache
only cares about making source releases; binaries are a courtesy to users
that some projects decide to provide. But I'm not sure I understand your
concerns regarding the PMC and what exactly you are vetoing here, since
everyone can compile, build and package our project as per the open source
license. I would suggest taking a constructive approach and seeing how we
can make this happen for the best of the project, especially since somebody
is volunteering to help with this and to dedicate valuable compute resources
and people's time.

Regarding manual changes, I don't see any need for *anybody* to have access
to a CodeBuild control plane, for several reasons: first, manual access to a
production account is a discouraged practice, and such changes are best
managed through pipeline deployments; second, CodeBuild is a hosted service
which basically just uses a build description file to do the work, so there
is no need for any manual fiddling or triggering. If all the CD and build
description files are in the Apache repository, you can use your own account
or compute resources to produce your own build flavor if you so desire.

Is your proposal to host this in Apache infrastructure? Maybe I'm missing
something in this conversation.

Pedro.


On Fri, Jan 3, 2020 at 3:21 PM Marco de Abreu 
wrote:

> Sam, while I understand that this solution was developed out of necessity,
> my question is why a new system has been developed instead of fixing the
> existing one or adapting the solution. CodeBuild is a scheduler in the same
> fashion as Jenkins is. It runs code. So you can adapt it to Jenkins without
> much hassle.
>
> I'm not volunteering for this - why should I? The role of a PMC member is
> to steer the direction of the project. Just because a manager points
> towards a certain direction, it doesn't mean that they're going to do it.
>
> Apparently there was enough time at some point to develop a new solution
> from scratch. It might have been a solution for your internal team and
> that's fine, but upgrading it "temporarily" to be the advertised way on the
> official website is something different.
>
> I won't argue about how the veto can be enforced. I think it's in the best
> interest of the project if we try working on a solution instead of spending
> time on trying to figure out the power of the PMC.
>
> Pedro, that's certainly a step towards the right direction. But committers
> would also need access to the control plane of the system - to trigger,
> stop and audit builds. We could go down that road, but i think the fewer
> systems, the better - also for the sake of maintainability.
>
> Best regards,
> Marco
>
>
>
> Pedro Larroy  schrieb am Fr., 3. Jan. 2020,
> 20:55:
>
> > I'm not involved in such efforts, but one possibility is to have the yaml
> > files that describe the pipelines for CD in the Apache repositories,
> would
> > that be acceptable from the Apache POV? In the end they should be very
> thin
> > and calling the scripts that are part of the CD packages.
> >
> > On Fri, Jan 3, 2020 at 6:56 AM Marco de Abreu 
> > wrote:
> >
> > > Agree, but the question how a non Amazonian is able to maintain and
> > access
> > > the system is still open. As it stands right now, the community has
> > taken a
> > > step back and loses some control if we continue down that road.
> > >
> > > I personally am disapproving of that approach since committers are no
> > > longer in control of that process. So far it seems like my questions
> were
> > > skipped and further actions have been taken. As openness and the
> > community
> > > having control are part of our graduation criteria, I'm putting in my
> > veto
> > > with a grace period until 15th of January. Please bring the system
> into a
> > > state that aligns with Apache values or revert the changes.
> > >
> > > -Marco
> > >
> > > Pedro Larroy  schrieb am Fr., 3. Jan.
> > 2020,
> > > 03:33:
> > >
> > > > CD should be separate from CI for security reasons in any case.
> > > >
> > > >
> > > > On Sat, Dec 7, 2019 at 10:04 AM Marco de Abreu <
> > marco.g.ab...@gmail.com>
> > > > wrote:
> > > >
> > > > > Could you elaborate how a non-Amazonian is able to access, maintain
> > and
> > > > > review the CodeBuild pipeline? How come we've diverted from the
> > > community
> > > > > agreed-on standard where the public Jenkins serves for the purpose
> of
> > > > > testing and releasing MXNet? I'd be curio

Re: Stopping nightly releases to Pypi

2020-01-02 Thread Pedro Larroy
CD should be separate from CI for security reasons in any case.


On Sat, Dec 7, 2019 at 10:04 AM Marco de Abreu 
wrote:

> Could you elaborate how a non-Amazonian is able to access, maintain and
> review the CodeBuild pipeline? How come we've diverted from the community
> agreed-on standard where the public Jenkins serves for the purpose of
> testing and releasing MXNet? I'd be curious about the issues you're
> encountering with Jenkins CI that led to a non-standard solution.
>
> -Marco
>
>
> Skalicky, Sam  schrieb am Sa., 7. Dez. 2019,
> 18:39:
>
> > Hi MXNet Community,
> >
> > We have been working on getting nightly builds fixed and made available
> > again. We’ve made another system using AWS CodeBuild & S3 to work around
> > the problems with Jenkins CI, PyPI, etc. It is currently building all the
> > flavors and publishing to an S3 bucket here:
> >
> >
> https://us-west-2.console.aws.amazon.com/s3/buckets/apache-mxnet/dist/?region=us-west-2
> >
> > There are folders for each set of nightly builds, try out the wheels
> > starting today 2019-12-07. Builds start at 1:30am PT (9:30am GMT) and
> > arrive in the bucket 30min-2hours later. Inside each folder are the
> wheels
> > for each flavor of MXNet. Currently we’re only building for linux, builds
> > for windows/Mac will come later.
> >
> > If you want to download the wheels easily you can use a URL in the form of:
> > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/<date>/dist/<flavor>-1.6.0b<date>-py2.py3-none-manylinux1_x86_64.whl
> >
> > Here's a set of links for today’s builds
> >
> > (Plain mxnet, no mkl no cuda)
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-mkl)
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-cuXXX)
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> > (mxnet-cuXXXmkl)
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu101mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> > You can easily install these pip wheels in your system either by
> > downloading them to your machine first and then installing by doing:
> >
> > pip install /path/to/downloaded/wheel.whl
> >
> > Or you can install directly by just giving the link to pip like this:
> >
> > pip install
> >
> https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> >
> > Credit goes to everyone involved (in no particular order)
> > Rakesh Vasudevan
> > Zach Kimberg
> > Manu Seth
> > Sheng Zha
> > Jun Wu
> > Pedro Larroy
> > Chaitanya Bapat
> >
> > Thanks!
> > Sam
> >
> >
> > On Dec 5, 2019, at 1:16 AM, Lausen, Leonard wrote:
> >
> > We don't lose pip by hosting on S3. We just don't host nightly releases
> > on Pypi
> > servers and mirror them to

Re: windows ci, Cmake update, diverging scripts

2020-01-02 Thread Pedro Larroy
I cleaned up the Windows setup and installation scripts. Now building MXNet
on Windows can be done by executing just *2* scripts: one to set up the
dependencies and another to build.
I also modified the install instructions with this simplified setup. Please
help review the PR. This also updates CMake to 3.15 as requested by the
developers.

https://github.com/apache/incubator-mxnet/pull/17206
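
For illustration, the two-step flow looks roughly like the sketch below; the
exact script locations and flags are assumptions on my side, and the PR above
is the authoritative reference:

import subprocess

# Step 1: install the build dependencies (VS, CMake, CUDA, OpenCV, OpenBLAS, ...)
subprocess.run(["python", "windows_deps_headless_installer.py"], check=True)
# Step 2: configure and build MXNet (flavor flag and value are assumptions)
subprocess.run(["python", "ci/build_windows.py", "-f", "WIN_CPU"], check=True)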

Afterwards I will configure the windows AMI pipeline to use this
environment so we can have CMake 3.15 in the windows AMI.

This is a streamlined workflow for developers using MXNet on Windows who
might want to integrate it with games or other commercial packages that need
deep learning.

Thanks.


On Mon, Dec 30, 2019 at 4:19 PM Pedro Larroy 
wrote:

> I have looked into this a bit, and seems the open source version which is
> in https://github.com/apache/incubator-mxnet-ci is older than what's
> already deployed.
> The root cause of the failure in the update job seems to be a hardcoded
> AMI which is no longer available. There seems to be a way now to query for
> the latest windows AMI:
> https://aws.amazon.com/blogs/mt/query-for-the-latest-windows-ami-using-systems-manager-parameter-store/
>
> On Mon, Dec 30, 2019 at 3:12 PM Pedro Larroy 
> wrote:
>
>> It's automated but broken as the execution is in failed state. I think we
>> will need an engineer to do repairs there.
>>
>> It's using systems manager automation to produce these AMIs.
>>
>> On Mon, Dec 30, 2019 at 1:44 PM Lausen, Leonard 
>> wrote:
>>
>>> Some more background:
>>>
>>> Since a few days, CI downloads and installs a more recent cmake version
>>> in the
>>> Windows job based on
>>>
>>> https://github.com/leezu/mxnet/blob/230ceee5d9e0e02e58be69dad1c4ffdadbaa1bd9/ci/build_windows.py#L148-L153
>>>
>>> This ad-hoc download and installation is not ideal and in fact a
>>> workaround
>>> until the base Windows AMI used by the CI server is updated. The script
>>> generating the base Windows AMI is tracked at
>>> https://github.com/apache/incubator-mxnet-ci and Shiwen Hu recently
>>> updated the
>>> script to include the updated cmake version:
>>> https://github.com/apache/incubator-mxnet-ci/pull/17
>>>
>>> It seems that this change needs to be deployed manually, which Pedro is
>>> attempting to do. But if I understand correctly Pedro found the public
>>> version
>>> of the AMI generation script and some currently used script diverged:
>>> http://ix.io/25WQ
>>>
>>>
>>>
>>> Questions:
>>> 1) Is there a git history associated with the version of the script that
>>> diverged?
>>>
>>> 2) According to
>>>
>>> https://github.com/apache/incubator-mxnet-ci/tree/master/services/jenkins-slave-creation-windows
>>> the Windows Base AMI should be created automatically. Why is it not done
>>> automatically anymore / why does the documentation claim it happens
>>> automatically but it doesn't?
>>>
>>> On Mon, 2019-12-30 at 12:11 -0800, Pedro Larroy wrote:
>>> > Hi
>>> >
>>> > I was looking at a request from Leonard for updating CMake on windows,
>>> and
>>> > I see that the post-install.py script which setups the windows
>>> environment
>>> > in CI has diverged significantly from the incubator-mxnet-ci and the
>>> > private repository that is used to deploy to production CI.
>>> >
>>> > https://github.com/apache/incubator-mxnet/pull/17031
>>> >
>>> > I see quite some patch of differences, there's also different directory
>>> > structure which Marco committed to incubator-mxnet-ci  and MKL seems
>>> to be
>>> > removed. My question why has this diverged so much, I was expecting to
>>> > transplant just a single patch to update CMake.
>>> >
>>> >
>>> > http://ix.io/25WQ
>>> >
>>> >
>>> > Pedro.
>>>
>>


Re: windows ci, Cmake update, diverging scripts

2019-12-30 Thread Pedro Larroy
I have looked into this a bit, and it seems the open source version, which
is in https://github.com/apache/incubator-mxnet-ci, is older than what's
already deployed.
The root cause of the failure in the update job seems to be a hardcoded AMI
which is no longer available. There seems to be a way now to query for the
latest windows AMI:
https://aws.amazon.com/blogs/mt/query-for-the-latest-windows-ami-using-systems-manager-parameter-store/

On Mon, Dec 30, 2019 at 3:12 PM Pedro Larroy 
wrote:

> It's automated but broken as the execution is in failed state. I think we
> will need an engineer to do repairs there.
>
> It's using systems manager automation to produce these AMIs.
>
> On Mon, Dec 30, 2019 at 1:44 PM Lausen, Leonard 
> wrote:
>
>> Some more background:
>>
>> Since a few days, CI downloads and installs a more recent cmake version
>> in the
>> Windows job based on
>>
>> https://github.com/leezu/mxnet/blob/230ceee5d9e0e02e58be69dad1c4ffdadbaa1bd9/ci/build_windows.py#L148-L153
>>
>> This ad-hoc download and installation is not ideal and in fact a
>> workaround
>> until the base Windows AMI used by the CI server is updated. The script
>> generating the base Windows AMI is tracked at
>> https://github.com/apache/incubator-mxnet-ci and Shiwen Hu recently
>> updated the
>> script to include the updated cmake version:
>> https://github.com/apache/incubator-mxnet-ci/pull/17
>>
>> It seems that this change needs to be deployed manually, which Pedro is
>> attempting to do. But if I understand correctly Pedro found the public
>> version
>> of the AMI generation script and some currently used script diverged:
>> http://ix.io/25WQ
>>
>>
>>
>> Questions:
>> 1) Is there a git history associated with the version of the script that
>> diverged?
>>
>> 2) According to
>>
>> https://github.com/apache/incubator-mxnet-ci/tree/master/services/jenkins-slave-creation-windows
>> the Windows Base AMI should be created automatically. Why is it not done
>> automatically anymore / why does the documentation claim it happens
>> automatically but it doesn't?
>>
>> On Mon, 2019-12-30 at 12:11 -0800, Pedro Larroy wrote:
>> > Hi
>> >
>> > I was looking at a request from Leonard for updating CMake on windows,
>> and
>> > I see that the post-install.py script which setups the windows
>> environment
>> > in CI has diverged significantly from the incubator-mxnet-ci and the
>> > private repository that is used to deploy to production CI.
>> >
>> > https://github.com/apache/incubator-mxnet/pull/17031
>> >
>> > I see quite some patch of differences, there's also different directory
>> > structure which Marco committed to incubator-mxnet-ci  and MKL seems to
>> be
>> > removed. My question why has this diverged so much, I was expecting to
>> > transplant just a single patch to update CMake.
>> >
>> >
>> > http://ix.io/25WQ
>> >
>> >
>> > Pedro.
>>
>


Re: windows ci, Cmake update, diverging scripts

2019-12-30 Thread Pedro Larroy
It's automated but broken, as the execution is in a failed state. I think we
will need an engineer to do repairs there.

It's using Systems Manager automation to produce these AMIs.

On Mon, Dec 30, 2019 at 1:44 PM Lausen, Leonard 
wrote:

> Some more background:
>
> Since a few days, CI downloads and installs a more recent cmake version in
> the
> Windows job based on
>
> https://github.com/leezu/mxnet/blob/230ceee5d9e0e02e58be69dad1c4ffdadbaa1bd9/ci/build_windows.py#L148-L153
>
> This ad-hoc download and installation is not ideal and in fact a workaround
> until the base Windows AMI used by the CI server is updated. The script
> generating the base Windows AMI is tracked at
> https://github.com/apache/incubator-mxnet-ci and Shiwen Hu recently
> updated the
> script to include the updated cmake version:
> https://github.com/apache/incubator-mxnet-ci/pull/17
>
> It seems that this change needs to be deployed manually, which Pedro is
> attempting to do. But if I understand correctly Pedro found the public
> version
> of the AMI generation script and some currently used script diverged:
> http://ix.io/25WQ
>
>
>
> Questions:
> 1) Is there a git history associated with the version of the script that
> diverged?
>
> 2) According to
>
> https://github.com/apache/incubator-mxnet-ci/tree/master/services/jenkins-slave-creation-windows
> the Windows Base AMI should be created automatically. Why is it not done
> automatically anymore / why does the documentation claim it happens
> automatically but it doesn't?
>
> On Mon, 2019-12-30 at 12:11 -0800, Pedro Larroy wrote:
> > Hi
> >
> > I was looking at a request from Leonard for updating CMake on windows,
> and
> > I see that the post-install.py script which setups the windows
> environment
> > in CI has diverged significantly from the incubator-mxnet-ci and the
> > private repository that is used to deploy to production CI.
> >
> > https://github.com/apache/incubator-mxnet/pull/17031
> >
> > I see quite some patch of differences, there's also different directory
> > structure which Marco committed to incubator-mxnet-ci  and MKL seems to
> be
> > removed. My question why has this diverged so much, I was expecting to
> > transplant just a single patch to update CMake.
> >
> >
> > http://ix.io/25WQ
> >
> >
> > Pedro.
>


windows ci, Cmake update, diverging scripts

2019-12-30 Thread Pedro Larroy
Hi

I was looking at a request from Leonard for updating CMake on Windows, and
I see that the post-install.py script which sets up the Windows environment
in CI has diverged significantly from incubator-mxnet-ci and the private
repository that is used to deploy to production CI.

https://github.com/apache/incubator-mxnet/pull/17031

I see quite a large patch of differences; there's also a different directory
structure which Marco committed to incubator-mxnet-ci, and MKL seems to have
been removed. My question is why this has diverged so much; I was expecting
to transplant just a single patch to update CMake.


http://ix.io/25WQ


Pedro.


Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-27 Thread Pedro Larroy
Test

On Fri, Dec 27, 2019 at 11:54 AM Pedro Larroy 
wrote:

> Thanks for the explanation. I'm not so concerned about complexity of
> dispatching. If I understood you correctly, the main benefit that you
> describe for the TVM project was not having to change the C API, but you
> still need to do type checking on both ends, or at least on the receiving
> end of the API, correct? I think we have discussed similar things in the
> past and we might have different views on strongly typed vs dynamically
> typed. A priori I prefer to see an API which can be evolved and changed; I
> find it more explicit and clearer than what I think you do with PackedFunc,
> which I have looked at briefly but not used extensively. If one is going to
> call into the C API using pybind, does it make sense to layer a C++ API on
> top of the C API for this?
>
> Also these microbenchmarks are nice, but we also need to consider the
> overhead in typical workloads and see if it's still significant.
>
> CFFI is also another alternative.
>
> I couldn't access your pointers like:
>
> https://github.com/tqchen/tvm/tree/pyffi
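
Regarding the microbenchmarks mentioned above, a minimal sketch of how such
a per-op overhead measurement is typically taken is shown below; the op
choice and the 1-element array are only there so that kernel time is
negligible and FFI overhead dominates, and the MXNet side is omitted so the
snippet runs standalone:

import timeit
import numpy as np

x = np.zeros(1)
n = 100_000
per_call_us = timeit.timeit(lambda: np.add(x, x), number=n) / n * 1e6
print(f"np.add overhead: {per_call_us:.2f} us per call")
# Replacing np.add with the corresponding MXNet imperative op in the same
# loop is what produces the 30-60us figures quoted in the RFC.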
>
> On Thu, Dec 26, 2019 at 2:00 PM Tianqi Chen 
> wrote:
>
>> @larroy indeed every solution has trade-offs, and these tradeoffs are
>> discussed in the above posts when we compare solutions, and they are backed
>> by benchmarks :) it would be great if you can also suggest potential
>> tradeoffs here.
>>
>> When you expose an API from typed language(c++) to a dynamic
>> language(python), you have to type erase it, given that the python
>> functions don't have the type, and you have to pass the information along.
>>
>> The only difference is where you do the type checking(that the python
>> type corresponds to the right c++ type), and translation(translating to the
>> c++ type).
>>
>> For example, in the case of pybind, the erasure is done implicitly when
>> you call the python function, then checking and translation happens when
>> you call into the c++ function.
>>
>> In the case of creating a C API for each feature and wrap things in the
>> python side, the type checking is done in the python side, and translation
>> as well.
>>
>> In the case of tvm ffi, the type translation is done in the python/cython
>> side,  while the type checking is done in the c++.
>>
>> To dive deeper into the tradeoffs of the PackedFunc calling convention: the
>> convention erases the type by having the type code stored with the
>> arguments. This brings the additional cost of passing arguments on the
>> heap, as opposed to registers. So it might not be suited for inline
>> functions that need to happen on the order of 1e-9s; however, for API
>> functions that need to run at the 1e-7 or even 1e-8 level, this convention
>> is pretty good.
>>
>> In terms of the calling cost, it really depends on whether the caller and
>> callee are strongly typed.
>> - If caller is strongly typed, then assigning type code is O(1)
>> - If caller is a dynamic type(like python) then we need to have a
>> dispatcher to dispatch and select the right type code
>> - If callee is strongly typed, then the cost of checking is O(1) by just
>> check the code to be the correct one
>> - If the callee is dynamic type, then a dispatching need to happen, which
>> have another level of hashtable lookup O(1)
>>
>> As we can see, the only place where dispatching is necessary is the
>> dynamic type handling case. Even in these cases, if there is a strong need
>> of specialization, we can directly force the type by running checking on
>> the caller, and pass in the right type code (the engineering burden is the
>> same as wrapping the C API). However, the benchmark suggests that the
>> dynamic dispatching cost is reasonable, and satisfies the API speed.
>>
>> Coming back to the tradeoff, the main tradeoff here is the engineering
>> burden of keeping an hourglass design (with a fixed set of APIs) vs
>> efficiency. While my post did not suggest that TVM's ffi is a silver
>> bullet, it does work pretty well for our use cases. Hope it helps.
>>
>>
>> --
>> You are receiving this because you are subscribed to this thread.
>> Reply to this email directly or view it on GitHub:
>>
>> https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569139957
>
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc0

2019-12-27 Thread Pedro Larroy
I agree with Sheng; I think it would be good to have the nice fixes that
Leonard has done in 1.6 and not delay them to later releases, since they are
beneficial to users and developers. Thanks Leonard for helping fix these
long-standing issues.

On Fri, Dec 27, 2019 at 11:03 AM Lin Yuan  wrote:

> No, I just wanted to call it out because the title of the issue says
> "Failed
> OpenMP assertion when loading MXNet compiled with DEBUG=1
> <https://github.com/apache/incubator-mxnet/issues/10856#>".
> If this is considered a release blocker, I think we should backport it to
> 1.6.
>
> Thanks,
> Lin
>
> On Fri, Dec 27, 2019 at 10:47 AM Sheng Zha  wrote:
>
> > Reading these issues it’s pretty clear to me that these are fixes for
> > broken builds. I think we do consider broken builds to be release
> blockers.
> >
> > Lin, am I missing something on which you base your suggestion for
> delaying
> > these changes?
> >
> > -sz
> >
> > > On Dec 27, 2019, at 10:30 AM, Lin Yuan  wrote:
> > >
> > > Are these release blocker? It's very risky to make such last-minute
> big
> > > change after code freeze.
> > >
> > > Can we do this in the next release?
> > >
> > > Lin
> > >
> > >> On Fri, Dec 27, 2019 at 7:37 AM Lausen, Leonard
> > 
> > >> wrote:
> > >>
> > >> In case of backporting #17012, also
> > >> https://github.com/apache/incubator-mxnet/pull/17098 must be
> > backported.
> > >> The
> > >> updated OpenMP added a new target which is not used by MXNet but
> breaks
> > the
> > >> build on some systems with nvptx. #17098 disables building this unused
> > and
> > >> broken feature.
> > >>
> > >>> On Thu, 2019-12-26 at 12:55 -0800, Pedro Larroy wrote:
> > >>> https://github.com/apache/incubator-mxnet/pull/17012  should be also
> > >> ported
> > >>> to the release branch.
> > >>>
> > >>> On Fri, Dec 20, 2019 at 1:39 PM Przemysław Trędak <
> ptre...@apache.org>
> > >>> wrote:
> > >>>
> > >>>> That issue is now fixed in master, I am in the process of
> > >> cherry-picking
> > >>>> the fix to v1.6.x branch. I will prepare the RC1 once that is ready.
> > >>>>
> > >>>> Thanks
> > >>>> Przemek
> > >>>>
> > >>>> On 2019/12/20 20:07:36, Lin Yuan  wrote:
> > >>>>> What's the next step for the release? Should we continue testing
> > >> this and
> > >>>>> vote or wait until the
> > >>>>> https://github.com/apache/incubator-mxnet/issues/17105 is fixed?
> > >>>>>
> > >>>>> Thanks!
> > >>>>>
> > >>>>> Lin
> > >>>>>
> > >>>>> On Wed, Dec 18, 2019 at 12:55 AM Lausen, Leonard
> > >>>> 
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Thanks Przemysław for managing this release and everyone who
> > >>>> contributed
> > >>>>>> to it.
> > >>>>>>
> > >>>>>> Unfortunately Zechen Wang just discovered another issue with GPU
> > >>>> Pointwise
> > >>>>>> Fusion: https://github.com/apache/incubator-mxnet/issues/17105
> > >>>>>>
> > >>>>>> Thus, -1.
> > >>>>>>
> > >>>>>> Unfortunately, as the nightly release pipeline was broken until
> > >>>> recently
> > >>>>>> (and
> > >>>>>> still isn't re-set up completely yet), the issue hasn't been
> > >> discovered
> > >>>>>> earlier.
> > >>>>>>
> > >>>>>> Przemysław may have a quick fix for the issue. Another option
> > >> would be
> > >>>> to
> > >>>>>> release 1.6 with MXNET_USE_FUSION default to 0.
> > >>>>>>
> > >>>>>> Best regards
> > >>>>>> Leonard
> > >>>>>>
> > >>>>>> On Wed, 2019-12-18 at 05:30 +, Chen, Ciyong wrote:
> > >>>>>>> Appreciate Tredak to push out voting for 1.6 release.
> > >>>>>>>
> > >>>>>>> +1 

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-26 Thread Pedro Larroy
Pybind is nice; I used Boost.Python many years ago, which I think it is
based on. The problem with this is the hourglass C bindings: you have to go
from Python to C++ / pybind11, down to C and then to the engine, which seems
like a lot of boilerplate.

On Mon, Dec 16, 2019 at 10:02 PM reminisce  wrote:

> MXNet imperative operator invocation overhead is as large as 30-60us,
> which is significant compared to the official NumPy operators with ~600ns
> overhead. This has negatively impacted the performance of applying MXNet to
> the models where many operators' kernel runtime duration is short,
> especially in the area of classic machine learning. We plan to address the
> problem in two steps:
>
>1.
>
>Short term: Use pybind11 to replace Python op API and ctypes/c api.
>Preliminary experiments show that the pure Python-C++ turnaround time by
>using Pybind is between 400-600ns, while the current Python op API using
>ctypes/c api costs more than 10us. We believe with the correct
>implementation, we can reduce the op invocation overhead to 2us including
>the time on FFI and engine.
>2.
>
>Long term: Adopt Python's C extension interface. NumPy did this by
>developing its own C API. This provides considerably less overhead compared
>to other solutions. However, it would cost much more engineering efforts by
>integrating this with our existing operator workflow in C++.
>
> @hzfan  @hgt312 
>
> —
> You are receiving this because you are subscribed to this thread.
> Reply to this email directly, view it on GitHub
> ,
> or unsubscribe
> 
> .
>


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569135990

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-26 Thread Pedro Larroy
What's the point of having an API if you type-erase it? Then you might as
well have a single-function API with a type-erased callback name to select
the function to call. In the end you move the burden away from the API to
the callers, and inside the API to the dispatchers. If we are going down
this route of uber-clever template tricks to generate code, I think it's
better to just put proper code generation in place for maintainability.
Could you provide a bit more detail about the tradeoffs? Everything has
tradeoffs; I don't believe in any solution that is sold as a panacea, and
there's no silver bullet.
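
To make the type-erasure discussion in this thread more concrete, here is a
small conceptual sketch in Python of a packed-function style call in which
type codes travel with the arguments; it is an illustration only, not TVM's
or MXNet's actual implementation:

# Type codes carried alongside the erased values.
INT, FLOAT, STR = 0, 1, 2
_CODE_OF = {int: INT, float: FLOAT, str: STR}

def pack_args(*args):
    # Caller side: erase static types, keep (value, type_code) pairs.
    return [(a, _CODE_OF[type(a)]) for a in args]

def packed_add(args):
    # Callee side: check the codes, then do the actual work.
    (x, cx), (y, cy) = args
    assert cx in (INT, FLOAT) and cy in (INT, FLOAT), "type mismatch"
    return x + y

print(packed_add(pack_args(1, 2.5)))  # 3.5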

On Thu, Dec 19, 2019 at 10:21 AM Tianqi Chen 
wrote:

> I have another candidate that would highly recommend: adopt TVM's FFI
> convention.
>
> The historical problem of MXNet FFI was the blowing amount of the C API
> bindings as we add new features. This creates a huge amount of maintenance
> burden.
>
> The real problem was not really about which FFI system to adopt(cython and
> pybind are fine in that end, except for the cost of compilation), but more
> of the cost to maintain the FFI. MXNet used to have a fast cython binding,
> but that was abandoned because we keep add new APIs we cannot keep up both
> ctypes and cython.
>
> When developing TVM we learnt from the lesson and restrict the API to a
> limited set of runtime APIs that does not change, and have a stable cython,
> ctypes binding for them. The runtime support a type-erased
> function(PackedFunc), which can be efficiently called from any of the
> frontend language, and all the APIs are exposed through the PackedFunc. On
> the python side an additional wrapping is created for better documentation
> and call into the PackedFunc. See more in
> https://docs.tvm.ai/dev/runtime.html The system works great for over a
> few years now.
>
> Of course I understand there has been legacy issues in MXNet that is why I
> did not bring this proposal up. But given this is a proposal for 2.0, I
> would encourage everyone to give a serious thought about this possibility.
>
> —
> You are receiving this because you are subscribed to this thread.
> Reply to this email directly, view it on GitHub
> ,
> or unsubscribe
> 
> .
>


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569135511

Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc0

2019-12-26 Thread Pedro Larroy
https://github.com/apache/incubator-mxnet/pull/17012 should also be ported
to the release branch.

On Fri, Dec 20, 2019 at 1:39 PM Przemysław Trędak 
wrote:

> That issue is now fixed in master, I am in the process of cherry-picking
> the fix to v1.6.x branch. I will prepare the RC1 once that is ready.
>
> Thanks
> Przemek
>
> On 2019/12/20 20:07:36, Lin Yuan  wrote:
> > What's the next step for the release? Should we continue testing this and
> > vote or wait until the
> > https://github.com/apache/incubator-mxnet/issues/17105 is fixed?
> >
> > Thanks!
> >
> > Lin
> >
> > On Wed, Dec 18, 2019 at 12:55 AM Lausen, Leonard
> 
> > wrote:
> >
> > > Thanks Przemysław for managing this release and everyone who
> contributed
> > > to it.
> > >
> > > Unfortunately Zechen Wang just discovered another issue with GPU
> Pointwise
> > > Fusion: https://github.com/apache/incubator-mxnet/issues/17105
> > >
> > > Thus, -1.
> > >
> > > Unfortunately, as the nightly release pipeline was broken until
> recently
> > > (and
> > > still isn't re-set up completely yet), the issue hasn't been discovered
> > > earlier.
> > >
> > > Przemysław may have a quick fix for the issue. Another option would be
> to
> > > release 1.6 with MXNET_USE_FUSION default to 0.
> > >
> > > Best regards
> > > Leonard
> > >
> > > On Wed, 2019-12-18 at 05:30 +, Chen, Ciyong wrote:
> > > > Appreciate Tredak to push out voting for 1.6 release.
> > > >
> > > > +1 as we've done lots of tests with expected performance in many
> > > different
> > > > scenarios including both single-node and multi-node (horovod based),
> > > both FP32
> > > > and INT8 precision on many topologies.
> > > >
> > > > -Ciyong
> > > >
> > > > -Original Message-
> > > > From: Zhao, Patric 
> > > > Sent: Tuesday, December 17, 2019 8:51 AM
> > > > To: dev@mxnet.incubator.apache.org; d...@mxnet.apache.org
> > > > Subject: RE: [VOTE] Release Apache MXNet (incubating) version
> 1.6.0.rc0
> > > >
> > > > Thanks, Tredak, I will add some words for the new feature in the
> release
> > > note.
> > > >
> > > > +1 for voting because we have ran multiple time of tests in local and
> > > got the
> > > > expected performance boost.
> > > >
> > > > --Patric
> > > >
> > > > > -Original Message-
> > > > > From: Przemysław Trędak 
> > > > > Sent: Tuesday, December 17, 2019 4:49 AM
> > > > > To: d...@mxnet.apache.org
> > > > > Subject: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc0
> > > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > This is the vote to release Apache MXNet (incubating) version
> 1.6.0.
> > > > > Voting starts now and will close on Friday, 20th December 2019
> > > 23:59:59 PST.
> > > > >
> > > > > Link to release notes:
> > > > >
> https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
> > > > >
> > > > > Link to release candidate:
> > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc0
> > > > >
> > > > > Link to source and signatures on apache dist server:
> > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc0/
> > > > >
> > > > > Please remember to TEST first before voting accordingly:
> > > > > +1 = approve
> > > > > +0 = no opinion
> > > > > -1 = disapprove (provide reason)
> > > > >
> > > > > Additional notes:
> > > > >  - There was an issue[1] raised that 1.6.0.rc0 does not build with
> > > > > clang on FreeBSD - I decided to not block the voting for this and
> > > > > instead let the Community decide whether this is a blocker for the
> > > release.
> > > > >  - Patric Zhao and Tao Lv - could you help preparing a paragraph on
> > > > > MKLDNN
> > > > > 1.0 update in the New features section in the release notes?
> > > > >
> > > > > [1] https://github.com/apache/incubator-mxnet/issues/17076
> > > > >
> > > > > Best regards,
> > > > > Przemyslaw Tredak
> > >
> >
>


Re: [apache/incubator-mxnet] [RFC] Custom Operator Part 2 (#17006)

2019-12-26 Thread Pedro Larroy
@wkcn could you explain your suggestion? Calling gemm back into the framework,
which gets dispatched to GPU or CPU?

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17006#issuecomment-569131388

[discuss] add lgtm.com to mxnet

2019-12-18 Thread Pedro Larroy
Shall we add lgtm to mxnet?  https://lgtm.com/


The essence of deep learning, autodiff and higher order gradients

2019-12-18 Thread Pedro Larroy
Hi

I published the slides I presented at the last MXNet meetup on automatic
differentiation and higher-order gradients. They may help you get more
insight into some PRs which have been sent, and into future directions in
this area for 2.0. I also compare implementations across major deep learning
frameworks. Let me know if you have any questions or feedback, and please
like or share my post.

https://www.linkedin.com/posts/pedrolarroy_the-essence-of-deep-learning-automatic-differentiation-activity-6613142805923536896-PuI5/

Pedro.


Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
Great investigation, thank you. I have to agree with your analysis, and
thank you for helping resolve this long-standing issue.

This will not repair the damage done to the community by losing 3-4 valuable
contributors. Introducing a library that causes bugs, then blocking changes
and locking GitHub issues which attempt to remove or work around the issues,
in addition to making rude comments and worse things that are better left
out, is still not acceptable and calls for an apology from Chris.

P.




On Sunday, December 8, 2019, Lausen, Leonard 
wrote:
> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that
the
> assertion failure is not due to some interaction between gomp and llvm
omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
affects an
> important use-case, I believe we need to remove 3rdparty/openmp from the
CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
setup
> that according to my understanding is used to build the nightly releases
used by
> the majority of developers. Ie. most users actually don't use the CMake
build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
don't
> provide any evidence for why the methodology/conclusion in #14979 is
flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
lacks
> technical justification and in any case was motivated by the assertion
issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> Stop disseminating false information:
>>
>> https://github.com/apache/incubator-mxnet/issues/14979
>>
>>
>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
wrote:
>>
>> > -1
>> >
>> > mkldnn removed omp5 for licencing issues
>> > no bugs have actually been traced to the use of llvm openmp. only an
assert
>> > caused by an actual bug in mxnet code. there are suitable workarounds.
>> >
>> > over time llvm omp has simply been used as a “catch all” for random
>> > problems that aren’t related at all (such as getenv race condition in
an
>> > atfork call that isn’t even part of an omp parallel region).
>> >
>> > proposal is now and has always been roughly equivalent to the idea of
>> > “comment out an assert rather than fix the bug it’s reporting”.
>> >
>> > Up until very recently, Makefile version of mxnet used libomp5 for
YEARS
>> > and not libgomp, with no issue reported (omp not built in debug mode),
so
>> > the equivalent configuration from CMake mysteriously causing myriads if
>> > problems has questionable merit and smells more like a hubris
situation.
>> >
>> > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> >
>> > if the assert problem is really a problem, the bug being reported
would be
>> > prioritized and fixed. it should be fixed regardless. all the time
spent by
>> > some CI people trying to remove this could have simply fixed the
actual bug
>> > in a small fraction of the time.
>> >
>> >
>> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard

>> > wrote:
>> >
>> > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > "authorative"
>> > >

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
Hi Leonard.

Are you saying that you have updated this library and the problems described
in the related tickets are no longer present?

P.

On Sunday, December 8, 2019, Lausen, Leonard 
wrote:
> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that
the
> assertion failure is not due to some interaction between gomp and llvm
omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
affects an
> important use-case, I believe we need to remove 3rdparty/openmp from the
CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
setup
> that according to my understanding is used to build the nightly releases
used by
> the majority of developers. Ie. most users actually don't use the CMake
build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
don't
> provide any evidence for why the methodology/conclusion in #14979 is
flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
lacks
> technical justification and in any case was motivated by the assertion
issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> Stop disseminating false information:
>>
>> https://github.com/apache/incubator-mxnet/issues/14979
>>
>>
>> On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
wrote:
>>
>> > -1
>> >
>> > mkldnn removed omp5 for licencing issues
>> > no bugs have actually been traced to the use of llvm openmp. only an
assert
>> > caused by an actual bug in mxnet code. there are suitable workarounds.
>> >
>> > over time llvm omp has simply been used as a “catch all” for random
>> > problems that aren’t related at all (such as getenv race condition in
an
>> > atfork call that isn’t even part of an omp parallel region).
>> >
>> > proposal is now and has always been roughly equivalent to the idea of
>> > “comment out an assert rather than fix the bug it’s reporting”.
>> >
>> > Up until very recently, Makefile version of mxnet used libomp5 for
YEARS
>> > and not libgomp, with no issue reported (omp not built in debug mode),
so
>> > the equivalent configuration from CMake mysteriously causing myriads if
>> > problems has questionable merit and smells more like a hubris
situation.
>> >
>> > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> >
>> > if the assert problem is really a problem, the bug being reported
would be
>> > prioritized and fixed. it should be fixed regardless. all the time
spent by
>> > some CI people trying to remove this could have simply fixed the
actual bug
>> > in a small fraction of the time.
>> >
>> >
>> > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard

>> > wrote:
>> >
>> > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > "authorative"
>> > > source about the issue of compilation with OpenMP and the OpenMP
runtime
>> > > library
>> > > related issues. Thus I suggest we follow the recommendation of Intel
>> > > MKLDNN team
>> > > within the MXNet project.
>> > >
>> > > Looking through the Intel MKLDNN documentation, I find [1]:
>&

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Pedro Larroy
e CMake
>> build
>> with 3rdparty/openmp. You can consider rescinding your veto on removing
>> 3rdparty/openmp after reading through the evidence in that issue. If you
>> don't
>> provide any evidence for why the methodology/conclusion in #14979 is
>> flawed, I
>> will assume your previous veto is void based on Apache Voting rule as it
>> lacks
>> technical justification and in any case was motivated by the assertion
>> issue,
>> which I agree with you, is likely not due to gomp / omp interaction.
>>
>> Thank you
>> Leonard
>>
>>
>> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
>> > Stop disseminating false information:
>> >
>> > https://github.com/apache/incubator-mxnet/issues/14979
>> >
>> >
>> > On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier 
>> wrote:
>> >
>> > > -1
>> > >
>> > > mkldnn removed omp5 for licencing issues
>> > > no bugs have actually been traced to the use of llvm openmp. only an
>> assert
>> > > caused by an actual bug in mxnet code. there are suitable
workarounds.
>> > >
>> > > over time llvm omp has simply been used as a “catch all” for random
>> > > problems that aren’t related at all (such as getenv race condition in
>> an
>> > > atfork call that isn’t even part of an omp parallel region).
>> > >
>> > > proposal is now and has always been roughly equivalent to the idea of
>> > > “comment out an assert rather than fix the bug it’s reporting”.
>> > >
>> > > Up until very recently, Makefile version of mxnet used libomp5 for
>> YEARS
>> > > and not libgomp, with no issue reported (omp not built in debug
mode),
>> so
>> > > the equivalent configuration from CMake mysteriously causing myriads
if
>> > > problems has questionable merit and smells more like a hubris
>> situation.
>> > >
>> > > I use tensorflow as well and it links to libomp5 rather than libgomp.
>> > >
>> > > if the assert problem is really a problem, the bug being reported
>> would be
>> > > prioritized and fixed. it should be fixed regardless. all the time
>> spent by
>> > > some CI people trying to remove this could have simply fixed the
>> actual bug
>> > > in a small fraction of the time.
>> > >
>> > >
>> > > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard
>> 
>> > > wrote:
>> > >
>> > > > I think it's reasonable to assume that the Intel MKLDNN team is an
>> > > > "authorative"
>> > > > source about the issue of compilation with OpenMP and the OpenMP
>> runtime
>> > > > library
>> > > > related issues. Thus I suggest we follow the recommendation of
Intel
>> > > > MKLDNN team
>> > > > within the MXNet project.
>> > > >
>> > > > Looking through the Intel MKLDNN documentation, I find [1]:
>> > > >
>> > > > > DNNL uses OpenMP runtime library provided by the compiler.
>> > > >
>> > > > as well as
>> > > >
>> > > > > it's important to ensure that only one OpenMP runtime is used
>> > > throughout
>> > > > the
>> > > > > application. Having more than one OpenMP runtime linked to an
>> > > executable
>> > > > may
>> > > > > lead to undefined behavior including incorrect results or
crashes.
>> > > >
>> > > > To keep our project maintainable and error free, I thus suggest we
>> follow
>> > > > DNNL
>> > > > and use the OpenMP runtime library provided by the compiler.
>> > > > We have limited ressources and finding the root cause for any bugs
>> > > > resulting
>> > > > from linking multiple OpenMP libraries as currently done is, in my
>> > > > opinion. not
>> > > > a good use of time. We know it's due to undefined behavior and we
>> know
>> > > > it's best
>> > > > practice to use OpenMP runtime library provided by the compiler. So
>> let's
>> > > > just
>> > > > do that.
>> > > >
>> > > > I think given that MKL-DNN has also adopted the "OpenMP runtime
>> library
>> > > > provided
>> > > > by the compiler" approach, this issue is not contentious 

Re: Please remove conflicting Open MP version from CMake builds

2019-12-07 Thread Pedro Larroy
Stop disseminating false information:

https://github.com/apache/incubator-mxnet/issues/14979
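
For anyone trying to reproduce the linked issue, a small diagnostic sketch
follows; it assumes Linux and an installed mxnet wheel, and simply lists
which OpenMP runtimes end up mapped into one process, i.e. the "more than
one OpenMP runtime" situation discussed further down in this thread:

import re

import mxnet  # noqa: F401  (assumes an mxnet wheel is installed)

loaded = set()
with open("/proc/self/maps") as maps:  # Linux only
    for line in maps:
        m = re.search(r"(libgomp|libomp|libiomp)[^ ]*\.so[^ ]*", line)
        if m:
            loaded.add(m.group(0))

print("OpenMP runtimes loaded:", sorted(loaded) or "none found")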


On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier  wrote:

> -1
>
> mkldnn removed omp5 for licensing issues
> no bugs have actually been traced to the use of llvm openmp. only an assert
> caused by an actual bug in mxnet code. there are suitable workarounds.
>
> over time llvm omp has simply been used as a “catch all” for random
> problems that aren’t related at all (such as getenv race condition in an
> atfork call that isn’t even part of an omp parallel region).
>
> proposal is now and has always been roughly equivalent to the idea of
> “comment out an assert rather than fix the bug it’s reporting”.
>
> Up until very recently, Makefile version of mxnet used libomp5 for YEARS
> and not libgomp, with no issue reported (omp not built in debug mode), so
> the equivalent configuration from CMake mysteriously causing myriads of
> problems has questionable merit and smells more like a hubris situation.
>
> I use tensorflow as well and it links to libomp5 rather than libgomp.
>
> if the assert problem is really a problem, the bug being reported would be
> prioritized and fixed. it should be fixed regardless. all the time spent by
> some CI people trying to remove this could have simply fixed the actual bug
> in a small fraction of the time.
>
>
> On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard 
> wrote:
>
> > I think it's reasonable to assume that the Intel MKLDNN team is an
> > "authorative"
> > source about the issue of compilation with OpenMP and the OpenMP runtime
> > library
> > related issues. Thus I suggest we follow the recommendation of Intel
> > MKLDNN team
> > within the MXNet project.
> >
> > Looking through the Intel MKLDNN documentation, I find [1]:
> >
> > > DNNL uses OpenMP runtime library provided by the compiler.
> >
> > as well as
> >
> > > it's important to ensure that only one OpenMP runtime is used
> throughout
> > the
> > > application. Having more than one OpenMP runtime linked to an
> executable
> > may
> > > lead to undefined behavior including incorrect results or crashes.
> >
> > To keep our project maintainable and error free, I thus suggest we follow
> > DNNL
> > and use the OpenMP runtime library provided by the compiler.
> > We have limited resources and finding the root cause for any bugs
> > resulting
> > from linking multiple OpenMP libraries as currently done is, in my
> > opinion, not
> > a good use of time. We know it's due to undefined behavior and we know
> > it's best
> > practice to use OpenMP runtime library provided by the compiler. So let's
> > just
> > do that.
> >
> > I think given that MKL-DNN has also adopted the "OpenMP runtime library
> > provided
> > by the compiler" approach, this issue is not contentious anymore and
> > qualifies
> > for lazy consensus.
> >
> > Thus if there is no objection within 72 hours (lazy consensus), let's
> drop
> > bundled LLVM OpenMP from master [2]. If we find any issues due to
> > dropping the
> > bundled LLVM OpenMP, we can always add it back prior to the next release.
> >
> > Best regards
> > Leonard
> >
> > [1]:
> >
> >
> https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > (This is the updated reference from Anton's previous comment, based on
> the
> > changes in MKLDNN done in the meantime
> >
> https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> > )
> > [2]: Alike https://github.com/apache/incubator-mxnet/pull/12160
> >
> >
> > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > I will try to stay on the sidelines for now since previous
> conversations
> > > about OMP have not been productive here and I have spent way too much
> > time
> > > on this already, I'm not the first one giving up on trying to help with
> > > this topic.
> > >
> > > I would be glad if you guys can work together and find a solution. I
> will
> > > just put my understanding of the big picture hoping that it helps move
> it
> > > forward.
> > >
> > >
> > > Recently the intel omp library which seemed to have the best
> performance
> > of
> > > the 3 was removed from MKL.
> > >
> > > - There's 3 libraries in play, GNU Omp which is shipped with gcc
> (gomp),
> > > LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL, w

Re: Please remove conflicting Open MP version from CMake builds

2019-12-06 Thread Pedro Larroy
I will try to stay on the sidelines for now since previous conversations
about OMP have not been productive here and I have spent way too much time
on this already; I'm not the first one to give up on trying to help with
this topic.

I would be glad if you guys can work together and find a solution. I will
just put my understanding of the big picture hoping that it helps move it
forward.


Recently the Intel OMP library, which seemed to have the best performance of
the three, was removed from MKL.

- There are 3 libraries in play: GNU OMP, which is shipped with gcc (gomp);
LLVM OpenMP in 3rdparty (llvm-omp); and Intel OMP when using MKL, which was
recently removed (iomp)

- IOMP seems to have the best performance; there are stability issues
producing occasional crashes, but the impact seems relatively small for users
and developers. In general, linking against a different OMP version than the
one shipped with the compiler is known to cause stability issues, but it's
done anyway.

- LLVM-OMP is used when building with CMake, but not in the PIP releases or
when building with Make. It has stability issues: it hangs during test
execution in debug mode and produces many assertion failures there. It might
offer small performance gains, but there is no clear-cut data showing a
significant improvement.

- GOMP is the version shipped with GCC and used in the PIP wheels without
MKL; it has no stability problems. (A quick way to check which runtime a
given build actually links is sketched below.)

As a ballpark, IOMP might give 10% performance improvement in some cases.
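
A minimal check, assuming a CMake build tree at ./build: list which of the
runtimes above libmxnet.so actually resolves at load time, and warn when more
than one shows up (mixing runtimes is the undefined-behavior case the DNNL
docs warn about).

#!/usr/bin/env bash
# Sketch only; the library path is an assumption, adjust it to your build tree.
lib="${1:-build/libmxnet.so}"
runtimes=$(ldd "$lib" | grep -iE 'lib(gomp|iomp5|omp)[0-9]*\.so' | awk '{print $1}' | sort -u)
count=$(printf '%s\n' "$runtimes" | sed '/^$/d' | wc -l)
printf 'OpenMP runtimes linked into %s:\n%s\n' "$lib" "${runtimes:-<none>}"
if [ "$count" -gt 1 ]; then
    echo "WARNING: $count OpenMP runtimes linked; mixing them is undefined behavior." >&2
fi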

We need to document well how users should tune and configure MXNet when
using OMP.
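
As a starting point, something like the following (illustrative only; these
are standard OpenMP / GNU / Intel-LLVM runtime variables rather than anything
MXNet-specific, and good values are workload- and machine-dependent):

export OMP_NUM_THREADS=$(nproc)                    # size the OpenMP worker pool
export OMP_PROC_BIND=true                          # pin threads (honored by gomp and llvm omp)
export KMP_AFFINITY=granularity=fine,compact,1,0   # iomp / llvm-omp specific pinning
# then run the workload, e.g. the infer_imagenet.py command quoted further down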

As a developer, the safest bet is to use GOMP so you can debug and develop
without issues. As a user doing CPU inference / training you want to run MKL,
so it depends on how the Intel folks want to do things. My preference as an
engineer is always stability > speed.

Related tickets:

https://github.com/apache/incubator-mxnet/issues/16891

https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931


https://github.com/apache/incubator-mxnet/issues/11417

https://github.com/apache/incubator-mxnet/issues/15690



On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard 
wrote:

> Is this related to https://github.com/apache/incubator-mxnet/issues/10856?
>
> I unlocked that Github issue based on the Apache Code of Conduct
> https://www.apache.org/foundation/policies/conduct#specific-guidelines
>
>
> On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd
> > build/libmxnet.so| grep -i openmp
> > libomp.so =>
> > /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > (0x7fde0991d000)
> > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python
> > ~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec
> > --batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid
> > --model resnet50_v2 --use-pretrained --kvstore local --log-interval 1
> > --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > ~/data/val-passthrough.idx
> > INFO:root:Namespace(batch_norm=False, batch_size=256,
> > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32',
> > kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs',
> > lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
> > lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
> > num_epochs=3, num_gpus=0, num_workers=40,
> > rec_val='/home/piotr/data/val-passthrough.rec',
> > rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
> > save_frequency=0, top_k=0, use_pretrained=True, use_rec=True,
> use_se=False,
> > warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > [10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
> > /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
> > INFO:root:Batch [0]
> > INFO:root:Top 1 accuracy: 0
> > INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
> > INFO:root:Batch [1]
> > INFO:root:Top 1 accuracy: 0
> > INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
> > INFO:root:Batch [2]
> > INFO:root:Top 1 accuracy: 0
> > INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363
> >
> >
> >
> >
> >
> >
> >
> > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp
> (upstream_master)+$
> > git st
> > On branch upstream_master
> > Your branch is up to date with 'origin/upstream_master'.
> >
> > Changes not staged for commit:
> >   (use "git add/rm ..." to update what will be committed)
> >   (use "git checkout -- ..." to discard changes in working
> directory)
> >
> > delete

Re: Can upgrade windows CI cmake?

2019-12-06 Thread Pedro Larroy
The CMake shipped with Ubuntu has issues when compiling with CUDA on GPU
instances. I wouldn't recommend anything older than 3.12 for Linux GPU builds:

https://github.com/apache/incubator-mxnet/blob/master/ci/docker/install/ubuntu_core.sh#L63

I don't know which CMake version the Windows CI uses, but it would make sense
to require a newer version there as well.
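
A rough sketch of such a check on a Linux build host, assuming the 3.12 floor
above (the pip fallback mirrors the upgrade path mentioned further down this
thread):

required=3.12
current=$(cmake --version | head -n1 | awk '{print $3}')
if [ "$(printf '%s\n' "$required" "$current" | sort -V | head -n1)" != "$required" ]; then
    echo "CMake $current is older than $required; installing a newer user-local copy via pip..."
    pip install --user --upgrade "cmake>=3.12"
fi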

On Thu, Dec 5, 2019 at 7:26 PM Lausen, Leonard 
wrote:

> Currently we declare cmake_minimum_required(VERSION 3.0.2)
>
> I'm in favor of updating our CMake requirement. The main question may be
> what
> new version to pick as minimum requirement.
>
> In general, there is the guideline
>
> > You really should at least use a version of CMake that came out after
> your
> > compiler, since it needs to know compiler flags, etc, for that version.
> And,
> > since CMake will dumb itself down to the minimum required version in your
> > CMake file, installing a new CMake, even system wide, is pretty safe. You
> > should at least install it locally. It's easy (1-2 lines in many cases),
> and
> > you'll find that 5 minutes of work will save you hundreds of lines and
> hours
> > of CMakeLists.txt writing, and will be much easier to maintain in the
> long
> > run.
> https://cliutils.gitlab.io/modern-cmake/
>
> https://cliutils.gitlab.io/modern-cmake/chapters/intro/newcmake.html
> gives a
> short overview of all the improvements made to CMake over the past 6 years.
>
> It's easy for users to upgrade their cmake version with pip:
>   pip install --upgrade --user cmake
> Thus it wouldn't be overly problematic to rely on a very recent version of
> cmake, if indeed it's required.
>
> Nevertheless, if an earlier version fixes the problems, let's rather pick
> that
> one. Did you confirm which version is required to fix the problem?
>
> For now you could try if the CMake version shipped in the oldest supported
> Ubuntu LTS release (Ubuntu 16.04) is fixing your problem (CMake 3.5)? If
> not,
> please test if CMake version shipped in Ubuntu 18.04 (CMake 3.10) fixes
> your
> issue.
>
> Thanks
> Leonard
>
> On Fri, 2019-12-06 at 08:45 +0800, shiwen hu wrote:
> > I am sending a PR, https://github.com/apache/incubator-mxnet/pull/16980, to
> > change the Windows build system, but the CMake version on CI seems to have a
> > bug and the build can't compile. Can we upgrade to 3.16.0?
>


Re: CI Update

2019-12-06 Thread Pedro Larroy
Hi all. CI is back to normal after Jake's commit:
https://github.com/apache/incubator-mxnet/pull/16968. Please merge from
master. If someone could look into the TVM build issues described above,
that would be great.

On Tue, Dec 3, 2019 at 11:11 AM Pedro Larroy 
wrote:

> Some PRs were experiencing build timeouts in the past. I have diagnosed
> this to be a saturation of the EFS volume holding the compilation cache.
> Once CI is back online this problem is very likely to be solved and you
> should not see any more build timeout issues.
>
> On Tue, Dec 3, 2019 at 10:18 AM Pedro Larroy 
> wrote:
>
>> Also please take note that there's a stage building TVM which is
>> executing compilation serially and takes a lot of time which impacts CI
>> turnaround time:
>>
>> https://github.com/apache/incubator-mxnet/issues/16962
>>
>> Pedro
>>
>> On Tue, Dec 3, 2019 at 9:49 AM Pedro Larroy 
>> wrote:
>>
>>> Hi MXNet community. We are in the process of updating the base AMIs for
>>> CI with an updated CUDA driver to fix the CI blockage.
>>>
>>> We would need help from the community to diagnose some of the build
>>> errors which don't seem related to the infrastructure.
>>>
>>> I have observed this build failure with tvm when not installing the cuda
>>> driver in the container:
>>>
>>>
>>> https://pastebin.com/bQA0W2U4
>>>
>>> centos gpu builds and tests seem to run with the updated AMI and changes
>>> to the container.
>>>
>>>
>>> Thanks.
>>>
>>>
>>> On Mon, Dec 2, 2019 at 12:11 PM Pedro Larroy <
>>> pedro.larroy.li...@gmail.com> wrote:
>>>
>>>> Small update about CI, which is blocked.
>>>>
>>>> Seems there's a nvidia driver compatibility problem in the base AMI
>>>> that is running in GPU instances and the nvidia docker images that we use
>>>> for building and testing.
>>>>
>>>> We are working on providing a fix by updating the base images as
>>>> doesn't seem to be easy to fix by just changing the container.
>>>>
>>>> Thanks.
>>>>
>>>> Pedro.
>>>>
>>>


Re: CI Update

2019-12-03 Thread Pedro Larroy
Some PRs were experiencing build timeouts in the past. I have diagnosed
this to be a saturation of the EFS volume holding the compilation cache.
Once CI is back online this problem is very likely to be solved and you
should not see any more build timeout issues.

On Tue, Dec 3, 2019 at 10:18 AM Pedro Larroy 
wrote:

> Also please take note that there's a stage building TVM which is executing
> compilation serially and takes a lot of time which impacts CI turnaround
> time:
>
> https://github.com/apache/incubator-mxnet/issues/16962
>
> Pedro
>
> On Tue, Dec 3, 2019 at 9:49 AM Pedro Larroy 
> wrote:
>
>> Hi MXNet community. We are in the process of updating the base AMIs for
>> CI with an updated CUDA driver to fix the CI blockage.
>>
>> We would need help from the community to diagnose some of the build
>> errors which don't seem related to the infrastructure.
>>
>> I have observed this build failure with tvm when not installing the cuda
>> driver in the container:
>>
>>
>> https://pastebin.com/bQA0W2U4
>>
>> centos gpu builds and tests seem to run with the updated AMI and changes
>> to the container.
>>
>>
>> Thanks.
>>
>>
>> On Mon, Dec 2, 2019 at 12:11 PM Pedro Larroy <
>> pedro.larroy.li...@gmail.com> wrote:
>>
>>> Small update about CI, which is blocked.
>>>
>>> Seems there's a nvidia driver compatibility problem in the base AMI that
>>> is running in GPU instances and the nvidia docker images that we use for
>>> building and testing.
>>>
>>> We are working on providing a fix by updating the base images as doesn't
>>> seem to be easy to fix by just changing the container.
>>>
>>> Thanks.
>>>
>>> Pedro.
>>>
>>


Re: CI Update

2019-12-03 Thread Pedro Larroy
Also please take note that there's a stage building TVM which is executing
compilation serially and takes a lot of time which impacts CI turnaround
time:

https://github.com/apache/incubator-mxnet/issues/16962

Pedro

On Tue, Dec 3, 2019 at 9:49 AM Pedro Larroy 
wrote:

> Hi MXNet community. We are in the process of updating the base AMIs for CI
> with an updated CUDA driver to fix the CI blockage.
>
> We would need help from the community to diagnose some of the build errors
> which don't seem related to the infrastructure.
>
> I have observed this build failure with tvm when not installing the cuda
> driver in the container:
>
>
> https://pastebin.com/bQA0W2U4
>
> centos gpu builds and tests seem to run with the updated AMI and changes
> to the container.
>
>
> Thanks.
>
>
> On Mon, Dec 2, 2019 at 12:11 PM Pedro Larroy 
> wrote:
>
>> Small update about CI, which is blocked.
>>
>> Seems there's a nvidia driver compatibility problem in the base AMI that
>> is running in GPU instances and the nvidia docker images that we use for
>> building and testing.
>>
>> We are working on providing a fix by updating the base images as doesn't
>> seem to be easy to fix by just changing the container.
>>
>> Thanks.
>>
>> Pedro.
>>
>


Re: CI Update

2019-12-03 Thread Pedro Larroy
Hi MXNet community. We are in the process of updating the base AMIs for CI
with an updated CUDA driver to fix the CI blockage.

We would need help from the community to diagnose some of the build errors
which don't seem related to the infrastructure.

I have observed this build failure with tvm when not installing the cuda
driver in the container:


https://pastebin.com/bQA0W2U4

centos gpu builds and tests seem to run with the updated AMI and changes to
the container.


Thanks.


On Mon, Dec 2, 2019 at 12:11 PM Pedro Larroy 
wrote:

> Small update about CI, which is blocked.
>
> Seems there's a nvidia driver compatibility problem in the base AMI that
> is running in GPU instances and the nvidia docker images that we use for
> building and testing.
>
> We are working on providing a fix by updating the base images as doesn't
> seem to be easy to fix by just changing the container.
>
> Thanks.
>
> Pedro.
>


CI Update

2019-12-02 Thread Pedro Larroy
Small update about CI, which is blocked.

It seems there's an NVIDIA driver compatibility problem between the base AMI
running on GPU instances and the nvidia docker images that we use for
building and testing.

We are working on providing a fix by updating the base images, as this
doesn't seem to be easy to fix by just changing the container.

Thanks.

Pedro.


Please remove conflicting Open MP version from CMake builds

2019-11-30 Thread Pedro Larroy
(py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd
build/libmxnet.so| grep -i openmp
libomp.so =>
/home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
(0x7fde0991d000)
(py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$ python
~/deeplearning-benchmark/image_classification/infer_imagenet.py --use-rec
--batch-size 256 --dtype float32 --num-data-workers 40 --mode hybrid
--model resnet50_v2 --use-pretrained --kvstore local --log-interval 1
--rec-val ~/data/val-passthrough.rec --rec-val-idx
~/data/val-passthrough.idx
INFO:root:Namespace(batch_norm=False, batch_size=256,
data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32',
kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs',
lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
num_epochs=3, num_gpus=0, num_workers=40,
rec_val='/home/piotr/data/val-passthrough.rec',
rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
save_frequency=0, top_k=0, use_pretrained=True, use_rec=True, use_se=False,
warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
[10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
/home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
INFO:root:Batch [0]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
INFO:root:Batch [1]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
INFO:root:Batch [2]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363







(py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp (upstream_master)+$
git st
On branch upstream_master
Your branch is up to date with 'origin/upstream_master'.

Changes not staged for commit:
  (use "git add/rm ..." to update what will be committed)
  (use "git checkout -- ..." to discard changes in working directory)

deleted:3rdparty/openmp

no changes added to commit (use "git add" and/or "git commit -a")
(py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp (upstream_master)+$
ldd build/libmxnet.so | grep -i omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x7f941241c000)

(py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp (upstream_master)+$
python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
--use-rec --batch-size 256 --dtype float32 --num-data-workers 40 --mode
hybrid --model resnet50_v2 --use-pretrained --kvstore local --log-interval
1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
~/data/val-passthrough.idx
INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
INFO:root:Batch [16]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
INFO:root:Batch [17]
INFO:root:Top 1 accuracy: 0
INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033


Re: [Discuss] MXNet Python < 3.6 Support Deprecation

2019-11-06 Thread Pedro Larroy
In Numpy they are considering dropping 3.5 support for 1.18 or 1.19.

P.

On Tue, Nov 5, 2019 at 11:15 PM Xingjian SHI  wrote:

> I don’t think we should drop Python 3.5 now because Ubuntu 16.04 ships
> with that version. I suggest that we should revisit it next year.
>
> Best,
> Xingjian
> 
> From: Sheng Zha 
> Sent: Tuesday, August 27, 2019 10:49 AM
> To: d...@mxnet.apache.org
> Subject: Re: [Discuss] MXNet Python < 3.6 Support Deprecation
>
> Good summary. At the start the discussion thread my ask is to announce the
> intention of py2 deprecation in the next release, and then actually
> deprecate py2 in the next major release. Thus, the appropriate timing for
> dropping py2 support in CI should be the start of the next major release.
> The py35 vs py36 discussion will not affect the outcome of py2 deprecation.
>
> BTW, one alternative option to a formal voting in the Apache way is to
> through lazy consensus [1], which could apply more in our project. Given
> the positive feedback in this discussion thread, I will assume lazy
> consensus in 72hrs on py2 deprecation as defined above.
>
> [1] https://community.apache.org/committers/lazyConsensus.html
>
> On 2019/08/27 00:19:14, Marco de Abreu  wrote:
> > Pedro,
> >
> > thanks for already starting these efforts, but it might be too early for
> > that. Right now, this is a discussion thread where we try to gather
> > different opinions in order to lay a good base for a future voting
> thread.
> > In there, we would define the detailed timeline, versions etc. Until the
> > vote has passed, I'd say that it's too early to draw any conclusions. So
> > far, there are two open discussion points:
> >
> > 1. Which Python version to support. 3.5 vs 3.6 is currently in the
> > discussion due to Ubuntu 16.04 being shipped with 3.5 while the biggest
> > market share being 3.6 as of now.
> > 2. When to do the deprecation. EOY to match with official Python 2
> > deprecation, in 1.5 years to be in line with Ubuntu 16.04 LTS or with the
> > next major release (2.0) to adhere to semantic versioning.
> >
> > Once these points (and any future ones) have been properly discussed and
> > the community came to an agreement, we can formalize it with a voting
> > thread. Until then, I'd recommend to refrain from any actions or
> > user-facing communication regarding this topic.
> >
> > Best regards,
> > Marco
> >
> > On Tue, Aug 27, 2019 at 1:29 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > I have sent a PR that removes Python2 from CI. But was closed. I
> thought
> > > everyone was +1 on this one. This would remove quite a bit of load on
> CI:
> > >
> > > https://github.com/apache/incubator-mxnet/pull/15990
> > >
> > > If it's not the right time to do this, what steps do we need to take?
> > >
> > > Pedro.
> > >
> > >
> > > On Mon, Aug 26, 2019 at 1:27 AM Leonard Lausen 
> wrote:
> > >
> > > > Lieven Govaerts  writes:
> > > > > Hi,
> > > > >
> > > > > On Thu, 22 Aug 2019 at 17:01, Leonard Lausen 
> > > wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> Pedro stated "Seems 3.6 is a reasonable choice." and there have
> been a
> > > > >> few +1 after Chaitanya's reply to Pedro. I would like to check if
> > > these
> > > > >> only refer to Chaitanya's mail about a dedicated "improvement"
> effort
> > > or
> > > > >> about dropping 3.5.
> > > > >>
> > > > >> Thus two questions:
> > > > >>
> > > > >> 1) Are there any concerns about dropping Python 3.5? Now is your
> > > chance
> > > > to
> > > > >> speak up if you think so.
> > > > >>
> > > > >>
> > > > > Ubuntu 16.04 LTS defaults to Python 3.5.x . The LTS releases are
> > > > supported
> > > > > for 5 years, so for 16.04 LTS it ends in 1.5 years.
> > > > >
> > > > > I'm not saying you should wait for 1.5 more years, people can
> upgrade
> > > to
> > > > > 18.04 LTS after all, but may I suggest you make this switch in a
> major
> > > > > release only? More specifically, ensure that Python 3.6-only code
> > > doesn't
> > > > > accidentally gets merged into a 1.5.X patch release.
&

Re: [DISCUSS] CI Access Control

2019-09-27 Thread Pedro Larroy
We will address the shortcomings that Marco outlined by using a pipeline to
deploy the CI infrastructure, which will allow for contributions and easy
redeployment and rollback in case of issues.

I would recommend planning a migration towards Drone IO or similar, with an
initial prototype to validate that the main use cases are covered.

Pedro.

On Thu, Sep 19, 2019 at 2:29 PM Sheng Zha  wrote:

> Hi Marco,
>
> Thank you for sharing the insights. The discussion is intended for setting
> goals so that future design improvement to the CI can take these goals into
> consideration. Thus, while I fully recognize that there could be difficulty
> in implementation, I'd still like to confirm with the community if the
> outlined access control recommendation is at the right level.
>
> To summarize your concerns:
> - opening up access control should be conditioned on having good version
> control and roll-back mechanism to ease the operation burden from breakage,
> which is more likely given larger user base.
> - upgrades to the system would be better managed as planned and collective
> efforts instead of adhoc tasks performed by uncoordinated individuals.
>
> You also mentioned that "changes to the system should only be done by the
> administrators". It's exactly the intention of this thread is to define who
> would qualify as administrators. Currently, such qualification is opaque,
> and only happens within a group in Amazon.
>
> On the other hand, this current way can, and already has caused friction.
> When this project's daily activity of validating and merging code is
> affected due to the system's instability, the community members have no
> choice but to wait for the issues to be resolved by the current system
> administrators. Other affected community members have no way to help even
> if they wish to.
>
> Given the existing Apache project governance model, I'd recommend that the
> goal for CI access control be set so that committer and PMC member who
> wishes to be involved should have the right to help.
>
> -sz
>
> On 2019/09/17 12:49:20, Marco de Abreu  wrote:
> > Ah, with regards to #1 and #2: Currently, we don't have any plugins that
> > control the actions of a single user and allows us to monitor and rate
> > limit them. Just giving trigger permission (which is also tied with
> > abort-permission if I recall correctly), would allow a malicious user to
> > start a huge number of jobs and thus either create immense costs or bring
> > down the system. Also, we'd have to check how we can restrict the trigger
> > permission to specific jobs.
> >
> > -Marco
> >
> > On Tue, Sep 17, 2019 at 2:47 PM Marco de Abreu 
> > wrote:
> >
> > > Hi Sheng,
> > >
> > > will I'm in general all in favour of widening the access to distribute
> the
> > > tasks, the situation around the CI system in particular is a bit more
> > > difficult.
> > >
> > > As far as I know, the creation of the CI system is neither automated,
> > > versioned nor backed up or safeguarded. This means that if somebody
> makes a
> > > change that breaks something, we're left with a broken system we can't
> > > recover from. Thus, I preferred it in the past to restrict the access
> as
> > > much as possible (at least to Prod) to avoid these situations from
> > > happening. While #1 and #2 are already possible today (we have two
> roles
> > > for committers and regular users that allow this already), #3 and #4
> come
> > > with a significant risk for the stability of the system.
> > >
> > > As soon as a job is added or changed, a lot of things happen in
> Jenkins -
> > > one of these tasks is the SCM scan which tries to determine the
> branches
> > > the job should run on. For somebody who is inexperienced, the first
> pitfall
> > > is that suddenly hundreds of jobs are being spawned which will
> certainly
> > > overload Jenkins and render it unusable. There are a lot of tricks and
> I
> > > could elaborate them, but basically the bottom line is that the
> > > configuration interface of Jenkins is far from fail-proof and exposes a
> > > significant risk if accessed by somebody who doesn't exactly know what
> > > they're doing - speak, we would need to design some kind of training
> and
> > > even that would not safeguard us from these fatal events.
> > >
> > > There's the whole security aspect around user-facing artifact
> generation
> > > of CI/CD and the possibility of them being tampered, but I don't think
> I
> > > have to elaborate that.
> > >
> > > With regards to #4 especially, I'd say that the risk of somebody just
> > > upgrading the system or changing plugins inherits an even bigger risk.
> > > Plugins are notoriously unsafe and system updates have also shown to
> not
> > > really go like a breeze. I'd argue that changes to the system should
> only
> > > be done by the administrators of it since they have a bigger overview
> over
> > > all the things that are currently going on while also having the full
> > > access (backups before 

[DISCUSS] Remove amalgamation

2019-09-11 Thread Pedro Larroy
Hi Anirudh

Appreciate your feedback, and sorry if my email came across that way to you;
I think you might be missing some context. I don't think calling something
hacky is a bad thing, and it isn't supposed to be the topic of the
discussion. It was reported as not working by users, hence the original
thread. It was a request for opinions from people who might actually have
tried to work with MXNet on Android.

Thanks.

Pedro.


On Tuesday, September 10, 2019, Anirudh Subramanian 
wrote:
> Hi Pedro,
>
> I don't see anything "destructive" with Chris asking for justification for
> you calling something "hacky". The only email in this thread where I see
ad
> hominems and disrespectful comments is your email.
>
> On Sat, Sep 7, 2019, 10:18 PM Pedro Larroy 
> wrote:
>
>> Apache mentors should have a look at these reincident harassment and
>> destructive behaviors which demotivate contributions and take action. It
>> takes only one bad apple to ruin a community.
>>
>> The mobile solution that is known to work as of know is cross compiling
>> with "ci/build.py -p build.android_armv8" or "build.android_armv7". The
>> only advantage of amalgamation is to provide a smaller binary that we
could
>> accomplish with the C preprocessor.
>>
>> My technical contributions speak for themselves, including porting MXNet
to
>> Android and ARM and helping many users run MXNet in Jetson, Raspberry Pi
>> and Android amongst many other topics. I have never been disrespectful to
>> anyone. I'm entitled to my own technical opinions about amalgamation or
any
>> other piece of code whatsoever, that's no personal disrespect to anyone
and
>> perfectly valid. If you are not interested in this project anymore, do us
>> all a favor and stop trolling and being toxic. If you want my respect,
step
>> up your technical contributions, be positive and encourage others, this
>> including commits, I haven't seen for many months, please be positive and
>> constructive. This scorched-earth attitude is only reflecting bad on you.
>> I'm certainly not interested in your ad-hominems or unasked for technical
>> advice, which to be honest,  showing poor judgment and ignorance. Myself
>> and others have come up with numbers, graphs, metrics and arguments and
>> have been met with dismissal, trolling and sea-lioning. I have recieved
>> your insults via public and private channels (such as linkedin) as have
>> others. This is not ok and has to stop. If you have something personal
>> against me or against your former employer, this is not the right place
or
>> forum.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Sep 6, 2019 at 3:56 PM Chris Olivier 
>> wrote:
>>
>> > Hi Pedro,
>> >
>> > While I was not involved with amalgamation or its development in any
way,
>> > can you please refrain from referring to the work of others as a "hacky
>> > solution"?  This is derogatory slang and the statement was not
supported
>> > with any justification for such name-calling.  Someone spent a good
deal
>> of
>> > time on this solution at some point in time and I am sure it worked for
>> its
>> > purpose at that time -- I think it was used in the original javascript
>> port
>> > as well, actually -- and it is disrespectful to call their efforts
>> > "hacky".  Please respect what came before.
>> >
>> > Thanks for understanding,
>> >
>> > -Chris
>> >
>> >
>> > On Fri, Sep 6, 2019 at 3:07 PM Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> > wrote:
>> >
>> > > Hi
>> > >
>> > > I would like to propose to remove amalgamation from MXNet and CI,
users
>> > > have reported that they couldn't use it successfully in Android, and
>> > > instead they were able to use the cross compiled docker build
>> > successfully.
>> > >
>> > > Any reason why we shouldn't remove this hacky solution?
>> > >
>> > > Pedro.
>> > >
>> >
>>
>


Re: [DISCUSS] Remove amalgamation

2019-09-07 Thread Pedro Larroy
Apache mentors should have a look at these repeated harassment and
destructive behaviors, which demotivate contributions, and take action. It
takes only one bad apple to ruin a community.

The mobile solution that is known to work as of now is cross compiling
with "ci/build.py -p build.android_armv8" or "build.android_armv7". The
only advantage of amalgamation is to provide a smaller binary that we could
accomplish with the C preprocessor.

My technical contributions speak for themselves, including porting MXNet to
Android and ARM and helping many users run MXNet in Jetson, Raspberry Pi
and Android amongst many other topics. I have never been disrespectful to
anyone. I'm entitled to my own technical opinions about amalgamation or any
other piece of code whatsoever, that's no personal disrespect to anyone and
perfectly valid. If you are not interested in this project anymore, do us
all a favor and stop trolling and being toxic. If you want my respect, step
up your technical contributions, be positive and encourage others, this
including commits, I haven't seen for many months, please be positive and
constructive. This scorched-earth attitude is only reflecting bad on you.
I'm certainly not interested in your ad-hominems or unasked for technical
advice, which, to be honest, shows poor judgment and ignorance. Myself
and others have come up with numbers, graphs, metrics and arguments and
have been met with dismissal, trolling and sea-lioning. I have received
your insults via public and private channels (such as linkedin) as have
others. This is not ok and has to stop. If you have something personal
against me or against your former employer, this is not the right place or
forum.















On Fri, Sep 6, 2019 at 3:56 PM Chris Olivier  wrote:

> Hi Pedro,
>
> While I was not involved with amalgamation or its development in any way,
> can you please refrain from referring to the work of others as a "hacky
> solution"?  This is derogatory slang and the statement was not supported
> with any justification for such name-calling.  Someone spent a good deal of
> time on this solution at some point in time and I am sure it worked for its
> purpose at that time -- I think it was used in the original javascript port
> as well, actually -- and it is disrespectful to call their efforts
> "hacky".  Please respect what came before.
>
> Thanks for understanding,
>
> -Chris
>
>
> On Fri, Sep 6, 2019 at 3:07 PM Pedro Larroy 
> wrote:
>
> > Hi
> >
> > I would like to propose to remove amalgamation from MXNet and CI, users
> > have reported that they couldn't use it successfully in Android, and
> > instead they were able to use the cross compiled docker build
> successfully.
> >
> > Any reason why we shouldn't remove this hacky solution?
> >
> > Pedro.
> >
>


[DISCUSS] Remove amalgamation

2019-09-06 Thread Pedro Larroy
Hi

I would like to propose removing amalgamation from MXNet and CI. Users have
reported that they couldn't use it successfully on Android, and that they
were able to use the cross-compiled Docker build instead.

Any reason why we shouldn't remove this hacky solution?

Pedro.


Re: [VOTE] Python 2 Removal for MXNet 1.6

2019-09-06 Thread Pedro Larroy
Did this vote pass? Can we remove Python2 support from master?

On Tue, Aug 27, 2019 at 2:51 PM Pedro Larroy 
wrote:

> +1
>
> On Tue, Aug 27, 2019 at 3:49 AM Leonard Lausen  wrote:
>
>> Due to References: header the prior email was still sorted in the
>> discussion thread. Cancelling this and resending without that header.
>>
>> Leonard Lausen  writes:
>>
>> > Marco de Abreu  writes:
>> >> 1. Which Python version to support. 3.5 vs 3.6 is currently in the
>> >> discussion due to Ubuntu 16.04 being shipped with 3.5 while the biggest
>> >> market share being 3.6 as of now.
>> >
>> > We could drop Python 2 even before deciding when to drop 3.5.
>> >
>> >> 2. When to do the deprecation. EOY to match with official Python 2
>> >> deprecation, in 1.5 years to be in line with Ubuntu 16.04 LTS or with
>> the
>> >> next major release (2.0) to adhere to semantic versioning.
>> >
>> > From a Semantic Versioning standepoint, "Given a version number
>> > MAJOR.MINOR.PATCH, increment the: MAJOR version when you make
>> > incompatible API changes, MINOR version when you add functionality in a
>> > backwards compatible manner, [...]" [1].
>> >
>> > Based on Semantic Versioning, the question is if we consider Python 2
>> > support to be part of our API, or rather independent. In the latter
>> > case, dropping for 1.6 is fine.
>> >
>> > From a user-experience perspective, users that want to continue using
>> > Python 2 for the next 127 days (until EOL date) currently have bigger
>> > worries than needing to upgrade to the next upcoming MXNet release. They
>> > must transition their codebase to Py3 within 127 days. For those days,
>> > they may just stay on MXNet 1.5?
>> >
>> > [1]: https://semver.org/
>> >
>> >> Once these points (and any future ones) have been properly discussed
>> and
>> >> the community came to an agreement, we can formalize it with a voting
>> >> thread. Until then, I'd recommend to refrain from any actions or
>> >> user-facing communication regarding this topic.
>> >
>> > Thus, let's start a vote on dropping Python 2 for MXNet 1.6.
>> > It's fine if this vote fails, but we need to get a clear understanding
>> > how we want to move forward.
>> >
>> > For better visibility, I'm removing the In-Reply-To: header, which was
>> > pointing to
>> cahtwjdorqsrbau0a89xjwasawgbvgz7bojsu6tkmxdl+ruh...@mail.gmail.com
>> >
>> >> On Tue, Aug 27, 2019 at 1:29 AM Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> >> wrote:
>> >>
>> >>> I have sent a PR that removes Python2 from CI. But was closed. I
>> thought
>> >>> everyone was +1 on this one. This would remove quite a bit of load on
>> CI:
>> >>>
>> >>> https://github.com/apache/incubator-mxnet/pull/15990
>> >>>
>> >>> If it's not the right time to do this, what steps do we need to take?
>> >>>
>> >>> Pedro.
>>
>


Re: new website

2019-09-06 Thread Pedro Larroy
The new website looks great, Aaron. Nice work to everyone involved!

On Thu, Aug 29, 2019 at 5:26 PM Aaron Markham 
wrote:

> Hi everyone,
>
> I'm very excited to share a preview and the pull requests for a new
> website and new documentation pipelines.
>
> The following link is using Apache's new staging site setup. It is
> built from the new docs publishing pipelines in CI where a Jekyll
> website is built, and documentation artifacts from Clojure, CPP, Java,
> Julia, Python, R, and Scala are combined into one website.
>
> https://mxnet-beta.staged.apache.org
>
> It is the culmination of a lot of effort of several MXNet contributors.
>
> * A huge shout out goes to Thomas Delteil for the work on the new
> Jekyll-backend and beautiful-looking website, and for helping me out
> whenever I'd get stuck on revamping the 7 different API docs systems
> in CI.
> * Soji Adeshina and Vishaal Kapoor both helping me with the system
> design for the new docs pipelines.
> * Per Goncalves da Silva and Marco de Abreu both helped me with
> figuring out CI issues.
> * We also ported over Mu Li's beta site for the Python & R APIs which
> had many contributors there. Thanks goes to Mu, Ivy Bazan, Jonas
> Mueller, Aston Zhang, and Zhi Zhang for their help & contributions. I
> apologize in advance if I missed anyone.
>
> Highlights:
>
> * R docs are now generated as part of CI. There were issues with R
> docs coming from beta repo. They were not reproducible. So I began the
> process of creating the pdf doc that is expected by R users as an
> alternative. Thomas fixed a CPP bug that was blocking 90% of the docs
> from appearing. The R docs are 10x in length compared to the pdf we're
> hosting now!
>
> * Each other API is built in a micro-site fashion. You will notice
> that the reference API links will open up the site that is generated
> by that language's docs tools. We tried to keep the navigation common
> and do this for the Python API. This is something that can be expanded
> on for the other APIs in later updates to the website.
>
> * Each doc set can be generated separately with functions that will
> run in Docker and generate the docs artifacts. This means you can now
> focus on your preferred API and not have to deal with anything else.
>
> * Website changes are now much easier. You can serve Jekyll locally,
> and have it do incremental updates, so you can see your changes live
> without having to build MXNet or anything else. It's a pure front-end
> setup.
>
> * For website publishing, the MXNet binary is built once and then
> shared with the other docs generation pipelines.
>
> * For individual docs runs, you can run a "lite" binary build, then
> follow it up with the docs run you want.
>
> ---
>
> For example to build MXNet:
>
> ci/build.py --docker-registry mxnetcidev --platform ubuntu_cpu_lite
> /work/runtime_functions.sh build_ubuntu_cpu_docs
>
> Then to build the R docs:
>
> ci/build.py --docker-registry mxnetcidev --platform ubuntu_cpu_r
> /work/runtime_functions.sh build_r_docs
>
> There is now a Docker image and a runtime_function for each API
> (except Perl which is built offsite). Python is like this:
>
> ci/build.py --docker-registry mxnetcidev --platform ubuntu_cpu_python
> /work/runtime_functions.sh build_python_docs
>
> The pattern for platform is ubuntu_cpu_{api} and runtime_functions.sh
> is build_{api}_docs.
>
> Further information is on the developer wiki:
> https://cwiki.apache.org/confluence/display/MXNET/Building+the+New+Website
> 
>
> Ok, now this is where YOU come in. We need reviewers and testers.
>
> There are a lot of changes. My original PR was over 1,000 files with
> 83k additions and 55k deletions. So, Thomas broke this up into three
> pull requests that stack.
>
> Step 1 New Content https://github.com/apache/incubator-mxnet/pull/15884
> Step 2 Remove Old Content
> https://github.com/apache/incubator-mxnet/pull/15885
> Step 3 Setup New Jenkins
> https://github.com/apache/incubator-mxnet/pull/15886
>
> For reviewing purposes, start with the new content - what's easily
> visible on the preview website. This is mostly happening in the first
> PR:
> https://github.com/apache/incubator-mxnet/pull/15884
> You can also look at these helper PRs that show you the differences so
> it is easier to review what's happening in Steps 2 and 3. You can
> review these now as well.
> Step 1->2: https://github.com/ThomasDelteil/incubator-mxnet/pull/5
> Step 2->3: https://github.com/ThomasDelteil/incubator-mxnet/pull/6
>
> I really appreciate everyone's support on this effort.
>
> Cheers,
> Aaron
>


Re: [DISCUSS] Slim down scope of CI

2019-08-28 Thread Pedro Larroy
Gathering this data would be useful to make a decision.

I myself have invested lots of time porting to other platforms such as
Raspberry Pi, Jetson, ARM or Android. If the community is interested in
maintaining a platform there should be action, and for a long time I haven't
seen much effort there. Ideally I'd like to see several platforms, including
Windows, supported for portability reasons, but I question whether there's
bandwidth for that.

On Wednesday, August 28, 2019, Aaron Markham 
wrote:
> I have an open issue about gathering data per platform install so there
can
> be an informed discussion on prioritization or even cutting platforms.
> Until then... I wouldn't cut one.
>
> I would like to hear the pros and cons for dropping some native platform
> support in favor of containers. But... For windows I believe Nvidia still
> hasn't provided a way to access the GPU from docker on most windows OSs.
If
> that's true, that would imply we shouldn't drop windows assuming that
> windows users will be able to run things in docker. That being said, why
> support every range of possibilty of blas or mkl or opencv? Why not come
up
> with a set of strict dependencies for windows to simplify the build, CI,
> and binary distribution?
>
>
> On Wed, Aug 28, 2019, 09:30 Pedro Larroy 
> wrote:
>
>> Hi
>>
>> I would like to propose a discussion to slim down CI by dropping some
jobs
>> which are of questionable value, other which have received little
community
>> support:
>>
>> - Drop centos: we can't support every distro, if anything we should
clearly
>> specify which versions of base libraries are needed and test for the
latest
>> stable release of Ubuntu. I don't see much value in testing different
>> distribution flavours. Not sure what was the rationale for adding CentOS
in
>> the past but this might no longer apply.
>>
>> - Drop windows: windows has received little attention from the community
>> and mostly has served as a gatekeeping from CI.
>>
>> Dropping windows would allow to migrate to a more advanced CI framework
>> such as Drone or a fully container based one given that Jenkins has
quite a
>> bit of baggage and for the most part we are testing on containers.
>>
>> Pedro.
>>
>


[DISCUSS] Slim down scope of CI

2019-08-28 Thread Pedro Larroy
Hi

I would like to propose a discussion to slim down CI by dropping some jobs
which are of questionable value and others which have received little
community support:

- Drop CentOS: we can't support every distro; if anything, we should clearly
specify which versions of base libraries are needed and test on the latest
stable release of Ubuntu. I don't see much value in testing different
distribution flavours. I'm not sure what the rationale was for adding CentOS
in the past, but it might no longer apply.

- Drop Windows: Windows has received little attention from the community
and has mostly served as a gatekeeper in CI.

Dropping Windows would allow us to migrate to a more advanced CI framework
such as Drone or a fully container-based one, given that Jenkins has quite a
bit of baggage and for the most part we are testing on containers.

Pedro.


Re: [VOTE] Python 2 Removal for MXNet 1.6

2019-08-27 Thread Pedro Larroy
+1

On Tue, Aug 27, 2019 at 3:49 AM Leonard Lausen  wrote:

> Due to References: header the prior email was still sorted in the
> discussion thread. Cancelling this and resending without that header.
>
> Leonard Lausen  writes:
>
> > Marco de Abreu  writes:
> >> 1. Which Python version to support. 3.5 vs 3.6 is currently in the
> >> discussion due to Ubuntu 16.04 being shipped with 3.5 while the biggest
> >> market share being 3.6 as of now.
> >
> > We could drop Python 2 even before deciding when to drop 3.5.
> >
> >> 2. When to do the deprecation. EOY to match with official Python 2
> >> deprecation, in 1.5 years to be in line with Ubuntu 16.04 LTS or with
> the
> >> next major release (2.0) to adhere to semantic versioning.
> >
> > From a Semantic Versioning standepoint, "Given a version number
> > MAJOR.MINOR.PATCH, increment the: MAJOR version when you make
> > incompatible API changes, MINOR version when you add functionality in a
> > backwards compatible manner, [...]" [1].
> >
> > Based on Semantic Versioning, the question is if we consider Python 2
> > support to be part of our API, or rather independent. In the latter
> > case, dropping for 1.6 is fine.
> >
> > From a user-experience perspective, users that want to continue using
> > Python 2 for the next 127 days (until EOL date) currently have bigger
> > worries than needing to upgrade to the next upcoming MXNet release. They
> > must transition their codebase to Py3 within 127 days. For those days,
> > they may just stay on MXNet 1.5?
> >
> > [1]: https://semver.org/
> >
> >> Once these points (and any future ones) have been properly discussed and
> >> the community came to an agreement, we can formalize it with a voting
> >> thread. Until then, I'd recommend to refrain from any actions or
> >> user-facing communication regarding this topic.
> >
> > Thus, let's start a vote on dropping Python 2 for MXNet 1.6.
> > It's fine if this vote fails, but we need to get a clear understanding
> > how we want to move forward.
> >
> > For better visibility, I'm removing the In-Reply-To: header, which was
> > pointing to
> cahtwjdorqsrbau0a89xjwasawgbvgz7bojsu6tkmxdl+ruh...@mail.gmail.com
> >
> >> On Tue, Aug 27, 2019 at 1:29 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> >> wrote:
> >>
> >>> I have sent a PR that removes Python2 from CI. But was closed. I
> thought
> >>> everyone was +1 on this one. This would remove quite a bit of load on
> CI:
> >>>
> >>> https://github.com/apache/incubator-mxnet/pull/15990
> >>>
> >>> If it's not the right time to do this, what steps do we need to take?
> >>>
> >>> Pedro.
>


Re: [Discussion] MXNet 1.5.1 release

2019-08-27 Thread Pedro Larroy
Ok. I was just asking if we want this fix in 1.5.1, since it addresses
crashes when using multiprocessing. The problem with cherry-picking is that
the patch contains the dynamic load change, which shouldn't impact anything
else but is not supposed to go into a release branch.
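
If it helps, one way to take only the multiprocessing fix onto the release
branch is a partial cherry-pick along these lines (the SHA and the file path
are placeholders, not the actual ones from the PR):

git checkout v1.5.x
git cherry-pick -n <fix-sha>                      # apply the patch without committing
git reset HEAD path/to/dynamic_load_change.cc     # unstage the unrelated change
git checkout -- path/to/dynamic_load_change.cc    # and drop it from the worktree
git commit -m "Backport multiprocessing fix without the dynamic load change"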

On Tue, Aug 27, 2019 at 1:19 PM Lin Yuan  wrote:

> https://github.com/apache/incubator-mxnet/pull/15762  contains some
> unrelated changes which is being reverted. Please do not cherry pick it
> yet.
>
> On Mon, Aug 26, 2019 at 4:25 PM Pedro Larroy  >
> wrote:
>
> > There's a fix that I did which seems to still produce crashes in 1.5 for
> > some users, which I got notice today and is fixed in master.
> >
> > Might be useful to put in 1.5.1:
> > https://github.com/apache/incubator-mxnet/pull/15762   ?
> >
> > Pedro.
> >
> > On Tue, Aug 20, 2019 at 7:49 AM Tao Lv  wrote:
> >
> > > Hi dev,
> > >
> > > Here is an update for the 1.5.1 patch release.
> > >
> > > 1. Thanks for the effort from whole community, we have cherry picked a
> > > bunch of fixes to v1.5.x branch. So far, the branch looks healthy:
> > >
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/activity/
> > > 2. https://github.com/apache/incubator-mxnet/pull/15803 cannot pass
> the
> > > CI;
> > > 3. I hope julia folks can take a look at the back porting for
> > > https://github.com/apache/incubator-mxnet/pull/15609 and
> > > https://github.com/apache/incubator-mxnet/pull/15608 - do we still
> need
> > > them?
> > 4. License issue of cub and pybind is still not fixed. We also have a
> > license issue with a cat image in the julia examples.
> > > https://github.com/apache/incubator-mxnet/issues/15542
> > > 5. Still no progress for the sidebar issue:
> > > https://github.com/apache/incubator-mxnet/issues/15200
> > > 6. There is a GPU OOM issue in 1.5.0 release and already root caused by
> > > Lin:
> > >
> > >
> >
> https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-522780492
> > > .
> > > We need decide whether we want to get it fixed in the 1.5.1 patch
> > release.
> > >
> > > Please find details in
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/1.5.1+Release+Plan+and+Status
> > > .
> > >
> > > Thanks,
> > > -tao
> > >
> > > On Mon, Aug 12, 2019 at 9:57 PM Zhao, Patric 
> > > wrote:
> > >
> > > > Thanks for the explanation, Marco & Tao. Sounds great!
> > > >
> > > > > -Original Message-
> > > > > From: Tao Lv 
> > > > > Sent: Monday, August 12, 2019 9:54 PM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Subject: Re: [Discussion] MXNet 1.5.1 release
> > > > >
> > > > > > Regarding the open issue, is there default code owner/maintainer?
> > If
> > > > > > so, he/she will be the right people to look into the issue.
> > > > > > https://github.com/apache/incubator-mxnet/blob/master/CODEOWNERS
> > > > > >
> > > > >
> > > > > I have no idea. But the CODEOWNERS is used to receive change
> > > > notificaitons,
> > > > > not actually indicates the maintainer of a piece of code.
> > > > >
> > > > > Do we have regularly build, run, functionality and performance
> > testing
> > > > for
> > > > > > this release?
> > > > >
> > > > >
> > > > > As Marco mentioned, build, run and functionality of v1.5.x branch
> are
> > > > tracked
> > > > > automatically by the CI for each cherry pick pull request and the
> > > > nightly tests
> > > > > here:
> > > > > http://jenkins.mxnet-ci.amazon-
> > > > > ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/activity
> .
> > > > > I see it's healthy so far.
> > > > >
> > > > > For performance, Shufan will track CPU performance with his test
> > suite
> > > > and
> > > > > send out the report once the branch is frozen. I'm not sure if
> there
> > > are
> > > > any
> > > > > other performance tests.
> > > > >
> > > > > On Mon, Aug 12, 2019 at 9:36 PM Marco de Abreu
> > > > > 
> > > > &

Re: [Discuss] MXNet Python < 3.6 Support Deprecation

2019-08-26 Thread Pedro Larroy
I have sent a PR that removes Python 2 from CI, but it was closed. I thought
everyone was +1 on this one. This would remove quite a bit of load on CI:

https://github.com/apache/incubator-mxnet/pull/15990

If it's not the right time to do this, what steps do we need to take?

Pedro.


On Mon, Aug 26, 2019 at 1:27 AM Leonard Lausen  wrote:

> Lieven Govaerts  writes:
> > Hi,
> >
> > On Thu, 22 Aug 2019 at 17:01, Leonard Lausen  wrote:
> >
> >> Hi,
> >>
> >> Pedro stated "Seems 3.6 is a reasonable choice." and there have been a
> >> few +1 after Chaitanya's reply to Pedro. I would like to check if these
> >> only refer to Chaitanya's mail about a dedicated "improvement" effort or
> >> about dropping 3.5.
> >>
> >> Thus two questions:
> >>
> >> 1) Are there any concerns about dropping Python 3.5? Now is your chance
> to
> >> speak up if you think so.
> >>
> >>
> > Ubuntu 16.04 LTS defaults to Python 3.5.x . The LTS releases are
> supported
> > for 5 years, so for 16.04 LTS it ends in 1.5 years.
> >
> > I'm not saying you should wait for 1.5 more years, people can upgrade to
> > 18.04 LTS after all, but may I suggest you make this switch in a major
> > release only? More specifically, ensure that Python 3.6-only code doesn't
> > accidentally gets merged into a 1.5.X patch release.
> >
> > thanks,
> >
> > Lieven
>
> Hi Lieven,
>
> thanks. I believe the Python version compatibility falls under the
> semantic versioning umbrella of things not to break within any 1.x
> release. Thus above suggestion would be with respect to a 2.x release or
> experimental / preview / new features added to 1.x, without affecting
> existing 1.x features. It would not affect 1.5.x patch releases.
>
> Best regards,
> Leonard
>
>
> >> 2) Should new MXNet 1.x (experimental?) functionality (for example numpy
> >> compatible interface) only target the Python versions to be supported in
> >> MXNet 2? The current plan is to make many MXNet 2 features available as
> >> "opt-in" in MXNet 1.x. Supporting older Python versions on MXNet 1 for
> >> these features may impact design and functionality and create
> >> unnecessary technical debt.
>


Re: [Discussion] MXNet 1.5.1 release

2019-08-26 Thread Pedro Larroy
There's an issue I fixed that still seems to produce crashes in 1.5 for some
users (I got notice of this today); it is fixed in master.

Might be useful to put in 1.5.1:
https://github.com/apache/incubator-mxnet/pull/15762   ?

Pedro.

On Tue, Aug 20, 2019 at 7:49 AM Tao Lv  wrote:

> Hi dev,
>
> Here is an update for the 1.5.1 patch release.
>
> 1. Thanks for the effort from whole community, we have cherry picked a
> bunch of fixes to v1.5.x branch. So far, the branch looks healthy:
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/activity/
> 2. https://github.com/apache/incubator-mxnet/pull/15803 cannot pass the
> CI;
> 3. I hope julia folks can take a look at the back porting for
> https://github.com/apache/incubator-mxnet/pull/15609 and
> https://github.com/apache/incubator-mxnet/pull/15608 - do we still need
> them?
> 4. License issue of cub and pybind is still not fixed. We also has a
> license issue of a cat image in julia examples.
> https://github.com/apache/incubator-mxnet/issues/15542
> 5. Still no progress for the sidebar issue:
> https://github.com/apache/incubator-mxnet/issues/15200
> 6. There is a GPU OOM issue in 1.5.0 release and already root caused by
> Lin:
>
> https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-522780492
> .
> We need decide whether we want to get it fixed in the 1.5.1 patch release.
>
> Please find details in
>
> https://cwiki.apache.org/confluence/display/MXNET/1.5.1+Release+Plan+and+Status
> .
>
> Thanks,
> -tao
>
> On Mon, Aug 12, 2019 at 9:57 PM Zhao, Patric 
> wrote:
>
> > Thanks for the explanation, Marco & Tao. Sounds great!
> >
> > > -Original Message-
> > > From: Tao Lv 
> > > Sent: Monday, August 12, 2019 9:54 PM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: [Discussion] MXNet 1.5.1 release
> > >
> > > > Regarding the open issue, is there default code owner/maintainer? If
> > > > so, he/she will be the right people to look into the issue.
> > > > https://github.com/apache/incubator-mxnet/blob/master/CODEOWNERS
> > > >
> > >
> > > I have no idea. But the CODEOWNERS is used to receive change
> > notificaitons,
> > > not actually indicates the maintainer of a piece of code.
> > >
> > > Do we have regularly build, run, functionality and performance testing
> > for
> > > > this release?
> > >
> > >
> > > As Marco mentioned, build, run and functionality of v1.5.x branch are
> > tracked
> > > automatically by the CI for each cherry pick pull request and the
> > nightly tests
> > > here:
> > > http://jenkins.mxnet-ci.amazon-
> > > ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/activity.
> > > I see it's healthy so far.
> > >
> > > For performance, Shufan will track CPU performance with his test suite
> > and
> > > send out the report once the branch is frozen. I'm not sure if there
> are
> > any
> > > other performance tests.
> > >
> > > On Mon, Aug 12, 2019 at 9:36 PM Marco de Abreu
> > > 
> > > wrote:
> > >
> > > > Hi Patric,
> > > >
> > > > CI should automatically pick up the branch and validate it as usual.
> > > >
> > > > Best regards,
> > > > Marco
> > > >
> > > > Zhao, Patric  schrieb am Mo., 12. Aug. 2019,
> > 15:22:
> > > >
> > > > > It's great works, Tao 
> > > > >
> > > > > Regarding the open issue, is there default code owner/maintainer?
> If
> > > > > so, he/she will be the right people to look into the issue.
> > > > > https://github.com/apache/incubator-
> > > mxnet/blob/master/CODEOWNERS
> > > > >
> > > > > Do we have regularly build, run, functionality and performance
> > > > > testing
> > > > for
> > > > > this release?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --Patric
> > > > >
> > > > > > -Original Message-
> > > > > > From: Tao Lv 
> > > > > > Sent: Monday, August 12, 2019 8:59 PM
> > > > > > To: dev@mxnet.incubator.apache.org
> > > > > > Subject: Re: [Discussion] MXNet 1.5.1 release
> > > > > >
> > > > > > Update:
> > > > > >
> > > > > > We're cherry picking fixes from the master to the v1.5.x branch.
> > > > > > Some
> > > > of
> > > > > > them are already merged. Please find details on the cwiki page:
> > > > > >
> https://cwiki.apache.org/confluence/display/MXNET/1.5.1+Release+Pl
> > > > > > an+a
> > > > > > nd+Status
> > > > > >
> > > > > >
> > > > > >  There are still 3 opens:
> > > > > > 1. Nightly test failure on CI (
> > > > > > https://github.com/apache/incubator-mxnet/issues/15374): The
> issue
> > > > > > is
> > > > > still
> > > > > > open. I'm wondering if it has been fixed or not. If not, is there
> > > > anyone
> > > > > > working on it?
> > > > > > 2. Broken Sidebar on website API for master and 1.5.0 (
> > > > > > https://github.com/apache/incubator-mxnet/issues/15200): I don't
> > > > > > see
> > > > any
> > > > > > progress on this issue? Do we still want to include it into 1.5.1
> > > > > > patch
> > > > > release?
> > > > > > 3. License issues need to be fixed before 1.6 release (
> > > > > > 

Re: CI and PRs

2019-08-26 Thread Pedro Larroy
Hi Chris, you are reading some confrontational or negative intent where
there is none, just diverse opinions and different ways of expressing
them.

We went with Marco for a beer and dinner together and talked about this and
we had a good exchange of technical ideas and opinions with mutual
respect; it is often much easier to talk in person than to risk the wrong
interpretation over email. (Isn't Apache about community over code?) You
and I should do it sometime if you want. I send you an initial beer emoji
as a friendly token of good intentions.

Marco's role in the project and in CI, his PMC role and his contributions
were never put into question. The question is how we can get more diverse
contributions and make it easier to do so, in order to grow the community
and help people contribute. Giving credit, acknowledging that many
activities are a team effort, and supporting those activities are some
ideas that I think might be useful going forward. My proposal is to
acknowledge those contributions and be more inclusive now that the
remaining infrastructure is open sourced.


Pedro.










On Fri, Aug 23, 2019 at 7:43 PM Chris Olivier  wrote:

> Pedro,
>
> I don’t see where Marco says that he “designed and implemented all aspects
> of CI by himself”.  I do think, however, that it’s fair to say that Marco
> was in charge of the design and most likely made the majority of design
> decisions as the CI was being built, especially around those tenents that
> he mentioned.  I know this because before I submitted Marco as a committer,
> I asked some his teammates whether Marco was really responsible for CI and
> the answer by all I asked were that CI was Marco's baby and he did most of
> it by some large margin (I am paraphrasing).  Taking other design inputs
> and examples (i.e. Apache CI) is all part of any responsible design
> process.
>
> In addition, I am not understanding the obfuscation of “people who
> contributed to CI”, “person/people who designed CI”, or even
> "person who oversees CI" as it is weaponized in your email.  Again, nowhere
> did Marco say that he did everything back then or since then.  I don't
> think it's fair to try to modify what Marco wrote and then try to turn it
> against him.  Reminds me of the techniques of network news these days,
> quite frankly (whichever side you're "on" doesn't matter, because both
> sides do it).
>
> -Chris
>
>
>
>
>
> On Fri, Aug 23, 2019 at 3:56 PM Pedro Larroy  >
> wrote:
>
> > Thanks for your response Marco, I think you have totally missed my
> original
> > point which was basically that someone volunteering effort on the CI is
> as
> > important as someone contributing a feature. From my perspective this
> > hasn't been the case, and we had to rely a lot on you and Sheng to submit
> > fixes which required access, also to relay communication with Apache
> infra.
> > Also in many cases we had to rely on you to channel fixes, PRs, disable
> > tests etc. If the community is fine having this kind of bottleneck, fine
> > with me. From my point of view and the feedback from myself and other
> > people which contributed to CI this was not always a good experience.
> > Having a welcoming and inclusive community is very important. I don't
> want
> > to start a discussion on this, but invite the community to do a bit of
> soul
> > searching on this topic, now that the infrastructure is open source.
> >
> > Also I find it surprising that you claim that you designed the CI yourself,
> > when this was a joint work of many individuals, including the old Apache
> CI
> > and additional contributions and code reviewers, people who were oncall
> for
> > this service or the autoscaling approach which if I remember correctly
> came
> > from a humble servant. Kellen did a lot of pair programming and code
> > reviews. Obviously you have done a lot of work on CI which has had a
> huge
> > positive impact on the project and your recognition is well deserved. The
> > technical details you mention on your email are perfectly true and valid.
> >
> > Below is a rough list of individuals who contributed to CI, I would like
> to
> > thank all of them since without this work, we wouldn't be able to deliver
> > with the quality that we have done in the past.
> >
> >
> > pllarroy@mac:0: ~/d/m/ci [fc_higher_order_grad_2]> git log
> > --pretty=format:%aN . | sort | uniq -c | sort -n | tail -n 10
> >6 Zach Kimberg
> >6 stu1130
> >7 Jake Lee
> >8 Aaron Markham
> >   11 Lanking
> >   12 Anton Chernov
> >   13 perdasilva
> >   26 Kellen Sunderland
> >   34 Marco de Abreu
> >   46 Pedro Larroy
> >
> > pll

Re: CI and PRs

2019-08-23 Thread Pedro Larroy
Thanks for your response Marco, I think you have totally missed my original
point which was basically that someone volunteering effort on the CI is as
important as someone contributing a feature. From my perspective this
hasn't been the case, and we had to rely a lot on you and Sheng to submit
fixes which required access, also to relay communication with Apache infra.
Also in many cases we had to rely on you to channel fixes, PRs, disable
tests etc. If the community is fine having this kind of bottleneck, fine
with me. From my point of view, and from the feedback of myself and other
people who contributed to CI, this was not always a good experience.
Having a welcoming and inclusive community is very important. I don't want
to start a discussion on this, but invite the community to do a bit of soul
searching on this topic, now that the infrastructure is open source.

Also I find it surprising that you claim that you designed the CI yourself,
when this was a joint work of many individuals, including the old Apache CI
and additional contributions and code reviewers, people who were oncall for
this service or the autoscaling approach which if I remember correctly came
from a humble servant. Kellen did a lot of pair programming and code
reviews. Obviously you have done a lot of work on CI, which has had a huge
positive impact on the project and your recognition is well deserved. The
technical details you mention on your email are perfectly true and valid.

Below is a rough list of individuals who contributed to CI. I would like
to thank all of them, since without this work we wouldn't have been able
to deliver with the quality that we have in the past.


pllarroy@mac:0: ~/d/m/ci [fc_higher_order_grad_2]> git log
--pretty=format:%aN . | sort | uniq -c | sort -n | tail -n 10
   6 Zach Kimberg
   6 stu1130
   7 Jake Lee
   8 Aaron Markham
  11 Lanking
  12 Anton Chernov
  13 perdasilva
  26 Kellen Sunderland
  34 Marco de Abreu
  46 Pedro Larroy

pllarroy@mac:0: ~/d/mxnet_ci_general [master]> git log --pretty=format:%aN
| sort | uniq -c | sort -n
   1 Gavin M. Bell
   1 de Abreu
   6 Bair
   7 Kellen Sunderland
   8 Jose Luis Contreras
  14 perdasilva
  20 Per Goncalves da Silva
  29 Anton Chernov
  39 Chance Bair
  96 Pedro Larroy
 209 Marco de Abreu



Pedro.

On Fri, Aug 23, 2019 at 3:18 PM Marco de Abreu 
wrote:

> I've heard this request multiple times and so far, I'm having issues
> understanding the direct correlation between having committer permissions
> and being able to manage CI.
>
> When I designed the CI, one of the tenets was maintainability and
> accessbility for the community: I wanted to avoid that somebody needs
> certain privileges in order to execute regular actions. The result was the
> strong usage of Jenkinsfiles, Dockerfiles and the runtime functions. The
> combination of these techniques allowed somebody to create a job from the
> process flow level (Jenkinsfile), over the environment level (Dockerfile)
> to the individual action level (runtime functions). This design basically
> gives the community full access over the entire flow.
>
> The jobs that are configured to source only Jenkinsfile. Jenkins supports a
> lot of different ways how to define pipelines, but I have made sure to
> encourage everybody to use only Jenkinsfiles. This makes sure that no
> configuration is done in the web-interface. This firs of all alleviates the
> permission issue since there's literally no config in the web interface and
> second it allows auditing since all changes have to be done in the MXNet
> GitHub repository.
>
> Committers have elevated permissions in Jenkins. These contain the
> permission to run, stop and configure jobs. All other permissions are
> restricted to system administrators for the sake of ensuring stability of
> the system. On the dev-CI on the other hand, we're happy to add people so
> they can experiment as much as they want. The transition to prod-CI is then
> assisted by me to ensure smooth operations and adhering to the best
> practices (like using our Jenkinsfiles and Docker structure, for example).
>
> The only case where somebody would need elevated permissions is if they
> would like to change system settings. But at that point, we're talking
> about instance settings and AWS account configuration. Since that now
> reaches into the next permission level, which is restricted to the donor of
> the CI system - Amazon Web Services - this is something that not even PMC
> members will receive. The same policy is in place for the official Apache
> CI: Committers/PMCs can configure their job, but don't have system level
> access to either Jenkins or the underlying AWS account for obvious reasons.
> We're trying to stay in line with the same policy, but in the past I've
> granted Jenkins administrator access to people who required elevated access
> to properly do t

Re: CI and PRs

2019-08-23 Thread Pedro Larroy
As Marco has open sourced the bulk of the CI infrastructure donated from
Amazon to the community, I would like to raise the recommendation that the
community takes action to help volunteers working on the CI have a better
experience. My impression is that, in the past, there hasn't been much
action on granting PMC or committer privileges to engineers other than
Marco who volunteered to help with CI. Doing so would encourage more
contributions and help expedite critical fixes and corrective actions. I
think this has not properly enabled those individuals to be as effective
as they could have been, and it has meant a lack of recognition for such
a critical activity. I'm not sure about the cause, but I believe this is
something that should be rectified for future contributions and help on
the CI front if improvements are desired.

In Spanish we have a saying: "es de bien nacido ser agradecido" (roughly: being grateful is the mark of being well raised).

Pedro.

On Fri, Aug 16, 2019 at 4:03 PM Pedro Larroy 
wrote:

> Hi Aaron. This is difficult to diagnose, because I don't know what to do
> when the hash of the layer in docker doesn't match and decides to rebuild
> it. the r script seems not to have changed. I have observed this in the
> past and I think is due to bugs in docker.   Maybe Kellen is able to give
> some tips here.
>
> In this case you should use -R which is already in master. (you can always
> copy the script on top if you are in an older revision).
>
> Another thing that worked for me in the past was to completely nuke the
> docker cache, so it redownloads from the CI repo. After that it worked fine
> in some cases.
>
> These two workarounds are not ideal, but should unblock you.
>
> Pedro.
>
> On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham 
> wrote:
>
>> Is -R already in there?
>>
>> Here's an example of it happening to me right now I am making
>> minor changes to the runtime_functions logic for handling the R docs
>> output. I pull the fix, then run the container, but I see the R deps
>> layer re-running. I didn't touch that. Why it that running again?
>>
>> From https://github.com/aaronmarkham/incubator-mxnet
>>f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
>> origin/new_website_pipeline_2_aaron_rdocs
>> Updating f71cc6d..deec6aa
>> Fast-forward
>>  ci/docker/runtime_functions.sh | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
>> --docker-registry mxnetci --platform ubuntu_cpu_r
>> --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
>> build_r_docs
>> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build tool.
>> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
>> enabled from registry mxnetci
>> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
>> mxnetci/build.ubuntu_cpu_r from mxnetci
>> Using default tag: latest
>> latest: Pulling from mxnetci/build.ubuntu_cpu_r
>> Digest:
>> sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
>> Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
>> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker cache
>> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
>> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
>> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker build
>> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
>> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
>> mxnetci/build.ubuntu_cpu_r docker'
>> Sending build context to Docker daemon  289.8kB
>> Step 1/15 : FROM ubuntu:16.04
>>  ---> 5e13f8dd4c1a
>> Step 2/15 : WORKDIR /work/deps
>>  ---> Using cache
>>  ---> afc2a135945d
>> Step 3/15 : COPY install/ubuntu_core.sh /work/
>>  ---> Using cache
>>  ---> da2b2e7f35e1
>> Step 4/15 : RUN /work/ubuntu_core.sh
>>  ---> Using cache
>>  ---> d1e88b26b1d2
>> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
>>  ---> Using cache
>>  ---> 3aa97dea3b7b
>> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
>>  ---> Using cache
>>  ---> bec503f1d149
>> Step 7/15 : COPY install/ubuntu_r.sh /work/
>>  ---> c5e77c38031d
>> Step 8/15 : COPY install/r.gpg /work/
>>  ---> d8cdbf015d2b
>> Step 9/15 : RUN /work/ubuntu_r.sh
>>  ---> Running in c6c90b9e1538
>> ++ dirname /work/ubuntu_r.sh
>> + cd /work
>> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
>> + apt-key add r.gpg
>> OK
>> + add-apt-repository 'deb [arch=amd64,i386]
>> https://cran.rstudio.com/bin/linux/ubuntu xeni

Re: CI and PRs

2019-08-16 Thread Pedro Larroy
Hi Aaron. This is difficult to diagnose, because I don't know what to do
when the hash of the layer in Docker doesn't match and it decides to
rebuild it; the R script seems not to have changed. I have observed this
in the past and I think it is due to bugs in Docker. Maybe Kellen is able
to give some tips here.

In this case you should use -R which is already in master. (you can always
copy the script on top if you are in an older revision).

Another thing that worked for me in the past was to completely nuke the
Docker cache, so it redownloads from the CI repo. After that it worked fine
in some cases.

These two workarounds are not ideal, but should unblock you.

Pedro.

On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham 
wrote:

> Is -R already in there?
>
> Here's an example of it happening to me right now I am making
> minor changes to the runtime_functions logic for handling the R docs
> output. I pull the fix, then run the container, but I see the R deps
> layer re-running. I didn't touch that. Why it that running again?
>
> From https://github.com/aaronmarkham/incubator-mxnet
>f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
> origin/new_website_pipeline_2_aaron_rdocs
> Updating f71cc6d..deec6aa
> Fast-forward
>  ci/docker/runtime_functions.sh | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
> --docker-registry mxnetci --platform ubuntu_cpu_r
> --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
> build_r_docs
> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build tool.
> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
> enabled from registry mxnetci
> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
> mxnetci/build.ubuntu_cpu_r from mxnetci
> Using default tag: latest
> latest: Pulling from mxnetci/build.ubuntu_cpu_r
> Digest:
> sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
> Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker cache
> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker build
> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
> mxnetci/build.ubuntu_cpu_r docker'
> Sending build context to Docker daemon  289.8kB
> Step 1/15 : FROM ubuntu:16.04
>  ---> 5e13f8dd4c1a
> Step 2/15 : WORKDIR /work/deps
>  ---> Using cache
>  ---> afc2a135945d
> Step 3/15 : COPY install/ubuntu_core.sh /work/
>  ---> Using cache
>  ---> da2b2e7f35e1
> Step 4/15 : RUN /work/ubuntu_core.sh
>  ---> Using cache
>  ---> d1e88b26b1d2
> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
>  ---> Using cache
>  ---> 3aa97dea3b7b
> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
>  ---> Using cache
>  ---> bec503f1d149
> Step 7/15 : COPY install/ubuntu_r.sh /work/
>  ---> c5e77c38031d
> Step 8/15 : COPY install/r.gpg /work/
>  ---> d8cdbf015d2b
> Step 9/15 : RUN /work/ubuntu_r.sh
>  ---> Running in c6c90b9e1538
> ++ dirname /work/ubuntu_r.sh
> + cd /work
> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> + apt-key add r.gpg
> OK
> + add-apt-repository 'deb [arch=amd64,i386]
> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
> + apt-get update
> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
>
> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
>  wrote:
> >
> > Also, I forgot, another workaround is that I added the -R flag to the
> build
> > logic (build.py) so the container is not rebuilt for manual use.
> >
> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > >
> > > Hi Aaron.
> > >
> > > As Marco explained, if you are in master the cache usually works,
> there's
> > > two issues that I have observed:
> > >
> > > 1 - Docker doesn't automatically pull the base image (ex.
> ubuntu:16.04) so
> > > if your cached base which is used in the FROM statement becomes
> outdated
> > > your caching won't work. (Using docker pull ubuntu:16.04) or the base
> > > images from the container helps with this.
> > >
> > > 2 - There's another situation where the above doesn't help which seems
> to
> > > be an unidentified issue with the docker cache:
> > > https://github.com/docker/docker.github.io/issues/8886
> > >
> > &g

Re: CI and PRs

2019-08-16 Thread Pedro Larroy
Also, I forgot, another workaround is that I added the -R flag to the build
logic (build.py) so the container is not rebuilt for manual use.

On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy 
wrote:

>
> Hi Aaron.
>
> As Marco explained, if you are in master the cache usually works, there's
> two issues that I have observed:
>
> 1 - Docker doesn't automatically pull the base image (ex. ubuntu:16.04) so
> if your cached base which is used in the FROM statement becomes outdated
> your caching won't work. (Using docker pull ubuntu:16.04) or the base
> images from the container helps with this.
>
> 2 - There's another situation where the above doesn't help which seems to
> be an unidentified issue with the docker cache:
> https://github.com/docker/docker.github.io/issues/8886
>
> We can get a short term workaround for #1 by explicitly pulling bases from
> the script, but I think docker should do it when using --cache-from so
> maybe contributing a patch to Docker would be the best approach.
>
> Pedro
>
> On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham 
> wrote:
>
>> When you create a new Dockerfile and use that on CI, it doesn't seem
>> to cache some of the steps... like this:
>>
>> Step 13/15 : RUN /work/ubuntu_docs.sh
>>  ---> Running in a1e522f3283b
>>  [91m+ echo 'Installing dependencies...'
>> + apt-get update
>>  [0mInstalling dependencies.
>>
>> Or this
>>
>> Step 4/13 : RUN /work/ubuntu_core.sh
>>  ---> Running in e7882d7aa750
>>  [91m+ apt-get update
>>
>> I get if I was changing those scripts, but then I'd think it should
>> cache after running it once... but, no.
>>
>>
>> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu 
>> wrote:
>> >
>> > Do I understand it correctly that you are saying that the Docker cache
>> > doesn't work properly and regularly reinstalls dependencies? Or do you
>> mean
>> > that you only have cache misses when you modify the dependencies - which
>> > would be expected?
>> >
>> > -Marco
>> >
>> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
>> aaron.s.mark...@gmail.com>
>> > wrote:
>> >
>> > > Many of the CI pipelines follow this pattern:
>> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
>> > > repeat steps 1-3 over and over?
>> > >
>> > > Now, some tests use a stashed binary and docker cache. And I see this
>> work
>> > > locally, but for the most part, on CI, you're gonna sit through a
>> > > dependency install.
>> > >
>> > > I noticed that almost all jobs use an ubuntu setup that is fully
>> loaded.
>> > > Without cache, it can take 10 or more minutes to build.  So I made a
>> lite
>> > > version. Takes only a few minutes instead.
>> > >
>> > > In some cases archiving worked great to share across pipelines, but as
>> > > Marco mentioned we need a storage solution to make that happen. We
>> can't
>> > > archive every intermediate artifact for each PR.
>> > >
>> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi Aaron. Why speeds things up? What's the difference?
>> > > >
>> > > > Pedro.
>> > > >
>> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
>> aaron.s.mark...@gmail.com
>> > > >
>> > > > wrote:
>> > > >
>> > > > > The PRs Thomas and I are working on for the new docs and website
>> share
>> > > > the
>> > > > > mxnet binary in the new CI pipelines we made. Speeds things up a
>> lot.
>> > > > >
>> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier 
>> > > wrote:
>> > > > >
>> > > > > > I see it done daily now, and while I can’t share all the
>> details,
>> > > it’s
>> > > > > not
>> > > > > > an incredibly complex thing, and involves not much more than
>> nfs/efs
>> > > > > > sharing and remote ssh commands.  All it takes is a little
>> ingenuity
>> > > > and
>> > > > > > some imagination.
>> > > > > >
>> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
>> > > > > pedro.larroy.li...@gmail.com
>> > > > > > >
>&g

Re: CI and PRs

2019-08-16 Thread Pedro Larroy
Hi Aaron.

As Marco explained, if you are in master the cache usually works. There
are two issues that I have observed:

1 - Docker doesn't automatically pull the base image (e.g. ubuntu:16.04),
so if your cached base, which is used in the FROM statement, becomes
outdated, your caching won't work. Running docker pull ubuntu:16.04, or
pulling the base images used by the container, helps with this.

2 - There's another situation where the above doesn't help which seems to
be an unidentified issue with the docker cache:
https://github.com/docker/docker.github.io/issues/8886

We can get a short-term workaround for #1 by explicitly pulling the base
images from the script, but I think Docker should do it when using
--cache-from, so maybe contributing a patch to Docker would be the best
approach.
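
To make the idea concrete, here is a rough sketch of how a Python build
driver could pull the bases before building with --cache-from. This is
purely illustrative: it is not the actual build.py code, and the function
names are made up.

import re
import subprocess

def pull_base_images(dockerfile_path):
    # Pull every image referenced in a FROM line of the Dockerfile so the
    # local copy of the base image cannot go stale behind the registry.
    with open(dockerfile_path) as f:
        for line in f:
            match = re.match(r"^\s*FROM\s+(\S+)", line)
            if match:
                subprocess.check_call(["docker", "pull", match.group(1)])

def build_with_cache(dockerfile_path, image, context_dir):
    # Build the container, reusing layers from the previously pulled CI image.
    pull_base_images(dockerfile_path)
    subprocess.check_call([
        "docker", "build",
        "-f", dockerfile_path,
        "--cache-from", image,
        "-t", image,
        context_dir,
    ])

# Example usage mirroring the invocation from Aaron's log above:
# build_with_cache("docker/Dockerfile.build.ubuntu_cpu_r",
#                  "mxnetci/build.ubuntu_cpu_r", "docker")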

Pedro

On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham 
wrote:

> When you create a new Dockerfile and use that on CI, it doesn't seem
> to cache some of the steps... like this:
>
> Step 13/15 : RUN /work/ubuntu_docs.sh
>  ---> Running in a1e522f3283b
>  [91m+ echo 'Installing dependencies...'
> + apt-get update
>  [0mInstalling dependencies.
>
> Or this
>
> Step 4/13 : RUN /work/ubuntu_core.sh
>  ---> Running in e7882d7aa750
>  [91m+ apt-get update
>
> I get if I was changing those scripts, but then I'd think it should
> cache after running it once... but, no.
>
>
> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu 
> wrote:
> >
> > Do I understand it correctly that you are saying that the Docker cache
> > doesn't work properly and regularly reinstalls dependencies? Or do you
> mean
> > that you only have cache misses when you modify the dependencies - which
> > would be expected?
> >
> > -Marco
> >
> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> aaron.s.mark...@gmail.com>
> > wrote:
> >
> > > Many of the CI pipelines follow this pattern:
> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> > > repeat steps 1-3 over and over?
> > >
> > > Now, some tests use a stashed binary and docker cache. And I see this
> work
> > > locally, but for the most part, on CI, you're gonna sit through a
> > > dependency install.
> > >
> > > I noticed that almost all jobs use an ubuntu setup that is fully
> loaded.
> > > Without cache, it can take 10 or more minutes to build.  So I made a
> lite
> > > version. Takes only a few minutes instead.
> > >
> > > In some cases archiving worked great to share across pipelines, but as
> > > Marco mentioned we need a storage solution to make that happen. We
> can't
> > > archive every intermediate artifact for each PR.
> > >
> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy  >
> > > wrote:
> > >
> > > > Hi Aaron. Why speeds things up? What's the difference?
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> aaron.s.mark...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > The PRs Thomas and I are working on for the new docs and website
> share
> > > > the
> > > > > mxnet binary in the new CI pipelines we made. Speeds things up a
> lot.
> > > > >
> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier 
> > > wrote:
> > > > >
> > > > > > I see it done daily now, and while I can’t share all the details,
> > > it’s
> > > > > not
> > > > > > an incredibly complex thing, and involves not much more than
> nfs/efs
> > > > > > sharing and remote ssh commands.  All it takes is a little
> ingenuity
> > > > and
> > > > > > some imagination.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > > > pedro.larroy.li...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Sounds good in theory. I think there are complex details with
> > > regards
> > > > > of
> > > > > > > resource sharing during parallel execution. Still I think both
> ways
> > > > can
> > > > > > be
> > > > > > > explored. I think some tests run for unreasonably long times
> for
> > > what
> > > > > > they
> > > > > > > are doing. We already scale parts of the pipeline horizontally
> > > across
> > > > > > > workers.
> > &g

Re: MXNet CI repository

2019-08-15 Thread Pedro Larroy
Nice.

On Thu, Aug 15, 2019 at 12:47 PM Marco de Abreu 
wrote:

> Repository has been created: https://github.com/apache/incubator-mxnet-ci
>
> I will fill it soon.
>
> -Marco
>
> On Thu, Aug 15, 2019 at 8:43 PM Carin Meier  wrote:
>
> > +1
> >
> > On Thu, Aug 15, 2019 at 2:37 PM Chaitanya Bapat 
> > wrote:
> >
> > > +1
> > > LGTM!
> > >
> > > On Thu, 15 Aug 2019 at 11:01, Marco de Abreu 
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I'd like to propose a repository where CI infrastructure code can be
> > > > stored. I'd propose "incubator-mxnet-ci". Is everybody fine with that
> > > name
> > > > or has a better idea?
> > > >
> > > > Best regards
> > > > Marco
> > > >
> > >
> > >
> > > --
> > > *Chaitanya Prakash Bapat*
> > > *+1 (973) 953-6299*
> > >
> > > [image: https://www.linkedin.com//in/chaibapat25]
> > > [image:
> > https://www.facebook.com/chaibapat
> > > ]
> > > [image:
> > > https://twitter.com/ChaiBapchya]  > >[image:
> > > https://www.linkedin.com//in/chaibapat25]
> > > 
> > >
> >
>


Re: CI and PRs

2019-08-15 Thread Pedro Larroy
Hi Chris.
I suggest you send a PR to illustrate your proposal so we have a concrete
example to look into.
Pedro.

On Wed, Aug 14, 2019 at 6:16 PM Chris Olivier  wrote:

> I see it done daily now, and while I can’t share all the details, it’s not
> an incredibly complex thing, and involves not much more than nfs/efs
> sharing and remote ssh commands.  All it takes is a little ingenuity and
> some imagination.
>
> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy  >
> wrote:
>
> > Sounds good in theory. I think there are complex details with regards of
> > resource sharing during parallel execution. Still I think both ways can
> be
> > explored. I think some tests run for unreasonably long times for what
> they
> > are doing. We already scale parts of the pipeline horizontally across
> > workers.
> >
> >
> > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier 
> > wrote:
> >
> > > +1
> > >
> > > Rather than remove tests (which doesn’t scale as a solution), why not
> > scale
> > > them horizontally so that they finish more quickly? Across processes or
> > > even on a pool of machines that aren’t necessarily the build machine?
> > >
> > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> marco.g.ab...@gmail.com
> > >
> > > wrote:
> > >
> > > > With regards to time I rather prefer us spending a bit more time on
> > > > maintenance than somebody running into an error that could've been
> > caught
> > > > with a test.
> > > >
> > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> quite
> > > > some time now, but nobody noticed that. Basically my stance on that
> > > matter
> > > > is that as soon as something is not blocking, you can also just
> > > deactivate
> > > > it since you don't have a forcing function in an open source project.
> > > > People will rarely come back and fix the errors of some nightly test
> > that
> > > > they introduced.
> > > >
> > > > -Marco
> > > >
> > > > Carin Meier  schrieb am Mi., 14. Aug. 2019,
> > 21:59:
> > > >
> > > > > If a language binding test is failing for a not important reason,
> > then
> > > it
> > > > > is too brittle and needs to be fixed (we have fixed some of these
> > with
> > > > the
> > > > > Clojure package [1]).
> > > > > But in general, if we thinking of the MXNet project as one project
> > that
> > > > is
> > > > > across all the language bindings, then we want to know if some
> > > > fundamental
> > > > > code change is going to break a downstream package.
> > > > > I can't speak for all the high level package binding maintainers,
> but
> > > I'm
> > > > > always happy to pitch in to provide code fixes to help the base PR
> > get
> > > > > green.
> > > > >
> > > > > The time costs to maintain such a large CI project obviously needs
> to
> > > be
> > > > > considered as well.
> > > > >
> > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > >
> > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > reasonable.
> > > > > > The only question is that when a binding such as R, Perl or
> Clojure
> > > > > fails,
> > > > > > some devs are a bit confused about how to fix them since they are
> > not
> > > > > > familiar with the testing tools and the language.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> carinme...@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Great idea Marco! Anything that you think would be valuable to
> > > share
> > > > > > would
> > > > > > > be good. The duration of each node in the test stage sounds
> like
> > a
> > > > good
> > > > > > > start.
> > > > > > >
> > > > > > > - Carin
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 2:48 PM 

Re: CI and PRs

2019-08-15 Thread Pedro Larroy
Hi Aaron. Why does it speed things up? What's the difference?

Pedro.

On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham 
wrote:

> The PRs Thomas and I are working on for the new docs and website share the
> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>
> On Wed, Aug 14, 2019, 18:16 Chris Olivier  wrote:
>
> > I see it done daily now, and while I can’t share all the details, it’s
> not
> > an incredibly complex thing, and involves not much more than nfs/efs
> > sharing and remote ssh commands.  All it takes is a little ingenuity and
> > some imagination.
> >
> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > Sounds good in theory. I think there are complex details with regards
> of
> > > resource sharing during parallel execution. Still I think both ways can
> > be
> > > explored. I think some tests run for unreasonably long times for what
> > they
> > > are doing. We already scale parts of the pipeline horizontally across
> > > workers.
> > >
> > >
> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Rather than remove tests (which doesn’t scale as a solution), why not
> > > scale
> > > > them horizontally so that they finish more quickly? Across processes
> or
> > > > even on a pool of machines that aren’t necessarily the build machine?
> > > >
> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > marco.g.ab...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > With regards to time I rather prefer us spending a bit more time on
> > > > > maintenance than somebody running into an error that could've been
> > > caught
> > > > > with a test.
> > > > >
> > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> > quite
> > > > > some time now, but nobody noticed that. Basically my stance on that
> > > > matter
> > > > > is that as soon as something is not blocking, you can also just
> > > > deactivate
> > > > > it since you don't have a forcing function in an open source
> project.
> > > > > People will rarely come back and fix the errors of some nightly
> test
> > > that
> > > > > they introduced.
> > > > >
> > > > > -Marco
> > > > >
> > > > > Carin Meier  schrieb am Mi., 14. Aug. 2019,
> > > 21:59:
> > > > >
> > > > > > If a language binding test is failing for a not important reason,
> > > then
> > > > it
> > > > > > is too brittle and needs to be fixed (we have fixed some of these
> > > with
> > > > > the
> > > > > > Clojure package [1]).
> > > > > > But in general, if we thinking of the MXNet project as one
> project
> > > that
> > > > > is
> > > > > > across all the language bindings, then we want to know if some
> > > > > fundamental
> > > > > > code change is going to break a downstream package.
> > > > > > I can't speak for all the high level package binding maintainers,
> > but
> > > > I'm
> > > > > > always happy to pitch in to provide code fixes to help the base
> PR
> > > get
> > > > > > green.
> > > > > >
> > > > > > The time costs to maintain such a large CI project obviously
> needs
> > to
> > > > be
> > > > > > considered as well.
> > > > > >
> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > pedro.larroy.li...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > > reasonable.
> > > > > > > The only question is that when a binding such as R, Perl or
> > Clojure
> > > > > > fails,
> > > > > > > some devs are a bit confused about how to fix them since they
> are
> > > not
> > > > > > > familiar with t

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
Hi Marco.

I have to agree with you on that, from past experience.
What do you suggest for maintenance?  Do we need a watermark that fails the
validation if the total runtime exceeds a high threshold?
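
Something like the following sketch could do it. This is hypothetical and
only meant to make the idea concrete; the threshold value and where the
per-step durations come from would need to be agreed on.

import sys

MAX_TOTAL_MINUTES = 120  # example threshold, to be agreed on

def check_runtime(step_durations_minutes):
    # Fail the job (non-zero exit) if the pipeline exceeded the watermark.
    total = sum(step_durations_minutes)
    if total > MAX_TOTAL_MINUTES:
        print("CI watermark exceeded: {:.1f} min > {} min".format(
            total, MAX_TOTAL_MINUTES))
        sys.exit(1)
    print("Total CI runtime: {:.1f} min".format(total))

if __name__ == "__main__":
    # The durations would come from the per-step metrics we already record.
    check_runtime([float(x) for x in sys.argv[1:]])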

Pedro.

On Wed, Aug 14, 2019 at 1:03 PM Marco de Abreu 
wrote:

> With regards to time I rather prefer us spending a bit more time on
> maintenance than somebody running into an error that could've been caught
> with a test.
>
> I mean, our Publishing pipeline for Scala GPU has been broken for quite
> some time now, but nobody noticed that. Basically my stance on that matter
> is that as soon as something is not blocking, you can also just deactivate
> it since you don't have a forcing function in an open source project.
> People will rarely come back and fix the errors of some nightly test that
> they introduced.
>
> -Marco
>
> Carin Meier  schrieb am Mi., 14. Aug. 2019, 21:59:
>
> > If a language binding test is failing for a not important reason, then it
> > is too brittle and needs to be fixed (we have fixed some of these with
> the
> > Clojure package [1]).
> > But in general, if we thinking of the MXNet project as one project that
> is
> > across all the language bindings, then we want to know if some
> fundamental
> > code change is going to break a downstream package.
> > I can't speak for all the high level package binding maintainers, but I'm
> > always happy to pitch in to provide code fixes to help the base PR get
> > green.
> >
> > The time costs to maintain such a large CI project obviously needs to be
> > considered as well.
> >
> > [1] https://github.com/apache/incubator-mxnet/pull/15579
> >
> > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > From what I have seen Clojure is 15 minutes, which I think is
> reasonable.
> > > The only question is that when a binding such as R, Perl or Clojure
> > fails,
> > > some devs are a bit confused about how to fix them since they are not
> > > familiar with the testing tools and the language.
> > >
> > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier 
> > wrote:
> > >
> > > > Great idea Marco! Anything that you think would be valuable to share
> > > would
> > > > be good. The duration of each node in the test stage sounds like a
> good
> > > > start.
> > > >
> > > > - Carin
> > > >
> > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > marco.g.ab...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > we record a bunch of metrics about run statistics (down to the
> > duration
> > > > of
> > > > > every individual step). If you tell me which ones you're
> particularly
> > > > > interested in (probably total duration of each node in the test
> > stage),
> > > > I'm
> > > > > happy to provide them.
> > > > >
> > > > > Dimensions are (in hierarchical order):
> > > > > - job
> > > > > - branch
> > > > > - stage
> > > > > - node
> > > > > - step
> > > > >
> > > > > Unfortunately I don't have the possibility to export them since we
> > > store
> > > > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > > >
> > > > > Best regards,
> > > > > Marco
> > > > >
> > > > > Carin Meier  schrieb am Mi., 14. Aug. 2019,
> > > 19:43:
> > > > >
> > > > > > I would prefer to keep the language binding in the PR process.
> > > Perhaps
> > > > we
> > > > > > could do some analytics to see how much each of the language
> > bindings
> > > > is
> > > > > > contributing to overall run time.
> > > > > > If we have some metrics on that, maybe we can come up with a
> > > guideline
> > > > of
> > > > > > how much time each should take. Another possibility is leverage
> the
> > > > > > parallel builds more.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > pedro.larroy.li...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Carin.
> > > > > > >
> >

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
From what I have seen Clojure is 15 minutes, which I think is reasonable.
The only question is that when a binding such as R, Perl or Clojure fails,
some devs are a bit confused about how to fix it, since they are not
familiar with the testing tools and the language.

On Wed, Aug 14, 2019 at 11:57 AM Carin Meier  wrote:

> Great idea Marco! Anything that you think would be valuable to share would
> be good. The duration of each node in the test stage sounds like a good
> start.
>
> - Carin
>
> On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu 
> wrote:
>
> > Hi,
> >
> > we record a bunch of metrics about run statistics (down to the duration
> of
> > every individual step). If you tell me which ones you're particularly
> > interested in (probably total duration of each node in the test stage),
> I'm
> > happy to provide them.
> >
> > Dimensions are (in hierarchical order):
> > - job
> > - branch
> > - stage
> > - node
> > - step
> >
> > Unfortunately I don't have the possibility to export them since we store
> > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> >
> > Best regards,
> > Marco
> >
> > Carin Meier  schrieb am Mi., 14. Aug. 2019, 19:43:
> >
> > > I would prefer to keep the language binding in the PR process. Perhaps
> we
> > > could do some analytics to see how much each of the language bindings
> is
> > > contributing to overall run time.
> > > If we have some metrics on that, maybe we can come up with a guideline
> of
> > > how much time each should take. Another possibility is leverage the
> > > parallel builds more.
> > >
> > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > pedro.larroy.li...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi Carin.
> > > >
> > > > That's a good point, all things considered would your preference be
> to
> > > keep
> > > > the Clojure tests as part of the PR process or in Nightly?
> > > > Some options are having notifications here or in slack. But if we
> think
> > > > breakages would go unnoticed maybe is not a good idea to fully remove
> > > > bindings from the PR process and just streamline the process.
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier 
> > > wrote:
> > > >
> > > > > Before any binding tests are moved to nightly, I think we need to
> > > figure
> > > > > out how the community can get proper notifications of failure and
> > > success
> > > > > on those nightly runs. Otherwise, I think that breakages would go
> > > > > unnoticed.
> > > > >
> > > > > -Carin
> > > > >
> > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > Seems we are hitting some problems in CI. I propose the following
> > > > action
> > > > > > items to remedy the situation and accelerate turn around times in
> > CI,
> > > > > > reduce cost, complexity and probability of failure blocking PRs
> and
> > > > > > frustrating developers:
> > > > > >
> > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > > > > > build_windows.py infrastructure should easily work with the new
> > > > version.
> > > > > > Currently some PRs are blocked by this:
> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > * Move non-python bindings tests to nightly. If a commit is
> > touching
> > > > > other
> > > > > > bindings, the reviewer should ask for a full run which can be
> done
> > > > > locally,
> > > > > > use the label bot to trigger a full CI build, or defer to
> nightly.
> > > > > > * Provide a couple of basic sanity performance tests on small
> > models
> > > > that
> > > > > > are run on CI and can be echoed by the label bot as a comment for
> > > PRs.
> &

Re: CI and PRs

2019-08-14 Thread Pedro Larroy
Hi Carin.

That's a good point, all things considered would your preference be to keep
the Clojure tests as part of the PR process or in Nightly?
Some options are having notifications here or in Slack. But if we think
breakages would go unnoticed, maybe it is not a good idea to fully remove
bindings from the PR process, and we should instead just streamline the
process.

Pedro.

On Wed, Aug 14, 2019 at 5:09 AM Carin Meier  wrote:

> Before any binding tests are moved to nightly, I think we need to figure
> out how the community can get proper notifications of failure and success
> on those nightly runs. Otherwise, I think that breakages would go
> unnoticed.
>
> -Carin
>
> On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy  >
> wrote:
>
> > Hi
> >
> > Seems we are hitting some problems in CI. I propose the following action
> > items to remedy the situation and accelerate turn around times in CI,
> > reduce cost, complexity and probability of failure blocking PRs and
> > frustrating developers:
> >
> > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > build_windows.py infrastructure should easily work with the new version.
> > Currently some PRs are blocked by this:
> > https://github.com/apache/incubator-mxnet/issues/13958
> > * Move Gluon Model zoo tests to nightly. Tracked at
> > https://github.com/apache/incubator-mxnet/issues/15295
> > * Move non-python bindings tests to nightly. If a commit is touching
> other
> > bindings, the reviewer should ask for a full run which can be done
> locally,
> > use the label bot to trigger a full CI build, or defer to nightly.
> > * Provide a couple of basic sanity performance tests on small models that
> > are run on CI and can be echoed by the label bot as a comment for PRs.
> > * Address unit tests that take more than 10-20s, streamline them or move
> > them to nightly if it can't be done.
> > * Open sourcing the remaining CI infrastructure scripts so the community
> > can contribute.
> >
> > I think our goal should be turnaround under 30min.
> >
> > I would also like to touch base with the community that some PRs are not
> > being followed up by committers asking for changes. For example this PR
> is
> > important and is hanging for a long time.
> >
> > https://github.com/apache/incubator-mxnet/pull/15051
> >
> > This is another, less important but more trivial to review:
> >
> > https://github.com/apache/incubator-mxnet/pull/14940
> >
> > I think committers requesting changes and not following up in reasonable
> > time is not healthy for the project. I suggest configuring github
> > Notifications for a good SNR and following up.
> >
> > Regards.
> >
> > Pedro.
> >
>


CI and PRs

2019-08-13 Thread Pedro Larroy
Hi

Seems we are hitting some problems in CI. I propose the following action
items to remedy the situation and accelerate turnaround times in CI,
reduce cost, complexity and probability of failure blocking PRs and
frustrating developers:

* Upgrade Windows visual studio from VS 2015 to VS 2017. The
build_windows.py infrastructure should easily work with the new version.
Currently some PRs are blocked by this:
https://github.com/apache/incubator-mxnet/issues/13958
* Move Gluon Model zoo tests to nightly. Tracked at
https://github.com/apache/incubator-mxnet/issues/15295
* Move non-python bindings tests to nightly. If a commit is touching other
bindings, the reviewer should ask for a full run which can be done locally,
use the label bot to trigger a full CI build, or defer to nightly.
* Provide a couple of basic sanity performance tests on small models that
are run on CI and can be echoed by the label bot as a comment for PRs.
* Address unit tests that take more than 10-20s, streamline them or move
them to nightly if it can't be done.
* Open sourcing the remaining CI infrastructure scripts so the community
can contribute.

I think our goal should be turnaround under 30min.

I would also like to touch base with the community that some PRs are not
being followed up on by committers after asking for changes. For example this PR is
important and has been hanging for a long time.

https://github.com/apache/incubator-mxnet/pull/15051

This is another, less important but more trivial to review:

https://github.com/apache/incubator-mxnet/pull/14940

I think committers requesting changes and not following up in a
reasonable time is not healthy for the project. I suggest configuring
GitHub notifications for a good signal-to-noise ratio and following up.

Regards.

Pedro.


Evolving the computational graph

2019-07-23 Thread Pedro Larroy
Hi dev@

I have observed some limitations in MXNet's architecture that it would be
beneficial to address in future releases. For example, during the
calculation of higher-order gradients we would need to access the graph
and shape information from the FGradient function to be able to do some
operations symbolically.

There are also other activities, such as GPU pointwise fusion, which
also need advanced transformations.

I would suggest we collect ideas and requirements in the wiki to have an
overview of the scope and make informed decisions when the time comes to
make these architectural changes.

Maybe Relay solves all of these problems? In any case, it would be good to
have the requirements written down.

Any thoughts on this?


Re: [Discuss] MXNet Python 2 Support Deprecation

2019-07-18 Thread Pedro Larroy
Seems 3.6 is a reasonable choice.

On Thu, Jul 18, 2019 at 2:15 PM Marco de Abreu  wrote:
>
> Looking at EOL is certainly a good idea! I think once we get closer to
> deprecation, we can check adoption statistics to make a well-informed
> decision that gives us the most advantages without dropping the ball on a
> majority of users (or supporting a branch that is going EOL soon). A survey
> from 2018 [1] determined the following distribution:
> 3.5: 11%
> 3.6: 54%
> 3.7: 30%
>
> Deprecation for 3.5 is scheduled for 2020-09-13 [2]. Deprecation for 3.6 is
> scheduled for 2021-12-23 [2].Deprecation for 3.7 is scheduled
> for 2023-06-27 [2].
>
> Following the trend, I'd say that it would be a decision between Python 3.6
> and 3.7. Later on, I'd propose to check recent surveys and also have a
> separate thread to determine if there's anything we're missing (e.g. a big
> company being unable to use Python 3.7). What do you think?
>
> Best regards,
> Marco
>
> [1]: https://www.jetbrains.com/research/python-developers-survey-2018/
> [2]: https://devguide.python.org/#status-of-python-branches
>
> On Thu, Jul 18, 2019 at 9:42 PM Yuan Tang  wrote:
>
> > I would suggest supporting Python 3.5+ since the earlier versions have
> > reached end-of-life status:
> > https://devguide.python.org/devcycle/#end-of-life-branches
> >
> > On Thu, Jul 18, 2019 at 3:36 PM Pedro Larroy  > >
> > wrote:
> >
> > > +1
> > >
> > > This would simplify CI, reduce costs and more. I think a followup
> > > question is what would be the minimum Python 3 version supported?
> > > Depending on that we might be able to use type annotations for example
> > > or other features.
> > >
> > > Pedro.
> > >
> > > On Thu, Jul 18, 2019 at 12:07 PM Yuan Tang 
> > > wrote:
> > > >
> > > > +1
> > > >
> > > > On Thu, Jul 18, 2019 at 2:51 PM Yuxi Hu  wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Thu, Jul 18, 2019 at 11:31 AM Tong He 
> > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Tong He
> > > > > >
> > > > > >
> > > > > > Jake Lee  于2019年7月18日周四 上午11:29写道:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > On Thu, Jul 18, 2019 at 11:27 AM Junru Shao <
> > > junrushao1...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > On Thu, Jul 18, 2019 at 11:12 AM Anirudh Acharya <
> > > > > > anirudhk...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > On Thu, Jul 18, 2019 at 11:03 AM Marco de Abreu <
> > > > > > > marco.g.ab...@gmail.com
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1
> > > > > > > > > >
> > > > > > > > > > -Marco
> > > > > > > > > >
> > > > > > > > > > Sheng Zha  schrieb am Do., 18. Juli
> > > 2019,
> > > > > > > 19:59:
> > > > > > > > > >
> > > > > > > > > > > Dear MXNet community,
> > > > > > > > > > >
> > > > > > > > > > > I'd like to reopen the discussion on deprecating python2
> > > > > support.
> > > > > > > > This
> > > > > > > > > > > would help modernize the design and engineering practice
> > in
> > > > > MXNet
> > > > > > > to
> > > > > > > > > help
> > > > > > > > > > > improve speed and quality.
> > > > > > > > > > >
> > > > > > > > > > > For this purpose, I reopened the issue on this here:
> > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/8703
> > > > > > > > > > >
> > > > > > > > > > > If the consensus is towards the direction of dropping
> > > python2
> > > > > > > > support,
> > > > > > > > > I
> > > > > > > > > > > suggest we announce our plan to drop python2 support in
> > the
> > > > > next
> > > > > > > > > release,
> > > > > > > > > > > and actually drop the support in the next major version.
> > > > > Thanks.
> > > > > > > > > > >
> > > > > > > > > > > -sz
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Yuxi(Darren) Hu, Ph.D.
> > > > > Software Development Engineer
> > > > > Amazon Web Services
> > > > >
> > >
> >


Re: [Discuss] MXNet Python 2 Support Deprecation

2019-07-18 Thread Pedro Larroy
+1

This would simplify CI, reduce costs, and more. I think a follow-up
question is what the minimum supported Python 3 version would be.
Depending on that we might be able to use type annotations, for example,
or other features.

Pedro.
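
For context, a minimal sketch of what a 3.5/3.6 minimum would unlock:
PEP 484 function annotations (3.5+) and annotated locals (3.6+). The
helper names below are made up for illustration, not taken from the
MXNet codebase.

# Illustrative only: hypothetical helpers showing the type annotations
# that dropping Python 2 would allow. Function annotations need 3.5+,
# the annotated local variable needs 3.6+.
from typing import List, Optional, Tuple

def pad_shape(shape: Tuple[int, ...], ndim: int, value: int = 1) -> Tuple[int, ...]:
    """Left-pad a shape tuple with `value` until it has `ndim` entries."""
    missing: int = ndim - len(shape)  # annotated local, Python 3.6+
    if missing < 0:
        raise ValueError("shape already has more dimensions than ndim")
    return (value,) * missing + shape

def batch_shapes(shapes: List[Tuple[int, ...]],
                 ndim: Optional[int] = None) -> List[Tuple[int, ...]]:
    target = ndim if ndim is not None else max(len(s) for s in shapes)
    return [pad_shape(s, target) for s in shapes]

print(batch_shapes([(3,), (2, 3)]))  # -> [(1, 3), (2, 3)]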

On Thu, Jul 18, 2019 at 12:07 PM Yuan Tang  wrote:
>
> +1
>
> On Thu, Jul 18, 2019 at 2:51 PM Yuxi Hu  wrote:
>
> > +1
> >
> > On Thu, Jul 18, 2019 at 11:31 AM Tong He  wrote:
> >
> > > +1
> > >
> > > Best regards,
> > >
> > > Tong He
> > >
> > >
> > > Jake Lee  于2019年7月18日周四 上午11:29写道:
> > >
> > > > +1
> > > >
> > > > On Thu, Jul 18, 2019 at 11:27 AM Junru Shao 
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Thu, Jul 18, 2019 at 11:12 AM Anirudh Acharya <
> > > anirudhk...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > On Thu, Jul 18, 2019 at 11:03 AM Marco de Abreu <
> > > > marco.g.ab...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > -Marco
> > > > > > >
> > > > > > > Sheng Zha  schrieb am Do., 18. Juli 2019,
> > > > 19:59:
> > > > > > >
> > > > > > > > Dear MXNet community,
> > > > > > > >
> > > > > > > > I'd like to reopen the discussion on deprecating python2
> > support.
> > > > > This
> > > > > > > > would help modernize the design and engineering practice in
> > MXNet
> > > > to
> > > > > > help
> > > > > > > > improve speed and quality.
> > > > > > > >
> > > > > > > > For this purpose, I reopened the issue on this here:
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/8703
> > > > > > > >
> > > > > > > > If the consensus is towards the direction of dropping python2
> > > > > support,
> > > > > > I
> > > > > > > > suggest we announce our plan to drop python2 support in the
> > next
> > > > > > release,
> > > > > > > > and actually drop the support in the next major version.
> > Thanks.
> > > > > > > >
> > > > > > > > -sz
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Yuxi(Darren) Hu, Ph.D.
> > Software Development Engineer
> > Amazon Web Services
> >


Re: [DISCUSS] MXNet 1.6.0 Roadmap

2019-07-18 Thread Pedro Larroy
Are we using Jira or some other tool (Trello?) for planning? I think
getting more visibility on some of the major ongoing activities
would help rally contributions around them. If they link to the design
document and group PRs in a single place (I think Jira or Trello can
do that), it would help streamline the process and contributions.

Pedro.

On Thu, Jul 18, 2019 at 10:54 AM Sheng Zha  wrote:
>
> Hi,
>
> While 1.5.0 vote on general@incubator is still on going, I’d like to propose 
> that we start planning for 1.6.0. To this end, I started a discussion on the 
> roadmap on GitHub https://github.com/apache/incubator-mxnet/issues/15589.
>
> If no objection, we will conclude the discussion in two weeks (Aug 1st), and 
> start release plan by then. Thanks.
>
> -sz


Re: warnings as errors

2019-07-18 Thread Pedro Larroy
Hi.

These two PRs fix warnings on different platforms and remove some code
bloat due to inlines. Could you help get them in? They have been open
for a while.

https://github.com/apache/incubator-mxnet/pull/15270
https://github.com/apache/incubator-mxnet/pull/14940

Thanks.

Pedro.

On Wed, May 22, 2019 at 1:50 PM Pedro Larroy
 wrote:
>
> I was not able to fix the warnings on the mshadow type switch with unused
> local typedefs; that's one example of a warning that I would disable. I
> couldn't find a way to solve that one, and I think the ramifications of
> an unused typedef are unlikely to cause bugs in the code and are
> more of a pedantic nature.
>
> https://github.com/apache/incubator-mxnet/pull/13424
>
> I think turning them on one by one is going to pollute the compilation
> output unnecessarily and may even run into command-line length problems. I
> think it is best to enable all warnings and errors and cherry-pick the
> ones we can't fix or won't fix on purpose.
>
> In this other case, I managed to tighten the warnings but ASAN is
> giving some problems:
>
> https://github.com/apache/incubator-mxnet/pull/14850
>
> I think having warning fixes reviewed and merged faster, without
> triggering additional refactorings, could make this process easier;
> also, some help and contributions in this area would be greatly
> appreciated.
>
> Pedro.
>
> On Tue, May 21, 2019 at 3:49 PM Sheng Zha  wrote:
> >
> > It would be great to enforce the check for warnings and treat as errors. 
> > Some questions I have:
> > - what are the warnings that you think should be ignored?
> > - for the rest of the warning types, can we turn them on one by one?
> >
> > -sz
> >
> > On 2019/05/21 22:33:51, Pedro Larroy  wrote:
> > > Hi dev@
> > >
> > > I try to fix any warning that I see during compilation of MXNet on my
> > > platform and with the build toggles that I care about. These seemingly
> > > trivial and thankless efforts nonetheless take energy on the
> > > contributor side.
> > >
> > > I think overall I have submitted more than a dozen PRs fixing
> > > warnings myself, and I would like to call for additional help and
> > > contributions in this area.
> > >
> > > There was a question from Lin about discussing this on the mailing
> > > list, I have the feeling that everybody agrees on moving towards zero
> > > warnings and warnings as errors. I think there are unavoidable
> > > warnings that can be disabled specifically such as the one triggered
> > > by mshadow type switch.
> > >
> > > Some important missing warnings such as warning on missing return
> > > values (ie. forgetting to return on a function returning non-void)
> > > cause bugs, danger and additional time spent bugfixing, which can be
> > > better spent somewhere else.
> > >
> > > Is there a process that we can figure out such as a more expedited
> > > merges of PRs fixing warnings or a specific label?
> > >
> > > Some simple PRs that fixes a warning can take long to merge, and
> > > sometimes trigger too much discussion and make the progress a bit
> > > unfriendly to contributors.
> > >
> > > Any help or constructive ideas on this topic would be appreciated.
> > >
> > > Pedro.
> > >


Re: [DISCUSS] Make MXNet deploy it's own distribution

2019-07-03 Thread Pedro Larroy
Nice!  +1 to this approach, it seems well thought out. Thanks for including
Android and linux-arm.  Do Android and linux-arm use different
classifiers?

On Wed, Jul 3, 2019 at 6:46 AM Chris Olivier  wrote:
>
> Will this be another repo under Apache repo? Is tensorflow java package in
> a separate repo?
>
> On Wed, Jul 3, 2019 at 12:46 AM Per da Silva  wrote:
>
> > Hi,
> >
> > We've started working on something along these lines as part of the CD
> > pipeline framework. The idea is to compile and test the libmxnet.so  (both
> > statically and dynamically linked) for the different variants (cpu, gpu,
> > mkl, etc.) then have the different mxnet frontends (python, Julia, scala,
> > etc) just wrap around the library.
> >
> > I've been on long term sick leave and haven't been able to move forward
> > with this, although I have an open PR that kicks off this work:
> > https://github.com/apache/incubator-mxnet/pull/15051 - I welcome everyone
> > to take a look. It's the first of a series of PRs to automate the
> > distribution of the python (pip and docker) packages. Instead of using
> > maven, we have opted to use S3. But this decision can be revisited.
> >
> > We also want to distribute what we termed "runtime" docker images. Docker
> > images containing the dynamically linked mxnet library and all of the
> > runtime dependencies (examples: https://hub.docker.com/r/mxnet/runtime).
> > This could facilitate the packaging and distribution of docker images for
> > the different frontends.
> >
> > Cheers,
> >
> > Per
> >
> > On Wed., 3 Jul. 2019, 8:47 am Qing Lan,  wrote:
> >
> > > In that case, the answer is yes. The Scala package will be published in
> > > one version with a variety of backend package choices. Users can easily
> > > attach and detach different MXNet versions. However, the Scala package
> > > cannot run without a backend.
> > >
> > > Another key advantage of this design will be a broader support on
> > > different implementations such as Java Cpp. User will be able to
> > implement
> > > their customized MXNet frontend to use the native library.
> > >
> > > Thanks,
> > > Qing
> > >
> > > 
> > > From: Sheng Zha 
> > > Sent: Tuesday, July 2, 2019 22:14
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: [DISCUSS] Make MXNet deploy it's own distribution
> > >
> > > Does it mean that the scala binding of mxnet will be an independent
> > > package that doesn’t directly depend on the native package, and user
> > > projects need to declare dependency on both the scala binding and one of
> > > the native packages?
> > >
> > > -sz
> > >
> > > > On Jul 2, 2019, at 5:50 PM, Frank Liu  wrote:
> > > >
> > > > Currently, MXNet were built along with different language bindings such
> > > as
> > > > Scala.
> > > >
> > > > The libmxnet.so files are bundled within scala jar package.
> > > >
> > > > It would be nice to distribute libmxnet.so library independently in
> > > maven,
> > > > and scala package can choose which native library to use.
> > > >
> > > > Here is the design document on cwiki:
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/MXNET/Make+MXNet+deploy+it%27s+own+distribution
> > > >
> > > > Thanks,
> > > >
> > > > Frank
> > >
> >


Re: FW: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-28 Thread Pedro Larroy
Thanks Manu, the warm-up is important; also, the first run downloads
a bunch of data, which will affect the measurement. That's a good idea.

How can I find which commit corresponds to a pip build myself?

Pedro.
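
One way that may work, assuming the pre-built wheel bundles a COMMIT_HASH
file next to the mxnet python package (the file that the "Hashtag not
found. Not installed from pre-built package." line in the diagnose output
refers to); treat this as a sketch, not a verified recipe:

# Sketch: read the commit hash a pre-built mxnet pip package was built
# from, assuming the wheel ships a COMMIT_HASH file alongside the package.
import os
import mxnet

pkg_dir = os.path.dirname(mxnet.__file__)
commit_file = os.path.join(pkg_dir, "COMMIT_HASH")
if os.path.exists(commit_file):
    with open(commit_file) as f:
        print("built from commit:", f.read().strip())
else:
    print("no COMMIT_HASH file; probably not a pre-built package")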

On Fri, Jun 28, 2019 at 4:48 PM Manu Seth  wrote:
>
> I ran the same cifar10.py script as Pedro, but for 20 epochs. Considering
> the first 10 epochs for warm-up, I averaged time per epoch for the last 10
> epochs.
>
> With MXNet 1.4.1 average time is 164.23 s
> With MXNet 1.5.0 average time is 174.59 s (~6.3% regression)
>
>
> For a second data point, I ran Gluon speed test benchmark script -
> https://github.com/apache/incubator-mxnet/blob/master/benchmark/python/gluon/benchmark_gluon.py
> using the following command:
> python3 benchmark_gluon.py --model 'resnet152_v2' --batch-size 128
> --num-batches 200 --type 'training'
>
> I got the following speeds:
> With MXNet 1.4.1, average speed is 25.677534 img/s
> With MXNet 1.5.0, average speed is 25.082130 img/s (~2.3% regression)
>
> Note:
> For 1.4.1 version, I used pip install mxnet-mkl==1.4.1
> For 1.5.0 version, I used pip install mxnet-mkl==1.5.0b20190619 which
> corresponds to commit# ccbbf6b4b76ea536a6583c99497c83b65a20817b which is
> behind 1.5.x branch by 4 commits
>
>
> Best,
> Manu
>
>
> On 6/27/19, 10:44 AM, "Pedro Larroy"  wrote:
> >
> > I will try to run a few benchmarks in a bare metal instance tonight to
> > remove virtualization variance for the measurements and provide some
> > numbers.
> >
> > Please propose a set of models / examples that would be desirable to
> > run before the release and provide a link to an easy to run script
> > with instructions so we can validate the release better.
> >
> > Thank you.
> >
> > On Thu, Jun 27, 2019 at 10:01 AM Lai Wei  wrote:
> > >
> > > Dear @dev,
> > >
> > > I m cancelling the vote for cached op fix:
> > >
> > > https://github.com/apache/incubator-mxnet/pull/15298
> > >
> > > As for the possible cpu training regression, it looks like not a
> > blocker
> > > for now.
> > >
> > > I will start a new rc2 vote, please help to validate.
> > >
> > > Thanks!
> > >
> > >
> > > On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong 
> > wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > I was able to reproduced the similar result (v1.5 is ~%5.6 slower
> > than
> > > > v1.4, I was using 18 cores for computing) with your script on
> > C5.18xlarge.
> > > > But need to bind the cores with below command when running the
> > script,
> > > > (without setting the env variables, I got a close time (<1%) with
> > v1.5 and
> > > > v1.4)
> > > > export
> > KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > > > export OMP_NUM_THREADS=18
> > > >
> > > > Did you set any env variables during running?
> > > >
> > > > The performance result I got as below:
> > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > > real12m10.856s
> > > > user234m49.576s
> > > > sys 4m38.044s
> > > >
> > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > > real12m52.140s
> > > > user246m30.740s
> > > > sys 5m8.188s
> > > >
> > > > As I looked at the profiling data, most of the ops have same perf
> > between
> > > > v1.4 and v1.5. But some ops like " _backward_BatchNorm" and
> > "Pooling" is
> > > > ~1.37x slower on v1.5 compared with v1.4.
> > > > Will do further analysis on these ops.
> > > >
> > > > Here's the hardware/OS info from my side:
> > > > --Python Info--
> > > > Version  : 3.6.8
> > > > Compiler : GCC 7.3.0
> > > > Build: ('default', 'Dec 30 2018 01:22:34')
> > > > Arch : ('64bit', '')
> > > > Pip Info---
> > > > Version  : 19.0.3
> > > > Directory:
> > > >
> > /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > > --MXNet Info---
> >

Re: OMP

2019-06-28 Thread Pedro Larroy
_PyEval_EvalFrameDefault 0x0050bb66
 0x00504c28
 0x00502540
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
 0x00502209
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
 0x00502209
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
 0x00502209
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
_PyFunction_FastCallDict 0x00501945
_PyObject_FastCallDict 0x005a36f1
_PyObject_CallMethodIdObjArgs 0x0059662e
PyImport_ImportModuleLevelObject 0x004ee84d
_PyEval_EvalFrameDefault 0x0050896c
 0x00504c28
 0x00511d78
PyCFunction_Call 0x0056617e
_PyEval_EvalFrameDefault 0x0050bb66
 0x00504c28
 0x00502540
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
 0x00502209
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
 0x00502209
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
 0x00502209
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
_PyFunction_FastCallDict 0x00501945
_PyObject_FastCallDict 0x005a36f1
_PyObject_CallMethodIdObjArgs 0x0059662e
PyImport_ImportModuleLevelObject 0x004ee84d
_PyEval_EvalFrameDefault 0x0050896c
 0x00504c28
 0x00511d78
PyCFunction_Call 0x0056617e
_PyEval_EvalFrameDefault 0x0050bb66
 0x00504c28
 0x00502540
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
 0x00502209
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
 0x00502209
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
 0x00502209
 0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
_PyFunction_FastCallDict 0x00501945
_PyObject_FastCallDict 0x005a36f1
_PyObject_CallMethodIdObjArgs 0x0059662e
PyImport_ImportModuleLevelObject 0x004ee84d
_PyEval_EvalFrameDefault 0x0050896c
 0x00504c28
PyEval_EvalCode 0x00506393
 0x00634d52
PyRun_FileExFlags 0x00634e0a
PyRun_SimpleFileExFlags 0x006385c8
Py_Main 0x0063915a
main 0x004a6f10
__libc_start_main 0x7f439c45ab97
_start 0x005afa0a


On Tue, Jun 25, 2019 at 1:55 PM Chris Olivier  wrote:
>
> 1) I don't see how that code could cause reentrancy problems in omp. It
> doesn't make any OMP calls at all.  Still doesn't look related to me.
> Setting an environment variable probably doesn't even do anything, because:
>   a) It probably doesn't check the environment variable except at initial
> startup
>   b) Even if it did, whether this code ran before or after the OMP init
> code would be nondeterministic
>   c) It for sure doesn't check the environment variable every time it hits
> an omp region.  That would be ridiculously expensive and checking the OMP
> source code, it doesn't..  You can't affect the OMP behavior at arbitrary
> points in time by setting the "OMP_NUM_THREADS" environment variable.
>
>
>
>
> On Tue, Jun 25, 2019 at 1:20 PM Pedro Larroy 
> wrote:
>
> > Nobody claimed that the original lockup has to do with OMP, but the
> > fix caused re-entrancy into OMP initialization as explained below. So
> > I agree with your statement that the bug that using pthread_atfork was
> > fixing is not related with OMP, but the fix is causing interactions
> > with OMP as described above.
> >
> > Pedro.
> >
> > On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier 
> > wrote:
> > >
> > > The call stacks there are mostly associated with the execution engine
> > > threads, which are not OMP threads.  That lockup doesn't look to me to be
> > > related to OMP   -- the execution engine uses its own thread pool logic
> > --
> > > I'm pretty familiar with that part of the code.  Unless I am missing one
> > --
> > > can you point to the one that looks OMP-related?
> > >
> > >
> > > On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for digging that out Kellen. That's good info so maybe it would
> > > > be good to rework the fix with the info you provided and remove the
> > > > pthread_atfork handlers.
> > > > Do you think setting the device would avoid the problem seen on the
> > > > backtrace you provided?  specifically here:
> > > >
> > > >
> > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24
> > > >
> > > > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
> > > >  wrote:
> > > > >
> > > > > I remember at the

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-28 Thread Pedro Larroy
Thanks Manu.

@all: I observed other strange stuff that I don't understand at the moment:

I installed the 1.5 RC from pip to check that I'm not doing something
wrong when building, and I found out that the CPU usage is quite
subpar ( https://imgur.com/fRmbQNc ) compared to a version compiled
from source. The pip package is using 4-5 cores of the 32. When I
compile from source I get good core utilization (
https://imgur.com/e8BB425 ). I verified this also on a c5d.18xlarge
and a 32-core AMD bare metal machine.

It also seems that the version from pip is using libgomp instead of
LLVM's OpenMP runtime; I'm not sure why.

pip install mxnet==1.5.0b20190627
/home/piotr/py3_1.5rc/lib/python3.6/site-packages/mxnet
piotr@panther:0: ~/p/l/p/s/mxnet> ldd libmxnet.so | grep omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x7f99d1832000)

I tried cifar10 on a bare metal 32-core AMD Zen machine and it is
extremely slow; it doesn't seem to make much progress compared to a
c5d.18xlarge. I couldn't even complete 1 epoch, and I tried with and
without MKL without much success. I will continue digging into this
when possible.


Pedro.
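
A quick way to double-check which OpenMP runtime a given install links,
assuming mxnet.libinfo.find_lib_path() is available to locate libmxnet.so
(it is what the python bindings use) and ldd is on the PATH; a sketch of
the same idea as the ldd call above:

# Sketch: locate the libmxnet.so that the python bindings load and list
# the OpenMP runtime(s) it is linked against.
import subprocess
from mxnet import libinfo

lib = libinfo.find_lib_path()[0]
print("libmxnet:", lib)
out = subprocess.check_output(["ldd", lib]).decode()
omp_lines = [l.strip() for l in out.splitlines() if "omp" in l]
print("\n".join(omp_lines) if omp_lines else "no OpenMP runtime found in ldd output")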

On Thu, Jun 27, 2019 at 9:41 PM Manu Seth  wrote:
>
> Hi all,
>
> I ran the same cifar10.py script as Pedro, but for 20 epochs. Considering
> the first 10 epochs for warm-up, I averaged time per epoch for the last 10
> epochs.
>
> With MXNet 1.4.1 average time is 164.23 s
> With MXNet 1.5.0 average time is 174.59 s (~6.3% regression)
>
>
> For a second data point, I ran Gluon speed test benchmark script -
> https://github.com/apache/incubator-mxnet/blob/master/benchmark/python/gluon/benchmark_gluon.py
> using the following command:
> python3 benchmark_gluon.py --model 'resnet152_v2' --batch-size 128
> --num-batches 200 --type 'training'
>
> I got the following speeds:
> With MXNet 1.4.1, average speed is 25.677534 img/s
> With MXNet 1.5.0, average speed is 25.082130 img/s (~2.3% regression)
>
> Note:
> For 1.4.1 version, I used pip install mxnet-mkl==1.4.1
> For 1.5.0 version, I used pip install mxnet-mkl==1.5.0b20190619 which
> corresponds to commit# ccbbf6b4b76ea536a6583c99497c83b65a20817b which is
> behind 1.5.x branch by 4 commits
>
>
> Best,
> Manu
>
>
> On 6/27/19, 3:37 PM, "sandeep krishnamurthy" 
> wrote:
>
> Hello Ciyong/Pedro,
>
> Ran operator benchmarks on 1.4.1 and 1.5.0.rc2. (Not complete, doesn’t
> cover all MXNet operators, not presented in best possible way, still
> WIP)
>
> https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50
>
> Following operators looks slower in 1.5 compared to 1.4.1:
> - BatchNorm
> - Pooling
> - FullyConnected
> - batch_dot
> - Dot
> - broadcast_mul
> - log_softmax
> and few other operators
>
> Also, several operators runs a lot faster on 1.5 compared to 1.4.1. For
> example - Convolution, flatten, elementwise operators etc. So I see that
> likely few operators have regressed noticeably, however, due to other
> operator performance improvements, the end effect is not that
> significant
> hiding a lot of regression. We need more detailed analysis per operator
> performance. We will not be able to do this for current release, we
> should
> have a more concrete way to determining such performance regression
> before
> next release.
>
> Setup:
> 1.5 => Build from source (head of 1.5.rc2 tag), built with MKLDNN
> 1.4.1 => PyPi mxnet-mkl==1.4.1
> Machine: C5.18X
> No explicit environment variable were set
> Operator benchmark code -
> https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
>
> Best,
> Sandeep
>
>
> On Thu, Jun 27, 2019 at 10:42 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > I will try to run a few benchmarks in a bare metal instance tonight to
> > remove virtualization variance for the measurements and provide some
> > numbers.
> >
> > Please propose a set of models / examples that would be desirable to
> > run before the release and provide a link to an easy to run script
> > with instructions so we can validate the release better.
> >
> > Thank you.
> >
> > On Thu, Jun 27, 2019 at 10:01 AM Lai Wei  wrote:
> > >
> > > Dear @dev,
> > >
> > > I m cancelling the vote for cached op fix:
> > >
> > > https://github.com/apache/incubator-mxnet/pull/15298
> > >
> > > As for the possible cpu training regression, it looks like not a
> blocker
> > > for now.
> > >

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-27 Thread Pedro Larroy
I will try to run a few benchmarks in a bare metal instance tonight to
remove virtualization variance for the measurements and provide some
numbers.

Please propose a set of models / examples that would be desirable to
run before the release and provide a link to an easy to run script
with instructions so we can validate the release better.

Thank you.

On Thu, Jun 27, 2019 at 10:01 AM Lai Wei  wrote:
>
> Dear @dev,
>
> I m cancelling the vote for cached op fix:
>
> https://github.com/apache/incubator-mxnet/pull/15298
>
> As for the possible cpu training regression, it looks like not a blocker
> for now.
>
> I will start a new rc2 vote, please help to validate.
>
> Thanks!
>
>
> On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong  wrote:
>
> > Hi Pedro,
> >
> > I was able to reproduced the similar result (v1.5 is ~%5.6 slower than
> > v1.4, I was using 18 cores for computing) with your script on C5.18xlarge.
> > But need to bind the cores with below command when running the script,
> > (without setting the env variables, I got a close time (<1%) with v1.5 and
> > v1.4)
> > export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > export OMP_NUM_THREADS=18
> >
> > Did you set any env variables during running?
> >
> > The performance result I got as below:
> > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > real12m10.856s
> > user234m49.576s
> > sys 4m38.044s
> >
> > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > real12m52.140s
> > user246m30.740s
> > sys 5m8.188s
> >
> > As I looked at the profiling data, most of the ops have same perf between
> > v1.4 and v1.5. But some ops like " _backward_BatchNorm" and "Pooling" is
> > ~1.37x slower on v1.5 compared with v1.4.
> > Will do further analysis on these ops.
> >
> > Here's the hardware/OS info from my side:
> > --Python Info--
> > Version  : 3.6.8
> > Compiler : GCC 7.3.0
> > Build: ('default', 'Dec 30 2018 01:22:34')
> > Arch : ('64bit', '')
> > Pip Info---
> > Version  : 19.0.3
> > Directory:
> > /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > --MXNet Info---
> > Version  : 1.5.0
> > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > Hashtag not found. Not installed from pre-built package.
> > --System Info--
> > Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > system   : Linux
> > node : ip-172-31-32-129
> > release  : 4.4.0-1085-aws
> > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> > --Hardware Info--
> > machine  : x86_64
> > processor: x86_64
> > Architecture:  x86_64
> > CPU op-mode(s):32-bit, 64-bit
> > Byte Order:Little Endian
> > CPU(s):72
> > On-line CPU(s) list:   0-71
> > Thread(s) per core:2
> > Core(s) per socket:18
> > Socket(s): 2
> > NUMA node(s):  2
> > Vendor ID: GenuineIntel
> > CPU family:6
> > Model: 85
> > Model name:Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
> > Stepping:  3
> > CPU MHz:   3000.000
> > BogoMIPS:  6000.00
> > Hypervisor vendor: KVM
> > Virtualization type:   full
> > L1d cache: 32K
> > L1i cache: 32K
> > L2 cache:  1024K
> > L3 cache:  25344K
> > NUMA node0 CPU(s): 0-17,36-53
> > NUMA node1 CPU(s): 18-35,54-71
> > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> > pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
> > rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc
> > aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1
> > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
> > hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase
> > tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx
> > smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
> > --Network Test--
> >
> >
> > -Ciyong
> >
> >
> > -Original Message-
> > From: Zhao, Patric [mailto:patric.z...@intel.com]
> > Sent: Thursday, June 27, 2019 9:55 AM
> >

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-26 Thread Pedro Larroy
I ran it again and the gap is bigger again; I guess we need to average
out the times across several runs (see the sketch after the logs below):

piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench (master)+$
time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5 && time
~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
[23:17:09] ../src/io/iter_image_recordio_2.cc:172:
ImageRecordIOParser2:
/home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads
for decoding..
[23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
[23:17:09] ../src/io/iter_image_recordio_2.cc:172:
ImageRecordIOParser2:
/home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
for decoding..
[23:17:09] ../src/io/iter_image_recordio_2.cc:230: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:17:09] ../src/io/iter_image_recordio_2.cc:248: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300: 0.0001}
Epoch 0, Changed learning rate to 0.05
[23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
147456 bytes with malloc directly
[23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
589824 bytes with malloc directly
[23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
2359296 bytes with malloc directly
[23:17:09] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
9437184 bytes with malloc directly
Epoch 0, Batch 199, Speed=384.149839
Epoch 0, Duration=140.919567
Epoch 0, Training accuracy=0.115169
Epoch 0, Validation accuracy=0.141317
Epoch 1, Batch 199, Speed=433.380512
Epoch 1, Duration=119.553233
Epoch 1, Training accuracy=0.170956
Epoch 1, Validation accuracy=0.216146
Epoch 2, Batch 199, Speed=434.864699
Epoch 2, Duration=123.278490
Epoch 2, Training accuracy=0.209455
Epoch 2, Validation accuracy=0.247296
Epoch 3, Batch 199, Speed=433.401854
Epoch 3, Duration=118.327797
Epoch 3, Training accuracy=0.248701
Epoch 3, Validation accuracy=0.302083
Epoch 4, Batch 199, Speed=419.713707
Epoch 4, Duration=126.468409
Epoch 4, Training accuracy=0.260949
Epoch 4, Validation accuracy=0.269030

real    10m55.796s
user    399m33.567s
sys     13m55.904s
[23:28:04] ../src/io/iter_image_recordio_2.cc:172:
ImageRecordIOParser2:
/home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads
for decoding..
[23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
[23:28:04] ../src/io/iter_image_recordio_2.cc:172:
ImageRecordIOParser2:
/home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
for decoding..
[23:28:04] ../src/io/iter_image_recordio_2.cc:230: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:28:04] ../src/io/iter_image_recordio_2.cc:248: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300: 0.0001}
Epoch 0, Changed learning rate to 0.05
Epoch 0, Batch 199, Speed=419.039188
Epoch 0, Duration=143.934903
Epoch 0, Training accuracy=0.122542
Epoch 0, Validation accuracy=0.164359
Epoch 1, Batch 199, Speed=445.257048
Epoch 1, Duration=135.248399
Epoch 1, Training accuracy=0.178828
Epoch 1, Validation accuracy=0.199419
Epoch 2, Batch 199, Speed=447.115215
Epoch 2, Duration=132.003770
Epoch 2, Training accuracy=0.217808
Epoch 2, Validation accuracy=0.233073
Epoch 3, Batch 199, Speed=441.079477
Epoch 3, Duration=126.543316
Epoch 3, Training accuracy=0.248102
Epoch 3, Validation accuracy=0.293870
Epoch 4, Batch 199, Speed=449.329787
Epoch 4, Duration=138.398325
Epoch 4, Training accuracy=0.270021
Epoch 4, Validation accuracy=0.311498

real    11m45.329s
user    426m13.908s
sys     16m45.093s
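
For the averaging, a minimal sketch that assumes the script keeps printing
lines of the form "Epoch N, Duration=123.456" as in the logs above; it
discards a couple of warm-up epochs and reports the mean of the rest. The
script name and warm-up count are arbitrary choices for illustration, e.g.
python cifar10.py --epochs 20 | tee run.log; python avg_epochs.py < run.log

# Sketch (avg_epochs.py): average epoch durations from a training log read
# on stdin, skipping warm-up epochs, so single-run noise matters less.
import re
import sys

WARMUP = 2  # epochs to ignore; an arbitrary choice for illustration

durations = []
for line in sys.stdin:
    m = re.search(r"Epoch \d+, Duration=([0-9.]+)", line)
    if m:
        durations.append(float(m.group(1)))

steady = durations[WARMUP:]
if steady:
    print("epochs counted:", len(steady))
    print("mean epoch time: %.2f s" % (sum(steady) / len(steady)))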

On Wed, Jun 26, 2019 at 4:18 PM Pedro Larroy
 wrote:
>
> The difference looks smaller now, more like your numbers. I wonder if
> something happened during the previous benchmark like a system
> update...
>
>
> piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench (master)+$
> time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5 && time
> ~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
> [22:49:41] ../src/io/iter_image_recordio_2.cc:172:
> ImageRecordIOParser2:
> /home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads
> for decoding..
> [22:49:41] ../src/io/iter_image_recordio_2.cc:230: Load mean image
> from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
> [22:49:41] ../src/io/iter_image_recordio_2.cc:248: Load mean image
> from /home/piotr/deeplear

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-26 Thread Pedro Larroy
The difference looks smaller now, more like your numbers. I wonder if
something happened during the previous benchmark like a system
update...


piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench (master)+$
time ~/mxnet_1.4/py3_venv/bin/python cifar10.py --epochs 5 && time
~/mxnet_1.5/py3_venv/bin/python cifar10.py --epochs 5
[22:49:41] ../src/io/iter_image_recordio_2.cc:172:
ImageRecordIOParser2:
/home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads
for decoding..
[22:49:41] ../src/io/iter_image_recordio_2.cc:230: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[22:49:41] ../src/io/iter_image_recordio_2.cc:248: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
[22:49:41] ../src/io/iter_image_recordio_2.cc:172:
ImageRecordIOParser2:
/home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
for decoding..
[22:49:41] ../src/io/iter_image_recordio_2.cc:230: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[22:49:41] ../src/io/iter_image_recordio_2.cc:248: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300: 0.0001}
Epoch 0, Changed learning rate to 0.05
[22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
147456 bytes with malloc directly
[22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
589824 bytes with malloc directly
[22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
2359296 bytes with malloc directly
[22:49:42] ../src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate
9437184 bytes with malloc directly
Epoch 0, Batch 199, Speed=426.182733
Epoch 0, Duration=134.868458
Epoch 0, Training accuracy=0.127238
Epoch 0, Validation accuracy=0.206388
Epoch 1, Batch 199, Speed=313.127156
Epoch 1, Duration=128.041775
Epoch 1, Training accuracy=0.182065
Epoch 1, Validation accuracy=0.202524
Epoch 2, Batch 199, Speed=410.931187
Epoch 2, Duration=124.920588
Epoch 2, Training accuracy=0.202584
Epoch 2, Validation accuracy=0.245693
Epoch 3, Batch 199, Speed=419.119335
Epoch 3, Duration=120.948349
Epoch 3, Training accuracy=0.235854
Epoch 3, Validation accuracy=0.291066
Epoch 4, Batch 199, Speed=430.473733
Epoch 4, Duration=130.181724
Epoch 4, Training accuracy=0.257773
Epoch 4, Validation accuracy=0.304988

real    11m7.356s
user    406m9.910s
sys     14m18.349s
[23:00:49] ../src/io/iter_image_recordio_2.cc:172:
ImageRecordIOParser2:
/home/piotr/deeplearning-benchmark/data/cifar/train.rec, use 4 threads
for decoding..
[23:00:49] ../src/io/iter_image_recordio_2.cc:230: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:00:49] ../src/io/iter_image_recordio_2.cc:248: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
[23:00:49] ../src/io/iter_image_recordio_2.cc:172:
ImageRecordIOParser2:
/home/piotr/deeplearning-benchmark/data/cifar/test.rec, use 4 threads
for decoding..
[23:00:49] ../src/io/iter_image_recordio_2.cc:230: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin
[23:00:49] ../src/io/iter_image_recordio_2.cc:248: Load mean image
from /home/piotr/deeplearning-benchmark/data/cifar/mean.bin completed
lr_schedule: {0: 0.05, 82: 0.005001, 123: 0.0005, 300: 0.0001}
Epoch 0, Changed learning rate to 0.05
Epoch 0, Batch 199, Speed=348.618154
Epoch 0, Duration=146.469352
Epoch 0, Training accuracy=0.124121
Epoch 0, Validation accuracy=0.167227
Epoch 1, Batch 199, Speed=452.790825
Epoch 1, Duration=130.199421
Epoch 1, Training accuracy=0.183863
Epoch 1, Validation accuracy=0.237079
Epoch 2, Batch 199, Speed=451.406559
Epoch 2, Duration=126.320823
Epoch 2, Training accuracy=0.214844
Epoch 2, Validation accuracy=0.244692
Epoch 3, Batch 199, Speed=403.161873
Epoch 3, Duration=125.331660
Epoch 3, Training accuracy=0.243506
Epoch 3, Validation accuracy=0.301182
Epoch 4, Batch 199, Speed=450.826598
Epoch 4, Duration=126.426253
Epoch 4, Training accuracy=0.266424
Epoch 4, Validation accuracy=0.311899

real    11m21.930s
user    415m3.855s
sys     13m53.975s

On Wed, Jun 26, 2019 at 3:50 PM Pedro Larroy
 wrote:
>
> Hi Ciyong, thanks for trying to reproduce:
>
> I used this one:
> https://github.com/awslabs/deeplearning-benchmark/blob/master/dawnbench/cifar10.py
>
> Could you provide hardware and OS details?
>
> I will rerun and repost numbers in a few minutes.
>
> Pedro.
>
> On Wed, Jun 26, 2019 at 4:18 AM Chen, Ciyong  wrote:
> >
> > Hi Pedro,
> >
> > I'm looking at this case, and using the script of 
> > "incubator-mxnet/example/image-classification/train_cifar10.py" to get
> > the timing data, but seems there's not much difference between mxnet 
> > 1.4.1.rc0 and 1.5.0.rc1 on C5.18xlarge.
> >
> > Not sure if there's any difference in the python script, can

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-26 Thread Pedro Larroy
Hi Ciyong, thanks for trying to reproduce:

I used this one:
https://github.com/awslabs/deeplearning-benchmark/blob/master/dawnbench/cifar10.py

Could you provide hardware and OS details?

I will rerun and repost numbers in a few minutes.

Pedro.

On Wed, Jun 26, 2019 at 4:18 AM Chen, Ciyong  wrote:
>
> Hi Pedro,
>
> I'm looking at this case, and using the script of 
> "incubator-mxnet/example/image-classification/train_cifar10.py" to get
> the timing data, but seems there's not much difference between mxnet 
> 1.4.1.rc0 and 1.5.0.rc1 on C5.18xlarge.
>
> Not sure if there's any difference in the python script, can you point me the 
> link to get your script (cifar10.py)?
> Or you can also have a try with MXNet's script (train_cifar10.py) and see the 
> performance.
>
> Here's the command I used to collect the time:
> python train_cifar10.py --num-epoch=5
>
> 1) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> real9m4.880s
> user333m13.340s
> sys 14m36.100s
>
> 2) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> real9m2.155s
> user329m37.092s
>     sys 16m8.668s
>
> -Ciyong
>
>
> -Original Message-
> From: Pedro Larroy [mailto:pedro.larroy.li...@gmail.com]
> Sent: Wednesday, June 26, 2019 6:28 AM
> To: dev@mxnet.incubator.apache.org
> Cc: d...@mxnet.apache.org
> Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
>
> Hi these were my build flags and system info:
>
>
> --- # CMake configuration
> USE_CUDA: "OFF" # Build with CUDA support
> USE_OLDCMAKECUDA: "OFF" # Build with old cmake cuda
> USE_NCCL: "OFF" # Use NVidia NCCL with CUDA
> USE_OPENCV: "ON" # Build with OpenCV support
> USE_OPENMP: "ON" # Build with Openmp support
> USE_CUDNN: "ON" # Build with cudnn support) # one could set CUDNN_ROOT for 
> search path
> USE_SSE: "ON" # Build with x86 SSE instruction support IF NOT ARM
> USE_F16C: "ON" # Build with x86 F16C instruction support) # autodetects 
> support if "ON"
> USE_LAPACK: "ON" # Build with lapack support
> USE_MKL_IF_AVAILABLE: "ON" # Use MKL if found
> USE_MKLML_MKL: "ON" # Use MKLDNN variant of MKL (if MKL found) IF 
> USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> USE_MKLDNN: "ON" # Use MKLDNN variant of MKL (if MKL found) IF 
> USE_MKL_IF_AVAILABLE AND (NOT APPLE)
> USE_OPERATOR_TUNING: "ON" # Enable auto-tuning of operators IF NOT MSVC
> USE_GPERFTOOLS: "ON" # Build with GPerfTools support (if found)
> USE_JEMALLOC: "ON" # Build with Jemalloc support
> USE_PROFILER: "ON" # Build with Profiler support
> USE_DIST_KVSTORE: "OFF" # Build with DIST_KVSTORE support
> USE_PLUGINS_WARPCTC: "OFF" # Use WARPCTC Plugins
> USE_PLUGIN_CAFFE: "OFF" # Use Caffe Plugin
> USE_CPP_PACKAGE: "OFF" # Build C++ Package
> USE_MXNET_LIB_NAMING: "ON" # Use MXNet library naming conventions.
> USE_GPROF: "OFF" # Compile with gprof (profiling) flag
> USE_CXX14_IF_AVAILABLE: "OFF" # Build with C++14 if the compiler supports it
> USE_VTUNE: "OFF" # Enable use of Intel Amplifier XE (VTune)) # one could set 
> VTUNE_ROOT for search path
> ENABLE_CUDA_RTC: "ON" # Build with CUDA runtime compilation support
> BUILD_CPP_EXAMPLES: "ON" # Build cpp examples
> INSTALL_EXAMPLES: "OFF" # Install the example source files.
> USE_SIGNAL_HANDLER: "ON" # Print stack traces on segfaults.
> USE_TENSORRT: "OFF" # Enable infeference optimization with TensorRT.
> USE_ASAN: "OFF" # Enable Clang/GCC ASAN sanitizers.
> ENABLE_TESTCOVERAGE: "OFF" # Enable compilation with test coverage metric 
> output
> CMAKE_BUILD_TYPE: "Release"
> CMAKE_CUDA_COMPILER_LAUNCHER: "ccache"
> CMAKE_C_COMPILER_LAUNCHER: "ccache"
> CMAKE_CXX_COMPILER_LAUNCHER: "ccache"
>
> commit 4d9667121ae6fb643f2a02ab15e25231ed756cde (HEAD, tag: 1.5.0.rc1,
> upstream/v1.5.x)
> commit 1a7199691f5cbc6012bb53eecbf884bed5ae6590 (HEAD, tag: 1.4.1.rc0,
> upstream/v1.4.x)
>
> curl http://169.254.169.254/latest/meta-data/instance-type
> c5d.18xlarge
>
>
> Version  : 3.6.7
> Compiler : GCC 8.2.0
> Build: ('default', 'Oct 22 2018 11:32:17')
> Arch : ('64bit', 'ELF')
> Pip Info---
> Version  : 19.1.1
> Directory: /home/piotr/mxnet_1.5/py3_venv/lib/python3.6/site-packages/pip
> --MXNet Info---
> Version  : 1.5.0
> Directory: /home/pio

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-25 Thread Pedro Larroy
-Pip Info---
Version  : 19.1.1
Directory: /home/piotr/mxnet_1.4/py3_venv/lib/python3.6/site-packages/pip
--MXNet Info---
Version  : 1.4.1
Directory: /home/piotr/mxnet_1.4/python/mxnet
Hashtag not found. Not installed from pre-built package.
--System Info--
Platform : Linux-4.15.0-1035-aws-x86_64-with-Ubuntu-18.04-bionic
system   : Linux
node : ip-172-31-63-171
release  : 4.15.0-1035-aws
version  : #37-Ubuntu SMP Mon Mar 18 16:15:14 UTC 2019
--Hardware Info--
machine  : x86_64
processor: x86_64
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):   2
NUMA node(s):2
Vendor ID:   GenuineIntel
CPU family:  6
Model:   85
Model name:  Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:4
CPU MHz: 1223.344
BogoMIPS:6000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:   32K
L1i cache:   32K
L2 cache:1024K
L3 cache:25344K
NUMA node0 CPU(s):   0-17,36-53
NUMA node1 CPU(s):   18-35,54-71
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid
sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti
fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx
avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw
avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
--Network Test--

On Tue, Jun 25, 2019 at 2:35 PM Pedro Larroy
 wrote:
>
> I did a training run of cifar10 on CPU and it seems there's a regression
> in the range of a 7% increase in training time against 1.4.1:
>
> (py3_venv) piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
> (master)+$ time python cifar10.py --epochs 5
> real11m30.388s
> user417m7.766s
> sys 16m57.315s
>
> VS 1.4.1:
> real10m41.994s
> user392m40.646s
> sys 12m30.601s
>
>
> On Thu, Jun 20, 2019 at 10:15 PM Lai Wei  wrote:
> >
> > Hi Anirudh,
> >
> > Thanks for jumping into this quickly, I followed up on the issue.
> >
> > I was meant for sockeye developer/maintainers to help setup nightly tests
> > and raise issues early.
> >
> > Thanks!
> >
> > On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin 
> > wrote:
> >
> > > In GluonNLP we are testing with MXNET nightly build for each PR, and we 
> > > did
> > > find some MXNet related issue caught by the CI.
> > > I recommend other toolkits also add integration tests with MXNet nightly.
> > > It helps identify issues early.
> > >
> > > Best,
> > > Haibin
> > >
> > > On Thu, Jun 20, 2019 at 18:52 Zhao, Patric  wrote:
> > >
> > > > Thanks to raise the issue and we will take a look ASAP.
> > > >
> > > > The downstream cases is not in the MXNet CI so it's hard to catch the
> > > > potential bugs or performance degradation for MXNet developers.
> > > >
> > > > In the future, I suggest adding the major downstream test cases, like
> > > from
> > > > sockeye, GluonNLP, GLuonCV, DGL, Gluon-TS, into the nightly test.
> > > > If it's still too heavy,  maybe testing it weekly or monthly :)
> > > >
> > > > Thanks,
> > > >
> > > > --Patric
> > > >
> > > > > -Original Message-
> > > > > From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> > > > > Sent: Friday, June 21, 2019 9:31 AM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Cc: d...@mxnet.apache.org
> > > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 
> > > > > 1.5.0.rc1
> > > > >
> > > > > Hi Lai,
> > > > >
> > > > > I have opened an issue:
> > > > > https://github.com/apache/incubator-mxnet/issues/15297
> > > > > I came to know about this issue only today and I have not been
> > > monitoring
> > > > > sockeye.
> > > > > I jumped onto this issue to make sure it wasn't caused by the dlpack
> > > > changes.
> > > > > Also, I don't  think sockeye CI checks against master,

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-25 Thread Pedro Larroy
I did a training run of cifar10 on CPU and it seems there's a regression
in the range of a 7% increase in training time against 1.4.1:

(py3_venv) piotr@ip-172-31-63-171:0:~/deeplearning-benchmark/dawnbench
(master)+$ time python cifar10.py --epochs 5
real    11m30.388s
user    417m7.766s
sys     16m57.315s

VS 1.4.1:
real    10m41.994s
user    392m40.646s
sys     12m30.601s


On Thu, Jun 20, 2019 at 10:15 PM Lai Wei  wrote:
>
> Hi Anirudh,
>
> Thanks for jumping into this quickly, I followed up on the issue.
>
> I was meant for sockeye developer/maintainers to help setup nightly tests
> and raise issues early.
>
> Thanks!
>
> On Fri, Jun 21, 2019 at 10:10 AM Haibin Lin 
> wrote:
>
> > In GluonNLP we are testing with MXNET nightly build for each PR, and we did
> > find some MXNet related issue caught by the CI.
> > I recommend other toolkits also add integration tests with MXNet nightly.
> > It helps identify issues early.
> >
> > Best,
> > Haibin
> >
> > On Thu, Jun 20, 2019 at 18:52 Zhao, Patric  wrote:
> >
> > > Thanks to raise the issue and we will take a look ASAP.
> > >
> > > The downstream cases is not in the MXNet CI so it's hard to catch the
> > > potential bugs or performance degradation for MXNet developers.
> > >
> > > In the future, I suggest adding the major downstream test cases, like
> > from
> > > sockeye, GluonNLP, GLuonCV, DGL, Gluon-TS, into the nightly test.
> > > If it's still too heavy,  maybe testing it weekly or monthly :)
> > >
> > > Thanks,
> > >
> > > --Patric
> > >
> > > > -Original Message-
> > > > From: Anirudh Subramanian [mailto:anirudh2...@gmail.com]
> > > > Sent: Friday, June 21, 2019 9:31 AM
> > > > To: dev@mxnet.incubator.apache.org
> > > > Cc: d...@mxnet.apache.org
> > > > Subject: Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1
> > > >
> > > > Hi Lai,
> > > >
> > > > I have opened an issue:
> > > > https://github.com/apache/incubator-mxnet/issues/15297
> > > > I came to know about this issue only today and I have not been
> > monitoring
> > > > sockeye.
> > > > I jumped onto this issue to make sure it wasn't caused by the dlpack
> > > changes.
> > > > Also, I don't  think sockeye CI checks against master, it is using
> > 1.4.1.
> > > >
> > > > Anirudh
> > > >
> > > >
> > > > On Thu, Jun 20, 2019 at 6:17 PM Lai Wei  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Could you share which test failed and what’s the crash? How to
> > > > > reproduce it?
> > > > >
> > > > > I was able to install sockeye and run all tests passed. Using python
> > > > > setup.py test
> > > > >
> > > > > I have tested both nightly pip package and 1.5.0.rc1
> > > > >
> > > > > It would be great to create an issue with reproducible steps and move
> > > > > the discussion there.
> > > > >
> > > > > Also I see sockeye nightly build[1] has been failing for some time,
> > if
> > > > > it’s due to MXNet change, please raise this early so we can track and
> > > > > solve it in time rather than block the release during vote time.
> > > > >
> > > > > [1] https://travis-ci.org/awslabs/sockeye
> > > > >
> > > > >
> > > > > On Fri, Jun 21, 2019 at 7:01 AM Anirudh Subramanian
> > > > >  > > > > >
> > > > > wrote:
> > > > >
> > > > > > I was able to reproduce a crash with the commit
> > > > > > 09202f7f261954383aa387144524d38f83f18d06 but not with the commit
> > > > > > a862270beb2d796c1ba311183f7f4a766a18ad6c.
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Thu, Jun 20, 2019 at 3:53 PM Lai Wei 
> > wrote:
> > > > > >
> > > > > > > Hi Przemyslaw,
> > > > > > >
> > > > > > > Is there an issue with more details to track the problem?
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jun 21, 2019 at 6:04 AM Przemysław Trędak
> > > > > > > 
> > > > > > > wrote:
> > > > > > >
> > > > > > > > -1
> > > > > > > >
> > > > > > > > There is a crash in sockeye unit test (python setup.py test)
> > > > > > > > observed starting with nightly 1.5 build from 6/13 and still
> > > > > > > > occuring in
> > > > > > 1.5rc1. I
> > > > > > > > don't yet have the exact commit that is responsible for it, but
> > > > > > > > it is either a862270beb2d796c1ba311183f7f4a766a18ad6c (dlpack
> > > > > > > > related) or
> > > > > > > > 09202f7f261954383aa387144524d38f83f18d06 (cached op
> > > > optimization).
> > > > > > > >
> > > > > > > > On 2019/06/20 06:36:22, Lai Wei  wrote:
> > > > > > > > > Dear MXNet community,
> > > > > > > > >
> > > > > > > > > This is the 3-day vote to release Apache MXNet (incubating)
> > > > > > > > > version
> > > > > > > > 1.5.0.
> > > > > > > > > Voting on dev@ will start June 19, 23:59:59(PST)  and close
> > on
> > > > > June
> > > > > > > 22,
> > > > > > > > > 23:59:59.
> > > > > > > > >
> > > > > > > > > 1) Link to release notes:
> > > > > > > > >
> > > > > >
> > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Note
> > > > > > s
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2) Link to release candidate:
> > > > > > > > >
> > > > > > > > >
> > 

Re: OMP

2019-06-25 Thread Pedro Larroy
Nobody claimed that the original lockup had to do with OMP, but the
fix caused re-entrancy into OMP initialization as explained below. So
I agree with your statement that the bug the pthread_atfork fix was
addressing is not related to OMP, but the fix is causing interactions
with OMP as described above.

Pedro.

On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier  wrote:
>
> The call stacks there are mostly associated with the execution engine
> threads, which are not OMP threads.  That lockup doesn't look to me to be
> related to OMP   -- the execution engine uses its own thread pool logic --
> I'm pretty familiar with that part of the code.  Unless I am missing one --
> can you point to the one that looks OMP-related?
>
>
> On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy 
> wrote:
>
> > Thanks for digging that out Kellen. That's good info so maybe it would
> > be good to rework the fix with the info you provided and remove the
> > pthread_atfork handlers.
> > Do you think setting the device would avoid the problem seen on the
> > backtrace you provided?  specifically here:
> >
> > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24
> >
> > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
> >  wrote:
> > >
> > > I remember at the time we also had a read through of this blog post, but
> > to
> > > use the code looked like it was following the advice:
> > >
> > https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
> > >
> > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> > > > believe the stacks for the hang are here:
> > > >
> > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
> > and
> > > > the trick was we could only debug it up to the point that we hit:
> > > >
> > > > #0  0x7fec6df1ba4f in futex_wait (private=0, expected=1,
> > > > futex_word=0x7fec60843758)
> > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> > > > #1  futex_wait_simple (private=0, expected=1,
> > futex_word=0x7fec60843758)
> > > > at ../sysdeps/nptl/futex-internal.h:135
> > > > #2  __pthread_once_slow (once_control=0x7fec60843758,
> > > > init_routine=0x7fec605f38f0)
> > > >     at pthread_once.c:105
> > > > ...
> > > > #6  0x7fec6061c577 in cudaSetDevice () from
> > > > /usr/local/cuda/lib64/libcudart.so.9.0
> > > >
> > > > because the code in libcudart is obviously closed source we couldn't
> > dig
> > > > into what threading work was going on when we called cudaSetDevice.
> > > >
> > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > > > wrote:
> > > >
> > > >> If you check initialize.cc we seem to be explicitly disabling that
> > > >> behaviour in pthread_at_fork which seems to cause thread contention
> > > >> during multiprocessing. Why do we need this major advantage for the
> > > >> library if that's the case?
> > > >>
> > > >> Related PRs:
> > > >>
> > > >> https://github.com/apache/incubator-mxnet/pull/10820
> > > >> https://github.com/apache/incubator-mxnet/issues/14396
> > > >>
> > > >> The original code was authored in this PR:
> > > >>
> > > >> https://github.com/apache/incubator-mxnet/pull/8677
> > > >>
> > > >> I actually remember this fix, it was done during a release as the cuda
> > > >> runtime was forking and the engine was being re-entered. If that
> > > >> situation is not happening anymore, it might not be needed any longer.
> > > >> I don't think we know the cause why there was a fork inside cuda, so
> > > >> the code has grown around a fix for an issue which its root cause was
> > > >> not understood, and side effects which this fix caused afterwards.
> > > >>
> > > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
> > > >> the link above, no libgomp.
> > > >>
> > > >> I didn't try the Make build.
> > > >>
> > > >> I would refactor the code linked above and stop using pthread_at_fork,
> > > >

Re: OMP

2019-06-25 Thread Pedro Larroy
Thanks for digging that out, Kellen. That's good info, so maybe it would
be good to rework the fix with the info you provided and remove the
pthread_atfork handlers.
Do you think setting the device would avoid the problem seen in the
backtrace you provided? Specifically here:
https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24

On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
 wrote:
>
> I remember at the time we also had a read through of this blog post, but to
> use the code looked like it was following the advice:
> https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
>
> On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> > believe the stacks for the hang are here:
> > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 
> > and
> > the trick was we could only debug it up to the point that we hit:
> >
> > #0  0x7fec6df1ba4f in futex_wait (private=0, expected=1,
> > futex_word=0x7fec60843758)
> > at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> > #1  futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758)
> > at ../sysdeps/nptl/futex-internal.h:135
> > #2  __pthread_once_slow (once_control=0x7fec60843758,
> > init_routine=0x7fec605f38f0)
> > at pthread_once.c:105
> > ...
> > #6  0x7fec6061c577 in cudaSetDevice () from
> > /usr/local/cuda/lib64/libcudart.so.9.0
> >
> > because the code in libcudart is obviously closed source we couldn't dig
> > into what threading work was going on when we called cudaSetDevice.
> >
> > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy 
> > wrote:
> >
> >> If you check initialize.cc we seem to be explicitly disabling that
> >> behaviour in pthread_at_fork which seems to cause thread contention
> >> during multiprocessing. Why do we need this major advantage for the
> >> library if that's the case?
> >>
> >> Related PRs:
> >>
> >> https://github.com/apache/incubator-mxnet/pull/10820
> >> https://github.com/apache/incubator-mxnet/issues/14396
> >>
> >> The original code was authored in this PR:
> >>
> >> https://github.com/apache/incubator-mxnet/pull/8677
> >>
> >> I actually remember this fix, it was done during a release as the cuda
> >> runtime was forking and the engine was being re-entered. If that
> >> situation is not happening anymore, it might not be needed any longer.
> >> I don't think we know the cause why there was a fork inside cuda, so
> >> the code has grown around a fix for an issue which its root cause was
> >> not understood, and side effects which this fix caused afterwards.
> >>
> >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
> >> the link above, no libgomp.
> >>
> >> I didn't try the Make build.
> >>
> >> I would refactor the code linked above and stop using pthread_at_fork,
> >> since OMP assumes it won't be initialized twice, but needs to be very
> >> well tested to make sure it doesn't cause bugs or affect the fixes
> >> done on the linked PRs above.
> >>
> >> Pedro.
> >>
> >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier 
> >> wrote:
> >> >
> >> > one major advantage of intel/llvm omp is that it spawns a new thread
> >> pool
> >> > after fork if a thread pool was already created. this is so that omp
> >> can be
> >> > used in the forked processes. libgomp doesn’t do this so it’ll just
> >> lock up
> >> > if you try to do omp in the forked process.
> >> >
> >> > is your build linking libgomp as well?
> >> >
> >> > standard mkl build (from Makefile) uses same omp library. are there
> >> > problems with that build?
> >> >
> >> > what changes need to be made to make the assertion not fire?
> >> >
> >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
> >> pedro.larroy.li...@gmail.com>
> >> > wrote:
> >> >
> >> > > There's an assertion which is easily reproducible, and also there's a
> >> > > crash including core dump, the latter is not easy to reproduce for me
> >> > > in different environments. I have also seen mxnet getting stuck
> >> > > without progressing with this build configuration and using no

Re: OMP

2019-06-24 Thread Pedro Larroy
If you check initialize.cc we seem to be explicitly disabling that
behaviour in pthread_atfork which seems to cause thread contention
during multiprocessing. Why do we need this major advantage for the
library if that's the case?

Related PRs:

https://github.com/apache/incubator-mxnet/pull/10820
https://github.com/apache/incubator-mxnet/issues/14396

The original code was authored in this PR:

https://github.com/apache/incubator-mxnet/pull/8677

I actually remember this fix, it was done during a release as the cuda
runtime was forking and the engine was being re-entered. If that
situation is not happening anymore, it might not be needed any longer.
I don't think we know why there was a fork inside cuda, so
the code has grown around a fix for an issue whose root cause was
not understood, and side effects which this fix caused afterwards.

My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
the link above, no libgomp.

I didn't try the Make build.

I would refactor the code linked above and stop using pthread_atfork,
since OMP assumes it won't be initialized twice, but needs to be very
well tested to make sure it doesn't cause bugs or affect the fixes
done on the linked PRs above.
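
For reference, a minimal sketch of how a pthread_atfork child handler like the
one in initialize.cc is registered; the handler names and bodies below are
hypothetical placeholders, not the actual MXNet code:

    #include <pthread.h>
    #include <unistd.h>
    #include <cstdio>

    static void prepare_fork()      { std::printf("parent: about to fork\n"); }
    static void parent_after_fork() { std::printf("parent: fork done\n"); }
    static void child_after_fork()  {
      // Roughly where runtime state would be re-created in the child, which is
      // the step that can re-enter OMP initialization.
      std::printf("child: re-initializing runtime state\n");
    }

    int main() {
      pthread_atfork(prepare_fork, parent_after_fork, child_after_fork);
      if (fork() == 0) {
        _exit(0);  // child would do its work here
      }
      return 0;
    }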

Pedro.

On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier  wrote:
>
> one major advantage of intel/llvm omp is that it spawns a new thread pool
> after fork if a thread pool was already created. this is so that omp can be
> used in the forked processes. libgomp doesn’t do this so it’ll just lock up
> if you try to do omp in the forked process.
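
For context, a minimal sketch of the fork-then-OMP pattern described above: the
thread pool is created in the parent, then the child enters another parallel
region. With the LLVM/Intel runtime the pool is respawned in the child; with
libgomp this is the pattern that can lock up. Illustrative only, not MXNet code:

    #include <omp.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      #pragma omp parallel
      { }  // force the thread pool to be created in the parent

      pid_t pid = fork();
      if (pid == 0) {
        #pragma omp parallel  // child re-enters OMP here
        { std::printf("child omp thread %d\n", omp_get_thread_num()); }
        _exit(0);
      }
      waitpid(pid, nullptr, 0);
      return 0;
    }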
>
> is your build linking libgomp as well?
>
> standard mkl build (from Makefile) uses same omp library. are there
> problems with that build?
>
> what changes need to be made to make the assertion not fire?
>
> On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy 
> wrote:
>
> > There's an assertion which is easily reproducible, and also there's a
> > crash including core dump, the latter is not easy to reproduce for me
> > in different environments. I have also seen mxnet getting stuck
> > without progressing with this build configuration and using no CPU at
> > all when running unit tests.
> >
> > In my view, the root cause of the assertion is that we are re-entering
> > OMP initialization when spawning threads on the following code through
> > pthread_atfork
> >
> > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> >
> > This causes double initialization of the OMP engine, including the
> > assertion which you are asking about,  and I suspect some additional
> > overhead. That's the shady forking part you are asking for.
> >
> > A question for you: What is the cause of runtime differences between
> > OMP runtimes? Shouldn't the implementation overhead diminish as
> > threads run longer?
> >
> > Pedro.
> >
> > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier 
> > wrote:
> > >
> > > What’s the reason for the assertion failure? btw classifying an assertion
> > > failure a “crash” is debatable. As I stated in the original issue a long
> > > time ago, it's possible something shady is being done when forking
> > > that should be fixed.  The assertion should be root caused.
> > >
> > >
> > >
> > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > Added a dockerfile, and reports of a crash in my local machine when
> > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well.
> > > > I couldn't reproduce the crash on my EC2 machine:
> > > > Added the backtrace of the crash as well.
> > > >
> > > > https://github.com/apache/incubator-mxnet/issues/10856
> > > >
> > > > Dockerfile here:
> > > >
> > > > https://github.com/larroy/mxnet_omp
> > > >
> > > > Kind regards.
> > > >
> > > > Pedro.
> > > >
> > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> > marco.g.ab...@gmail.com>
> > > > wrote:
> > > > >
> > > > > As already proposed, I think the easiest way to get a common
> > > > understanding
> > > > > is if we start with a few docker containers. Pedro, would it be
> > possible
> > > > > for you to wrap your benchmarks into a few containers that will
> > produce
> > > > > your shown results? That way, we can avoid possible
> > misunderstandings and
> > > > > also pinpoint the exact parts where people disagree or misunderstood
> >

Re: OMP

2019-06-24 Thread Pedro Larroy
There's an assertion which is easily reproducible, and also there's a
crash including core dump, the latter is not easy to reproduce for me
in different environments. I have also seen mxnet getting stuck
without progressing with this build configuration and using no CPU at
all when running unit tests.

In my view, the root cause of the assertion is that we are re-entering
OMP initialization when spawning threads on the following code through
pthread_atfork

https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58

This causes double initialization of the OMP engine, including the
assertion which you are asking about,  and I suspect some additional
overhead. That's the shady forking part you are asking for.

A question for you: What is the cause of runtime differences between
OMP runtimes? Shouldn't the implementation overhead diminish as
threads run longer?

Pedro.

On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier  wrote:
>
> What’s the reason for the assertion failure? btw classifying an assertion
> failure a “crash” is debatable. As I stated in the original issue a long
> time ago, it's possible something shady is being done when forking
> that should be fixed.  The assertion should be root caused.
>
>
>
> On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy 
> wrote:
>
> > Added a dockerfile, and reports of a crash in my local machine when
> > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well.
> > I couldn't reproduce the crash on my EC2 machine:
> > Added the backtrace of the crash as well.
> >
> > https://github.com/apache/incubator-mxnet/issues/10856
> >
> > Dockerfile here:
> >
> > https://github.com/larroy/mxnet_omp
> >
> > Kind regards.
> >
> > Pedro.
> >
> > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu 
> > wrote:
> > >
> > > As already proposed, I think the easiest way to get a common
> > understanding
> > > is if we start with a few docker containers. Pedro, would it be possible
> > > for you to wrap your benchmarks into a few containers that will produce
> > > your shown results? That way, we can avoid possible misunderstandings and
> > > also pinpoint the exact parts where people disagree or misunderstood each
> > > other.
> > >
> > > -Marco
> > >
> > > Pedro Larroy  wrote on Thu., 20 June 2019,
> > > 21:47:
> > >
> > > > I can confirm that we are linking with two versions of omp, I'm
> > > > gaining more clarity into this topic, but I still have questions; the
> > > > facts that I have gathered so far are the following:
> > > >
> > > > * #1: We are linking with two versions of omp, intel's omp and llvm
> > > > openmp when building with MKL enabled.
> > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes with
> > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > > > one is used on the PR proposed by Anton).
> > > >
> > > > Questions:
> > > >
> > > >  * #1 Is it ok to have two versions of openmp linked at the same time?
> > > >  * #2 Which implementation of OMP gives the best performance?  (See
> > > > total training time of my measurement for a partial answer)
> > > >  * #3 Should we have a build flag so we can choose the OMP version at
> > > > runtime?
> > > >  * #4 Which Compiler and build flags did Chris use to get 10x slowdown?
> > > >  * #5 @Stas: is there a script to replicate your benchmarks easily? If
> > > > so could you provide a link?  I think we would need to reproduce your
> > > > benchmarks and verify which versions are being linked. It's possible
> > > > that while compiling with MKL intel's omp was pulled in instead of
> > > > GNU OpenMP.
> > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> > > > update the subrepo regularly?
> > > >
> > > > My conclusion so far:
> > > >
> > > >  * #1 We should avoid linking two versions of omp if possible and
> > > > allow users to choose one in the build as we do for BLAS.
> > > >  * #2 For performance reasons and more control vs different compiler
> > > > versions seems it makes indeed sense to keep the LLVM OpenMP version
> > > > in 3rdparty for now. So unless some more data is gathered, it makes
> > > > sense not to remove it as of now.
> > > >  * #3 We should provide build options to choose which openmp library
> > > > is to be used from the three options available, including libgomp.

Re: OMP

2019-06-24 Thread Pedro Larroy
Added a dockerfile, and reports of a crash in my local machine when
running MKL+OMP+DEBUG, with Anton's branch the crash happened as well.
I couldn't reproduce the crash on my EC2 machine:
Added the backtrace of the crash as well.

https://github.com/apache/incubator-mxnet/issues/10856

Dockerfile here:

https://github.com/larroy/mxnet_omp

Kind regards.

Pedro.

On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu  wrote:
>
> As already proposed, I think the easiest way to get a common understanding
> is if we start with a few docker containers. Pedro, would it be possible
> for you to wrap your benchmarks into a few containers that will produce
> your shown results? That way, we can avoid possible misunderstandings and
> also pinpoint the exact parts where people disagree or misunderstood each
> other.
>
> -Marco
>
> Pedro Larroy  wrote on Thu., 20 June 2019,
> 21:47:
>
> > I can confirm that we are linking with two versions of omp, I'm
> > gaining more clarity into this topic, but I still have questions; the
> > facts that I have gathered so far are the following:
> >
> > * #1: We are linking with two versions of omp, intel's omp and llvm
> > openmp when building with MKL enabled.
> > * #2: We have 3 different possible OMP versions: Intel OMP (comes with
> > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > one is used on the PR proposed by Anton).
> >
> > Questions:
> >
> >  * #1 Is it ok to have two versions of openmp linked at the same time?
> >  * #2 Which implementation of OMP gives the best performance?  (See
> > total training time of my measurement for a partial answer)
> >  * #3 Should we have a build flag so we can choose the OMP version at
> > runtime?
> >  * #4 Which Compiler and build flags did Chris use to get 10x slowdown?
> >  * #5 @Stas: is there a script to replicate your benchmarks easily? If
> > so could you provide a link?  I think we would need to reproduce your
> > benchmarks and verify which versions are being linked. It's possible
> > that while compiling with MKL intel's omp was pulled in instead of
> > GNU OpenMP.
> >  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> > update the subrepo regularly?
> >
> > My conclusion so far:
> >
> >  * #1 We should avoid linking two versions of omp if possible and
> > allow users to choose one in the build as we do for BLAS.
> >  * #2 For performance reasons and more control vs different compiler
> > versions seems it makes indeed sense to keep the LLVM OpenMP version
> > in 3rdparty for now. So unless some more data is gathered, it makes
> > sense not to remove it as of now.
> >  * #3 We should provide build options to choose which openmp library
> > is to be used from the three options available, including libgomp.
> >  * #4 Refining the build we could also enable OpenMP in mac without
> > additional contortions (doesn't work as of today):
> > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> >  * #5 We should add different omp versions to our benchmarks and track
> > the performance, so this data is available for prescribing the best
> > build options and for binary releases.
> >
> > This is also an interesting related gh issue posted in the mkl-dnn
> > repository:  https://github.com/intel/mkl-dnn/issues/230
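
A minimal sketch of a runtime check for which OpenMP runtimes actually got
loaded into the process, complementing the ldd checks shown elsewhere in the
thread; this is a hypothetical helper, not part of MXNet:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE 1  // dl_iterate_phdr is a glibc extension
    #endif
    #include <link.h>
    #include <cstdio>
    #include <cstring>

    static int print_omp_libs(struct dl_phdr_info* info, size_t, void*) {
      if (info->dlpi_name != nullptr && std::strstr(info->dlpi_name, "omp") != nullptr)
        std::printf("loaded: %s\n", info->dlpi_name);
      return 0;  // returning 0 keeps iterating over loaded objects
    }

    int main() {
      dl_iterate_phdr(print_omp_libs, nullptr);
      return 0;
    }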
> >
> >
> > I don't observe the order of magnitude divergence reported by Chris in
> > vanilla Ubuntu 18.04 in samples / s but the full training finishes
> > indeed faster with the OMP from 3rdparty (LLVM openmp) vs libgomp.
> >
> > There are also differences in training time when using MKL; it's
> > actually a bit slower, though I don't know if it's related to OMP.
> >
> > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> >
> > Anton's branch:  g...@github.com:lebeg/incubator-mxnet.git   branch 'omp'
> > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > build/libmxnet.so |grep -i omp
> > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > (0x7fd99a51d000)
> >
> > time python train_mnist.py
> >
> > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > INFO:root:Epoch[19] Batch [0-100]   Speed: 41617.00 samples/sec
> >  accuracy=1.00
> > INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec
> >  accuracy=0.999531
> > INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec
> >  accuracy=0.999687
> > INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec
> >  accuracy=1.00
> > INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec
&

Re: OMP

2019-06-20 Thread Pedro Larroy
s/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [700-800] Speed: 44962.78 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [800-900] Speed: 44945.47 samples/sec
 accuracy=0.999375
INFO:root:Epoch[19] Train-accuracy=0.999717
INFO:root:Epoch[19] Time cost=1.367
INFO:root:Epoch[19] Validation-accuracy=0.982783
854.97user 847.21system 0:41.44elapsed 4106%CPU (0avgtext+0avgdata
1154348maxresident)k
0inputs+0outputs (0major+3624361minor)pagefaults 0swaps


MKL OFF:
(py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> grep -i MKL
cmake_options.yml
USE_MKL_IF_AVAILABLE: "OFF" # Use MKL if found
USE_MKLML_MKL: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
USE_MKL_IF_AVAILABLE AND (NOT APPLE)
USE_MKLDNN: "OFF" # Use MKLDNN variant of MKL (if MKL found) IF
USE_MKL_IF_AVAILABLE AND (NOT APPLE)
(py3_venv) piotr@ec2 cpu:0: ~/mxnet_master [master]> ldd
build/libmxnet.so |grep -i omp
libomp.so =>
/home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so
(0x7fb720c54000)

INFO:root:Epoch[18] Validation-accuracy=0.983479
INFO:root:Epoch[19] Batch [0-100]   Speed: 46784.02 samples/sec
 accuracy=1.00
INFO:root:Epoch[19] Batch [100-200] Speed: 48824.29 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [200-300] Speed: 49190.31 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [300-400] Speed: 51518.77 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [400-500] Speed: 51551.62 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [500-600] Speed: 49026.35 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Batch [600-700] Speed: 49002.46 samples/sec
 accuracy=0.999375
INFO:root:Epoch[19] Batch [700-800] Speed: 48980.55 samples/sec
 accuracy=0.999687
INFO:root:Epoch[19] Batch [800-900] Speed: 47402.56 samples/sec
 accuracy=0.999844
INFO:root:Epoch[19] Train-accuracy=0.999767
INFO:root:Epoch[19] Time cost=1.259
INFO:root:Epoch[19] Validation-accuracy=0.983181
755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata
1147008maxresident)k
0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps

Let me know what you think.

Link to the original PR: https://github.com/apache/incubator-mxnet/pull/12160

Thanks.

On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
 wrote:
>
> "if you’re linking in two then you’re doing something wrong." Correct,
> that's one thing I believe we've got consensus on.  So let's call that out
> as a bug to be fixed.
>
> Let's move forward with some reproducible numbers and then discuss the pros
> / cons of which particular OMP implementation we should use.
>
> On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy 
> wrote:
>
> > Hi Chris
> >
> > I would ask you to have a bit of patience and help us with your
> > experience in this matter. Nobody is ignoring anything, I think we are
> > individually gathering feedbacks and trying to understand the multiple
> > contributions done to this topic including yours, then go step by
> > step, understand what is going on and run experiments and report back
> > to the list or the corresponding github item. It was suggested by
> > Kellen to prepare some containers, this takes effort.
> >
> > Regarding your final comment, most of us also have many other things
> > to do and responsibilities even if our daytime jobs might involve
> > MXNet in some form or another. I think that's part of the privilege
> > and responsibility of working close with an open source project and
> > the magic of collaboration across organizations. Let's all be patient
> > and take some time to understand and reason about this topic which is
> > not simple. Since we decided to step back and gather more data let's
> > take time and do it properly.
> >
> > Personally I hope to find time to look again into this issue before
> > the end of the week.
> >
> > Thanks.
> >
> > Pedro.
> >
> > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier 
> > wrote:
> > >
> > > if you’re linking in two then you’re doing something wrong. You can see by
> > > my email yesterday that only one is linked in. This is also the case with
> > > the mkl version built by the Makefile — only the Intel OMP library is used
> > > (no libgomp).
> > >
> > > That being said, Do you have clear evidence that using Intel OMP is both
> > > problematic and the situation isn’t fixable?  The burden of proof is on the
> > > ones requesting the change — it is not my responsibility to justify the
> > > current state.  There must be something “terrible” and unfixable to justify
> > > a change.  I have seen no proof of this in all this time.

Re: OMP

2019-06-19 Thread Pedro Larroy
Hi Chris

I would ask you to have a bit of patience and help us with your
experience in this matter. Nobody is ignoring anything, I think we are
individually gathering feedbacks and trying to understand the multiple
contributions done to this topic including yours, then go step by
step, understand what is going on and run experiments and report back
to the list or the corresponding github item. It was suggested by
Kellen to prepare some containers, this takes effort.

Regarding your final comment, most of us also have many other things
to do and responsibilities even if our daytime jobs might involve
MXNet in some form or another. I think that's part of the privilege
and responsibility of working close with an open source project and
the magic of collaboration across organizations. Let's all be patient
and take some time to understand and reason about this topic which is
not simple. Since we decided to step back and gather more data let's
take time and do it properly.

Personally I hope to find time to look again into this issue before
the end of the week.

Thanks.

Pedro.

On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier  wrote:
>
> if you’re linking in two then you’re doing something wrong. You can see by
> my email yesterday that only one is linked in. This is also the case with
> the mkl version built by the Makefile — only the Intel OMP library is used
> (no libgomp).
>
> That being said, Do you have clear evidence that using Intel OMP is both
> problematic and the situation isn’t fixable?  The burden of proof is on the
> ones requesting the change — it is not my responsibility to justify the
> current state.  There must be something “terrible” and unfixable to justify
> a change.  I have seen no proof of this in all this time.
>
> On a side note, I mentioned a couple of things in my email yesterday that
> still are not being responded to (they were also ignored in the last
> incarnation of this “discussion” — I have enough experience in this matter to
> assume “discussion” is a waste of my time, seeing as I am not paid to
> “work on” mxnet like y’all are).
>
> -C
>
>
>
>
>
>
> On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > I've also quite often seen two versions of OpenMP linked.  I think we can
> > all agree we probably want to avoid linking in two libraries that do
> > effectively the same thing.
> >
> > The performance questions should be fairly straight forward to demonstrate
> > right?  Could we just collaborate on a few minimal Dockerfiles that show
> > (or don't show) Intel OpenMP performance speedups with the workloads Chris
> > is referencing?
> >
> > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > stanislav.tsuk...@gmail.com> wrote:
> >
> > > Hi, Chris!
> > >
> > > Stas here - I've gathered that performance data.
> > > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > > missing.
> > > Be assured, intentional misdirection was never the case.
> > >
> > > Thanks a lot for being constructive.
> > >
> > > > Turning Intel OMP on and off (and MKL as well, since it tends to pull in
> > > > omp, depending which one is linked in).
> > >
> > > We never ever considered turning MKL off. We are on the same page here -
> > > MKL is crucial for the performance.
> > > Why should we? There's a GOMP-linked version of MKL that we can use.
> > >
> > > What we did was measure whether using the compiler's default OpenMP
> > > implementation instead of the referenced source-code distribution of OpenMP
> > > makes anything slower.
> > > We have found the impact to be hardly measurable.
> > > The difference between GOMP and iOMP is <5% on our benchmarks, most of the
> > > time less than that.
> > >
> > > We just suggest to simplify the build of mxnet, by removing the
> > > unnecessary dependency.
> > >
> > > During that we discovered for example the following amazing issue:
> > > https://github.com/apache/incubator-mxnet/issues/14087
> > >
> > > Best Regards
> > >
> > > Stas
> > >
> > > On 18.06.19, 18:24, "Chris Olivier"  wrote:
> > >
> > > I am very reluctant to feed the trolls again, and this will be the last
> > > time I address Pedro or Anton on the subject, but since I think the
> > > numbers
> > > being presented are incorrect (either by the builders not really
> > > understanding what they are building, or possibly intentional
> > > misdirection):
> > >
> > > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> > > in
> > > omp, depending which one is linked in).
> > > There is a HUGE difference.  This is consistent with my experience
> > > before
> > > when it was added.
> > >
> > >
> > > default mnist:
> > >
> > > python ../example/image-classification/train_mnist.py
> > > INFO:root:start with arguments Namespace(add_stn=False,
> > batch_size=64,
> > > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> > > gpus=None, image_shape='1, 28, 

Re: OMP

2019-06-19 Thread Pedro Larroy
+1 Would be best to have a controlled environment so we can reason
about how MXNet is being built and what libraries are linked. I'm
happy to help here. I would think docker won't have a big impact on
the measurement or distort the results much.


On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland
 wrote:
>
> I've also quite often seen two versions of OpenMP linked.  I think we can
> all agree we probably want to avoid linking in two libraries that do
> effectively the same thing.
>
> The performance questions should be fairly straight forward to demonstrate
> right?  Could we just collaborate on a few minimal Dockerfiles that show
> (or don't show) Intel OpenMP performance speedups with the workloads Chris
> is referencing?
>
> On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> stanislav.tsuk...@gmail.com> wrote:
>
> > Hi, Chris!
> >
> > Stas here - I've gathered that performance data.
> > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > missing.
> > Be assured, intentional misdirection was never the case.
> >
> > Thanks a lot for being constructive.
> >
> > > Turning Intel OMP on and off (and MKL as well, since it tends to pull in
> > > omp, depending which one is linked in).
> >
> > We never ever considered turning MKL off. We are on the same page here -
> > MKL is crucial for the performance.
> > Why should we? There's a GOMP-linked version of MKL that we can use.
> >
> > What we did was measure whether using the compiler's default OpenMP
> > implementation instead of the referenced source-code distribution of OpenMP
> > makes anything slower.
> > We have found the impact to be hardly measurable.
> > The difference between GOMP and iOMP is <5% on our benchmarks, most of the
> > time less than that.
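
A minimal sketch of the kind of micro-benchmark that could be built once
against libgomp and once against the LLVM/Intel runtime (e.g. with -fopenmp and
the appropriate link flags) to compare overhead; sizes and iteration counts are
arbitrary, and this is not the benchmark referred to above:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
      std::vector<double> v(1 << 22, 1.0);
      double sum = 0.0;
      auto start = std::chrono::steady_clock::now();
      for (int iter = 0; iter < 100; ++iter) {
        // The parallel-for overhead is what differs between OMP runtimes.
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < static_cast<long>(v.size()); ++i) sum += v[i];
      }
      auto stop = std::chrono::steady_clock::now();
      std::printf("sum=%.1f elapsed=%.3fs\n", sum,
                  std::chrono::duration<double>(stop - start).count());
      return 0;
    }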
> >
> > We just suggest to simplify the build of mxnet, by removing the
> > unnecessary dependency.
> >
> > During that we discovered for example the following amazing issue:
> > https://github.com/apache/incubator-mxnet/issues/14087
> >
> > Best Regards
> >
> > Stas
> >
> > On 18.06.19, 18:24, "Chris Olivier"  wrote:
> >
> > I am very reluctant to feed the trolls again, and this will be the last
> > time I address Pedro or Anton on the subject, but since I think the
> > numbers
> > being presented are incorrect (either by the builders not really
> > understanding what they are building, or possibly intentional
> > misdirection):
> >
> > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> > in
> > omp, depending which one is linked in).
> > There is a HUGE difference.  This is consistent with my experience
> > before
> > when it was added.
> >
> >
> > default mnist:
> >
> > python ../example/image-classification/train_mnist.py
> > INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
> > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> > gpus=None, image_shape='1, 28, 28', initializer='default',
> > kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
> > lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
> > monitor=0, network='mlp', num_classes=10, num_epochs=20,
> > num_examples=6, num_layers=None, optimizer='sgd',
> > profile_server_suffix='', profile_worker_suffix='', save_period=1,
> > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear',
> > wd=0.0001)
> >
> > INTEL OMP:
> >
> > ldd libmxnet.so | grep omp
> > libomp.so =>
> > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > (0x7f978fde7000)
> >
> > :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec
> > accuracy=0.780012
> > INFO:root:Epoch[0] Batch [100-200]  Speed: 16073.21 samples/sec
> > accuracy=0.920469
> > INFO:root:Epoch[0] Batch [200-300]  Speed: 19075.91 samples/sec
> > accuracy=0.928281
> > INFO:root:Epoch[0] Batch [300-400]  Speed: 23211.36 samples/sec
> > accuracy=0.942813
> > INFO:root:Epoch[0] Batch [400-500]  Speed: 22139.79 samples/sec
> > accuracy=0.938750
> > INFO:root:Epoch[0] Batch [500-600]  Speed: 23225.52 samples/sec
> > accuracy=0.946562
> > INFO:root:Epoch[0] Batch [600-700]  Speed: 19547.41 samples/sec
> > accuracy=0.953281
> > INFO:root:Epoch[0] Batch [700-800]  Speed: 24111.73 samples/sec
> > accuracy=0.951562
> > INFO:root:Epoch[0] Batch [800-900]  Speed: 13959.88 samples/sec
> > accuracy=0.957500
> > INFO:root:Epoch[0] Train-accuracy=0.925423
> > INFO:root:Epoch[0] Time cost=3.806
> > INFO:root:Epoch[0] Validation-accuracy=0.962580
> > INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec
> > accuracy=0.968131
> > INFO:root:Epoch[1] Batch [100-200]  Speed: 23457.03 samples/sec
> > accuracy=0.966250
> >
> >
> > LIBGOMP:
> >
> > ldd libmxnet.so | grep omp
> > libgomp.so.1 => 
