Re: Severe legal issues with releases on repository.apache.org

2020-05-08 Thread Chris Olivier
Do the GPU builds actually include the NVIDIA CUDA libraries such as
libcudart.so, or just link against them and expect them to be on the machine?


On Fri, May 8, 2020 at 1:50 PM Lausen, Leonard 
wrote:

> Hi all,
>
> repository.apache.org is an official Apache Software Foundation release
> channel, and the MXNet project has been publishing convenience binaries via
> that channel for quite a while. Unfortunately, it appears that no one has
> initiated a license review of these convenience binaries, and they are
> incompatible with the ASF requirements. They should never have been
> uploaded.
>
> I recently reached out to Legal to inquire about this issue [1], and the
> Legal team recommends remedying the situation ASAP.
>
> Here are two issues, out of a potentially larger set:
>
> 1) There are GPU builds (mxnet-full_2.11-linux-x86_64-gpu) incorporating
> the CUDA SDK and possibly cuDNN, placing the resulting libmxnet.so under
> the CUDA EULA and cuDNN SLA. These agreements contain many restrictions,
> making them Category-X licenses [2]. No Apache project may under any
> circumstances redistribute such binaries.
>
> 2) All builds redistribute libgfortran.so, which is part of the GNU
> Fortran compiler (part of GCC) and subject to the GPL. The GPL is also a
> Category-X license, and the same restrictions apply.
>
> I see the following two potential remedies:
>
> 1) Ask the Infra team to delete all MXNet releases on
> repository.apache.org
>
> 2) Ask the Infra team to delete all MXNet GPU releases on
> repository.apache.org and provide replacement releases without
> libgfortran.so and other potentially Category-X files (I found
> libmkl_ml.so in one of the JARs...)
>
> If no one steps up to do 2), or no one suggests a better option, I
> recommend we go for option 1). Let's start discussing the options. Once
> discussion has settled, I'll initiate a lazy consensus or vote session.
>
> Note that these license rules apply to MXNet as part of the ASF. Third
> parties (individuals or companies) may redistribute binary builds of MXNet
> incorporating Category-X licensed components, IF the builds are
> appropriately labeled and no ASF trademarks or branding are infringed.
>
> As for the GPU builds, NVidia or Amazon may be willing to provide
> third-party GPU builds. I opened another Jira ticket to see if such third
> parties could provide them and what considerations would need to be taken
> into account. [3]
> This is similar to the PyPI releases, which are third-party releases not
> performed by the MXNet project (though some legal questions remain open
> for them as well; in particular, our website does not disclaim that these
> are third-party releases).
>
> Best regards
> Leonard
>
> [1]: https://issues.apache.org/jira/browse/LEGAL-516
> [2]: https://www.apache.org/legal/resolved.html#category-x
> [3]: https://issues.apache.org/jira/browse/LEGAL-515
>


Re: Workflow proposal

2020-03-11 Thread Chris Olivier
My $0.02

We had this dual-branch model when I was at GE and it was problematic.
Among other things, the two branches would tend to diverge to a large
degree and you ended up just cherry-picking in stuff here and there, which
caused even more problems. The model also allows the secondary branch to
get pretty buggy -- human nature being what it is -- to the point where
it's difficult to merge it into master without freezing both branches,
stabilizing, merging into master, then stabilizing again (small things
almost certainly went into master in the meantime -- hotfixes, critical
features, etc., while everything was on hold stabilizing the secondary
branch). It just doubled the work in the end, in my experience.

On Wed, Mar 11, 2020 at 5:47 PM Yuan Tang  wrote:

> Seconding the call not to introduce a dev branch. We should try to improve
> our release process instead and avoid another branch that may introduce
> confusion around the source of truth.
>
> On Wed, Mar 11, 2020 at 8:39 PM Tianqi Chen 
> wrote:
>
> > While the idea of staging seems reasonable, most OSS projects choose not
> > to do so because a complicated staging setup will likely confuse
> > contributors and increase the chance of divergence (between dev and
> > master).
> >
> > Given that we have a release model, in some sense the release itself
> > serves as a staging point.
> > A good approach would be to simply set up nightlies if necessary, strive
> > to fix regressions, and make sure the formal release addresses the
> > issues.
> >
> > TQ
> >
> > On Wed, Mar 11, 2020 at 5:32 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > Hi
> > >
> > > I talked to some people about this and they thought it would be a good
> > > idea, so I'm sharing it here:
> > >
> > > I would propose using a staging or "dev" branch on which nightly &
> > > performance tests are run periodically, with the branch then merged to
> > > master. The goal of this workflow would be to avoid having regressions
> > > and nightly failures creeping into master. PRs would get merged into
> > > dev, and dev would be promoted periodically / nightly into master.
> > >
> > > The names can be swapped as well, between dev and master, so PRs get
> > > merged into master and the workflow doesn't change, and staging is the
> > > branch into which nightly changes are merged.
> > >
> > > Has this been considered?
> > >
> > > Pedro.
> > >
> >
>
>
> --
> Yuan Tang
> https://terrytangyuan.github.io/about/ 
> 
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2

2020-02-04 Thread Chris Olivier
When "fixing", please "fix" through actual root-cause analysis (use gdb,
for instance) and not simply by guesswork and cutting out things which
probably aren't actually at fault (blaming an OMP library that's in
worldwide distribution int he billions should be treated with great
skepticism).

On Tue, Feb 4, 2020 at 10:44 AM Lin Yuan  wrote:

> Pedro,
>
> While I agree with you we need to fix this usability issue, I don't think
> this is a release blocker as Przemek mentioned above. Could we fix this in
> the next minor release?
>
> Thanks,
>
> Lin
>
> On Tue, Feb 4, 2020 at 10:38 AM Pedro Larroy
> wrote:
>
> > Right. Would it be possible to have the CMake build also use libgomp for
> > consistency with the releases until these issues are resolved?
> > This can affect anyone compiling the distribution with CMake and also
> > happens randomly in CI, worsening the contributor experience due to CI
> > failures.
> >
> > On Tue, Feb 4, 2020 at 9:33 AM Przemysław Trędak 
> > wrote:
> >
> > > Hi Pedro,
> > >
> > > From the issue that you linked it seems that you are using the LLVM
> > > OpenMP, whereas I believe the actual release uses libgomp (at least
> > that's
> > > what seems to be the conclusion from this issue:
> > > https://github.com/apache/incubator-mxnet/issues/16891)?
> > >
> > > Przemek
> > >
> > > On 2020/02/04 03:42:30, Pedro Larroy 
> > > wrote:
> > > > -1
> > > >
> > > > Unit tests passed in CPU build.
> > > >
> > > > I observe crashes related to openmp using cpp unit tests:
> > > >
> > > > https://github.com/apache/incubator-mxnet/issues/17043
> > > >
> > > > Pedro.
> > > >
> > > > On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat
> > > wrote:
> > > >
> > > > > +1
> > > > > Successfully built MXNet 1.6.0rc2 on Linux
> > > > > Tested for OpPerf utility
> > > > > For CPU -
> > > > >
> https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
> > > > >
> > > > > Works well!
> > > > >
> > > > >
> > > > >
> > > > > On Mon, 3 Feb 2020 at 15:43, Lin Yuan  wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Tested Horovod with mnist example. My compiler flags are below:
> > > > > >
> > > > > > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔
> > > CPU_SSE2,
> > > > > ✔
> > > > > > CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖
> > > > > CPU_AVX2, ✔
> > > > > > OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖
> > > > > BLAS_MKL, ✖
> > > > > > BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔
> > > > > > DIST_KVSTORE, ✖ CXX14, ✖ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖
> > > DEBUG, ✖
> > > > > > TVM_OP]
> > > > > >
> > > > > > Lin
> > > > > >
> > > > > > On Sat, Feb 1, 2020 at 9:55 PM Tao Lv  wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > I tested below items:
> > > > > > > 1. download artifacts from Apache dist repo;
> > > > > > > 2. the signature looks good;
> > > > > > > 3. build from source code with MKL-DNN and MKL on centos;
> > > > > > > 4. run fp32 and int8 inference of ResNet50 under
> > > > > /example/quantization/.
> > > > > > >
> > > > > > > thanks,
> > > > > > > -tao
> > > > > > >
> > > > > > > On Sun, Feb 2, 2020 at 11:00 AM Tao Lv 
> wrote:
> > > > > > >
> > > > > > > > I see. I was looking at this page:
> > > > > > > >
> > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > > > > > >
> > > > > > > > On Sun, Feb 2, 2020 at 4:54 AM Przemysław Trędak <ptre...@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi Tao,
> > > > > > > >>
> > > > > > > >> Could you tell me where you looked for it and did not find
> > > > > > > >> it? I just checked and both
> > > > > > > >>
> > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> > > > > and
> > > > > > > >> draft of the release on GitHub have them.
> > > > > > > >>
> > > > > > > >> Thank you
> > > > > > > >> Przemek
> > > > > > > >>
> > > > > > > >> On 2020/02/01 14:23:11, Tao Lv  wrote:
> > > > > > > >> > It seems the src tar and signature are missing from the
> tag.
> > > > > > > >> >
> > > > > > > >> > On Fri, Jan 31, 2020 at 11:09 AM Przemysław Trędak <
> > > > > > > ptre...@apache.org>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > > Dear MXNet community,
> > > > > > > >> > >
> > > > > > > >> > > This is the vote to release Apache MXNet (incubating)
> > > version
> > > > > > 1.6.0.
> > > > > > > >> > > Voting starts today and will close on Monday 2/3/2020
> > 23:59
> > > PST.
> > > > > > > >> > >
> > > > > > > >> > > Link to release notes:
> > > > > > > >> > >
> > > > > > >
> > > https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
> > > > > > > >> > >
> > > > > > > >> > > Link to release candidate:
> > > > > > > >> > >
> > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > > > > > >> > >
> > > > > > > >> > > Link to source and signatures on apache dist server:
> > > > > > > >> > >
> > > > > 

Re: Stop redistributing source code of 3rdparty dependencies to avoid licensing issues

2020-01-17 Thread Chris Olivier
+1

On Fri, Jan 17, 2020 at 10:19 AM Lausen, Leonard 
wrote:

> Dear MXNet community,
>
> as per a recent mail on gene...@incubator.apache.org [1], there are a
> number of licensing issues in MXNet 1.6rc1. Based on anecdotal evidence, I
> believe there has been no release so far without any licensing issues,
> which is a blocker to MXNet graduating from its incubating status. One
> contributing factor is that we bundle 3rdparty source code in our releases
> [2].
>
> One key factor is that 3rdparty projects don't always enforce licensing
> best practices the way we do. For example, 3rdparty/ps-lite doesn't
> enforce license headers in the source files, and there has been confusion
> about the license of recent contributions by ByteDance (see [1]).
>
> To avoid such licensing issues in MXNet releases, a simple solution is to
> stop distributing the 3rdparty code in our source releases. Instead, we
> can adapt our build system to download the 3rdparty code as part of the
> build configuration process. CMake makes this very easy with the
> FetchContent module [3].
>
> For development purposes involving changes to the 3rdparty source, or for
> build environments that can't access the internet, there are easy means of
> specifying the location of local sources (instead of downloading), via the
> FETCHCONTENT_SOURCE_DIR_ variable [4].
>
> Would there be any concerns about such an approach? Obviously it can only
> be fully implemented once the CMake build system is feature-complete and
> the Makefile build can be dropped. (Note that the Makefile build is being
> deprecated and removed as part of the MXNet 2 roadmap [5])
>
> Best regards
> Leonard
>
> [1]:
>
> https://lists.apache.org/thread.html/rb83ff64bdac464df2f0cf2fe8fb4c6b9d3b8fa62b645763dc606045f%40%3Cgeneral.incubator.apache.org%3E
> [2]: See the .tar.gz files at
> https://incubator.apache.org/clutch/mxnet.html
> [3]: https://cmake.org/cmake/help/latest/module/FetchContent.html
> [4]: https://cmake.org/pipermail/cmake/2019-June/069709.html
> [5]: https://github.com/apache/incubator-mxnet/issues/16167
>


Re: Stopping nightly releases to Pypi

2020-01-09 Thread Chris Olivier
If this tiny fix is representative of the bulk of the reasoning behind all
the CD churn recently, then this seems to be of some concern.

-Chris

On Thu, Jan 9, 2020 at 6:32 AM Marco de Abreu 
wrote:

> Great, thanks a lot Sheng!
>
> -Marco
>
> Sheng Zha  schrieb am Do., 9. Jan. 2020, 14:28:
>
> > I'm fixing the CD pipeline in
> > https://github.com/apache/incubator-mxnet/pull/17259/files and will
> > update the s3 publish path so that it's friendly for automatically
> > generating such a page.
> >
> > -sz
> >
> > On 2020/01/06 18:19:52, "Lausen, Leonard" 
> > wrote:
> > > Consider a user who finds a bug in a nightly version, but we can't
> > > narrow down the version of mxnet used because the name is constant
> > > over time. Or users want to revert back to the previously installed
> > > nightly version but don't know which date it was from, due to the
> > > constant name.
> > >
> > > Instead I suggest we introduce an autogenerated page like
> > > https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
> > >
> > > Then "pip install -f URLTOPAGE mxnet" will install the latest
> > > available version.
> > > Maybe the team maintaining the S3 bucket can reconsider creating such
> > > a page for the interim period until the CD-based nightly build is
> > > operating.
> > >
> > > On Mon, 2020-01-06 at 10:01 -0800, Lin Yuan wrote:
> > > > +1 for a nightly pip with fixed name.
> > > >
> > > > We need this to track mxnet integration with other packages such as
> > Horovod.
> > > >
> > > > Sam, when do you think we can have this nightly build with a fixed
> > name?
> > > >
> > > > Thanks,
> > > >
> > > > Lin
> > > >
> > > > On Sun, Jan 5, 2020 at 7:48 PM Skalicky, Sam
> > 
> > > > wrote:
> > > >
> > > > > Hi Tao,
> > > > >
> > > > > We don't have this yet, but we did think about putting the latest
> > > > > wheels in a specific place in the s3 bucket so they are always
> > > > > updated. Initially we decided not to do this since the main MXNet
> > > > > CD should have been fixed. But since it's still not fixed yet, we
> > > > > might try and go ahead and do this.
> > > > >
> > > > > Sam
> > > > >
> > > > > On Jan 5, 2020, at 6:02 PM, Lv, Tao A <tao.a...@intel.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > How to install the latest available build of a flavor without
> > specifying
> > > > > the build date? Something like `pip install mxnet --pre`.
> > > > >
> > > > > Thanks,
> > > > > -tao
> > > > >
> > > > > -----Original Message-----
> > > > > From: Skalicky, Sam <sska...@amazon.com.INVALID>
> > > > > Sent: Monday, January 6, 2020 2:09 AM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Subject: Re: Stopping nightly releases to Pypi
> > > > >
> > > > > Hi Haibin,
> > > > >
> > > > > You typed the correct URLs; the cu100 build has been failing since
> > > > > December 30th but other builds have succeeded. The wheels are being
> > > > > delivered into a public bucket that anyone with an AWS account can
> > > > > access and go poke around in; here’s the URL for web access:
> > > > >
> > > > >
> > > > >
> >
> https://s3.console.aws.amazon.com/s3/buckets/apache-mxnet/dist/?region=us-west-2&tab=overview
> > > > >
> > > > > You will have to log into your AWS account to access it however
> > (which
> > > > > means you’ll need an AWS account).
> > > > >
> > > > > It looks like only the following flavors are available for
> > 2020-01-01:
> > > > > mxnet
> > > > > mxnet-cu92
> > > > > mxnet-cu92mkl
> > > > > mxnet-mkl
> > > > >
> > > > > Sam
> > > > >
> > > > > On Jan 4, 2020, at 9:06 PM, Haibin Lin <haibin.lin@gmail.com> wrote:
> > > > >
> > > > > I was trying the nightly builds, but none of them is available:
> > > > >
> > > > > pip3 install
> > > > > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2020-01-01/dist/mxnet_cu100-1.6.0b20200101-py2.py3-none-manylinux1_x86_64.whl
> > > > > --user
> > > > >
> > > > > pip3 install
> > > > > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2020-01-02/dist/mxnet_cu100-1.6.0b20200102-py2.py3-none-manylinux1_x86_64.whl
> > > > > --user
> > > > >
> > > > > pip3 install
> > > > > https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2020-01-03/dist/mxnet_cu100-1.6.0b20200103-py2.py3-none-manylinux1_x86_64.whl
> > > > > --user
> > > > >
> > > > > pip3 install

Re: Stopping nightly releases to Pypi

2020-01-03 Thread Chris Olivier
I am curious what reasoning went into a non-community entity deploying what
are effectively de-facto public builds of an Apache project, "temporary" or
not.  Was this discussed on the dev list?  BTW, I don't buy this
"temporary" thing -- "temporary" has a bad habit of becoming "permanent".
Also, I challenge the logic behind "We built something that violates Apache
guidelines because no one else was doing it".

-Chris



On Fri, Jan 3, 2020 at 9:42 AM Skalicky, Sam 
wrote:

> Hi Marco,
>
> I don’t think anyone wants only Amazonians to control access to the
> system. However, no one has stepped up to help develop one that the
> community can maintain. Sure there has been some work here or there but
> nothing consistent. I think what we’re waiting for is someone to volunteer
> and commit to actually spending time and writing the code and getting this
> done.
>
> Are you volunteering to do this, or are you willing to find someone who
> is?
>
> There is nothing to veto here. There was a problem with CD, we came up
> with a short-term fix, and are waiting for the community to finish the
> Jenkins CD so that the community can go back to maintaining the system.
>
> Sam
>
> > On Jan 3, 2020, at 6:56 AM, Marco de Abreu 
> wrote:
> >
> > Agreed, but the question of how a non-Amazonian is able to maintain and
> > access the system is still open. As it stands right now, the community
> > has taken a step back and loses some control if we continue down that
> > road.
> >
> > I personally disapprove of that approach, since committers are no
> > longer in control of that process. So far it seems like my questions
> > were skipped and further actions have been taken. As openness and the
> > community having control are part of our graduation criteria, I'm
> > putting in my veto with a grace period until the 15th of January. Please
> > bring the system into a state that aligns with Apache values or revert
> > the changes.
> >
> > -Marco
> >
> > Pedro Larroy  schrieb am Fr., 3. Jan.
> 2020,
> > 03:33:
> >
> >> CD should be separate from CI for security reasons in any case.
> >>
> >>
> >> On Sat, Dec 7, 2019 at 10:04 AM Marco de Abreu  >
> >> wrote:
> >>
> >>> Could you elaborate how a non-Amazonian is able to access, maintain,
> >>> and review the CodeBuild pipeline? How come we've diverged from the
> >>> community-agreed standard where the public Jenkins serves the purpose
> >>> of testing and releasing MXNet? I'd be curious about the issues you're
> >>> encountering with Jenkins CI that led to a non-standard solution.
> >>>
> >>> -Marco
> >>>
> >>>
> >>> Skalicky, Sam  schrieb am Sa., 7. Dez.
> 2019,
> >>> 18:39:
> >>>
>  Hi MXNet Community,
> 
>  We have been working on getting nightly builds fixed and made
> available
>  again. We’ve made another system using AWS CodeBuild & S3 to work
> >> around
>  the problems with Jenkins CI, PyPI, etc. It is currently building all
> >> the
>  flavors and publishing to an S3 bucket here:
> 
> 
> >>>
> >>
> https://us-west-2.console.aws.amazon.com/s3/buckets/apache-mxnet/dist/?region=us-west-2
> 
>  There are folders for each set of nightly builds, try out the wheels
>  starting today 2019-12-07. Builds start at 1:30am PT (9:30am GMT) and
>  arrive in the bucket 30min-2hours later. Inside each folder are the
> >>> wheels
>  for each flavor of MXNet. Currently we’re only building for linux,
> >> builds
>  for windows/Mac will come later.
> 
>  If you want to download the wheels easily you can use a URL of the form:
>  https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/<YYYY-MM-DD>/dist/<flavor>-1.6.0b<YYYYMMDD>-py2.py3-none-manylinux1_x86_64.whl
> 
>  Here’s a set of links for today’s builds
> 
>  (Plain mxnet, no mkl no cuda)
>  https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
>  (mxnet-mkl)
>  https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_mkl-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
> 
>  (mxnet-cuXXX)
>  https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu90-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
>  https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu92-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl
>  https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/2019-12-07/dist/mxnet_cu100-1.6.0b20191207-py2.py3-none-manylinux1_x86_64.whl

Re: Performance regression from removing libiomp5.so

2019-12-13 Thread Chris Olivier
The follow-up PR shouldn’t affect this directly, since the problem was taken
care of with the first PR; it was just something I did along the way that is
probably a good idea in general.

The dual linking can probably be addressed if it is found that the
performance returns. I would suggest that when doing a performance test, be
sure the build doesn’t have both linked in, because the one it uses may
happen to be the first one loaded, so the performance might look the same in
that case if a dependency uses libgomp.
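If it helps with that check, below is a small diagnostic sketch -- my own
illustration, not part of MXNet -- that reports which shared object actually
provides the OpenMP runtime in a running process. It assumes Linux/glibc and
builds with something like "g++ -fopenmp check_omp.cpp -ldl" (the file name
is made up):

// Illustration only, not MXNet code: report which library supplies OpenMP.
#include <cstdio>
#include <dlfcn.h>
#include <omp.h>

int main() {
  // Touch OpenMP so the runtime is initialized and its symbols resolved.
  const int max_threads = omp_get_max_threads();

  // Look up the live omp symbol and map it back to the library that
  // provides it (e.g. libgomp.so.1 vs libomp.so vs libiomp5.so).
  void *sym = dlsym(RTLD_DEFAULT, "omp_get_max_threads");
  Dl_info info;
  if (sym && dladdr(sym, &info) && info.dli_fname) {
    std::printf("OpenMP runtime in use: %s (max threads: %d)\n",
                info.dli_fname, max_threads);
  } else {
    std::printf("could not resolve the OpenMP runtime\n");
  }
  return 0;
}

ldd only shows what a binary links against; this shows which runtime the
dynamic linker actually resolved, which is what matters when both libgomp.so
and libomp.so end up loaded into the same process.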


On Thu, Dec 12, 2019 at 9:48 PM Lausen, Leonard 
wrote:

> That error should be fixed by Chris's work at
> https://github.com/apache/incubator-mxnet/pull/17039
>
> It is currently expected that libmxnet.so transitively requires both
> libomp.so
> and libgomp.so. If this is an issue, we need to build OpenBLAS from source
> as
> part of our build scripts, because it introduces the libgomp.so
> requirement.
>
> Are there any other known issues currently?
>
> @Sam, do we have any performance comparison with the CMake build of 1.6
> (+ #17039 fix backported)?
>
> Chris also mentioned he still has a follow-up PR
>
> > I actually also did a change (removed from this PR in order to keep it
> > simple) that stops forcing the creation of the engine during static init
> > and the fork handlers, if it is desired. This has a couple of effects,
> > including reducing the overhead of resources and threads created simply
> > by the library being loaded, which then, in turn, causes a lot of thread
> > churn (Stop(), Start()) when forking. Even if they never use mxnet, we
> > spin up a lot of stuff unnecessarily. It actually gets a lot faster on
> > startup when this is done.
>
> Best regards
> Leonard
>
> On Fri, 2019-12-13 at 13:34 +0800, Tao Lv wrote:
> > Hi Chris,
> >
> > From the licensing standpoint, llvm omp is indeed a choice. But
> > previously we noticed that building mxnet with cmake and the llvm omp
> > under the 3rdparty folder will cause two runtimes to be linked into
> > libmxnet.so [1]. Do you think that's still a problem?
> >
> > Also, with the current two build systems, linking llvm omp means we need
> > to move the binary release process from make to cmake, which I think
> > needs more discussion in the community. It's not likely we can finish
> > that within the schedule of the 1.6.0 release.
> >
> > [1]
> >
> https://github.com/apache/incubator-mxnet/issues/11417#issuecomment-555413002
> >
> > On Thu, Dec 12, 2019 at 10:28 PM Chris Olivier 
> > wrote:
> >
> > > Hi Patric,
> > >
> > > The llvm openmp we compile (originally from same Intel source as we all
> > > know) seems to be Apache 2.0 licensed. Could we use that instead from a
> > > licensing standpoint?
> > >
> > > On Wed, Dec 11, 2019 at 10:36 PM Zhao, Patric 
> > > wrote:
> > >
> > > > Thanks, Sam.
> > > >
> > > > The root cause is the use of different OpenMP libraries. Intel
> > > > OpenMP will provide better performance, as your data shows.
> > > >
> > > > Regarding the release: because of the license issue [1], we can't
> > > > ship Intel OpenMP in the binary, but most of the performance boost
> > > > from MKLDNN is still available.
> > > > I think it should be acceptable to release 1.6 with MKLDNN + GNU
> > > > OpenMP, with suboptimal performance.
> > > >
> > > > To achieve the best performance, user should build from source to
> enable
> > > > more advanced features like Intel MKL, Intel OpenMP, AVX512.
> > > >
> > > > Thanks,
> > > >
> > > > --Patric
> > > >
> > > > [1] https://www.apache.org/legal/resolved.html#category-x
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Skalicky, Sam 
> > > > > Sent: Wednesday, December 11, 2019 1:36 PM
> > > > > To: dev@mxnet.incubator.apache.org
> > > > > Cc: Keshavan, Arjuna ; Harish, Nihal
> > > > > 
> > > > > Subject: Performance regression from removing libiomp5.so
> > > > >
> > > > > Hi MXNet community,
> > > > >
> > > > > I would like to bring your attention to the performance regression
> that
> > > > was
> > > > > found [1] between 1.5.1 and 1.6.0 due to removing the libiomp5.so
> > > library
> > > > > due to licensing issues. This change was made since thi

Re: Performance regression from removing libiomp5.so

2019-12-12 Thread Chris Olivier
Hi Patric,

The llvm openmp we compile (originally from same Intel source as we all
know) seems to be Apache 2.0 licensed. Could we use that instead from a
licensing standpoint?

On Wed, Dec 11, 2019 at 10:36 PM Zhao, Patric  wrote:

> Thanks, Sam.
>
> The root cause is the use of different OpenMP libraries. Intel OpenMP will
> provide better performance, as your data shows.
>
> Regarding the release: because of the license issue [1], we can't ship
> Intel OpenMP in the binary, but most of the performance boost from MKLDNN
> is still available.
> I think it should be acceptable to release 1.6 with MKLDNN + GNU OpenMP,
> with suboptimal performance.
>
> To achieve the best performance, user should build from source to enable
> more advanced features like Intel MKL, Intel OpenMP, AVX512.
>
> Thanks,
>
> --Patric
>
> [1] https://www.apache.org/legal/resolved.html#category-x
>
>
>
> > -----Original Message-----
> > From: Skalicky, Sam 
> > Sent: Wednesday, December 11, 2019 1:36 PM
> > To: dev@mxnet.incubator.apache.org
> > Cc: Keshavan, Arjuna ; Harish, Nihal
> > 
> > Subject: Performance regression from removing libiomp5.so
> >
> > Hi MXNet community,
> >
> > I would like to bring your attention to the performance regression [1]
> > found between 1.5.1 and 1.6.0 after the libiomp5.so library was removed
> > because of licensing issues. This change was made since the library has
> > a Category-X license [2] that is not compatible with the MXNet Apache
> > license/distribution.
> >
> > We found that using OpenBLAS instead of MKL BLAS caused a regression
> > from 1500 samples/sec to 1300 samples/sec, a 13.3% regression in
> > training speed, for a resnet18 training benchmark on a C5.18xlarge EC2
> > instance (with 72 cores). Rebuilding with MKL BLAS showed an increase in
> > performance to 1600 samples/sec on the 1.6.0 branch.
> >
> > Please provide your feedback on the licensing issue (are there any work-
> > arounds) and the tradeoff in performance (is the benefit worth trying to
> > include back into MXNet builds).
> >
> > Thanks to the efforts of the following folks for working on this issue
> (in no
> > particular order):
> > Patric Zhao
> > Amol Lele
> > Tao Lv A
> > Pedro Larroy
> > Nihal Harish
> > Chai Bapat
> > Arjuna Keshavan
> > Rong Zhang
> >
> > Thanks!
> > Sam
> >
> > [1] https://github.com/apache/incubator-mxnet/issues/16891
> > [2] https://www.apache.org/legal/resolved.html#category-x
>


Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Chris Olivier
Please answer the questions in my last email regarding the suspected issue
in mxnet, as well as those on the PR you opened.


On Sun, Dec 8, 2019 at 7:00 PM Lausen, Leonard 
wrote:

> The assertion failure in the MXNet DEBUG build goes away by updating LLVM
> OpenMP to the latest released version. All evidence I have points to the
> assertion failure being due to a bug in the 2-year-old UNRELEASED version
> of LLVM OpenMP that we are currently using in CMake builds.
>
> Thus I'm requesting 3 committers to approve
> https://github.com/apache/incubator-mxnet/pull/17012 to update to a
> released version of LLVM OpenMP.
>
> As described in the PR, the assertion is still part of LLVM OpenMP 9.0
> codebase.
> In particular look at lines
>
> https://github.com/llvm-mirror/openmp/blob/release_90/runtime/src/kmp_runtime.cpp#L6616
> and
>
> https://github.com/llvm-mirror/openmp/blob/37c72127e90360a020f351f18d9cccfc30e5145a/runtime/src/kmp_runtime.cpp#L6481
> where the latter is the line that currently fails in MXNet DEBUG build and
> the
> former is the equivalent line that doesn't fail in MXNet DEBUG builds after
> updating LLVM OpenMP.
>
>
>
> There is also a crash after forking that occurs with Intel OpenMP as well
> as with both the old UNRELEASED and the new released version of LLVM
> OpenMP. That crash doesn't go away and needs to be root-caused:
> https://github.com/apache/incubator-mxnet/issues/14979
>
>
> On Sun, 2019-12-08 at 16:27 -0800, Pedro Larroy wrote:
> > Hi Leonard.
> >
> > Are you saying that you have updated this library and the problems
> > described in the related tickets are no longer present?
> >
> > P.
> >
> > On Sunday, December 8, 2019, Lausen, Leonard 
> > wrote:
> > > Thanks Pedro and Chris for your responses.
> > >
> > > After further investigation I find:
> > >
> > > 1) I don't think
> https://github.com/apache/incubator-mxnet/issues/14979 is
> > > caused by any incompatibility between gomp and llvm / intel omp. Rather
> > it's
> > > simply a problem of llvm / intel omp. See my comment to the issue for
> the
> > > methodology to arrive at this claim.
> > >
> > > 2) Regarding the assertion failure when compiling with (llvm)
> > 3rdparty/openmp,
> > > it can be fixed by updating the by now 2 years old llvm openmp code to
> the
> > > newest released version. I went ahead and opened a PR
> > > https://github.com/apache/incubator-mxnet/pull/17012
> > >
> > > Based on the investigation described in 1), I think Chris is right
> > > that the assertion failure is not due to some interaction between gomp
> > > and llvm omp. However, I'm not sure about Chris's suggestion that the
> > > assertion failure is due to a bug in MXNet. In fact, the failure goes
> > > away when updating the llvm openmp code. So I think it's just due to a
> > > bug in the 2-year-old code.
> > >
> > > @Chris, I think updating 3rdparty/openmp to fix the assertion issue is
> not
> > > contentious. Thus let's do it via lazy consensus (72 hours) or just
> > approve the
> > > PR and merge it.
> > >
> > > Please also take a look at my comment at #14979 and let everyone know
> if
> > you see
> > > any option to fix the bug while keeping 3rdparty/openmp. As this bug
> > > affects an important use-case, I believe we need to remove
> > > 3rdparty/openmp from the CMake build as long as we don't find a
> > > solution for making #14979 work with 3rdparty/openmp.
> > >
> > > In fact, removing 3rdparty/openmp will then match the current Makefile
> > setup
> > > that according to my understanding is used to build the nightly
> releases
> > used by
> > > the majority of developers. Ie. most users actually don't use the CMake
> > build
> > > with 3rdparty/openmp. You can consider rescinding your veto on removing
> > > 3rdparty/openmp after reading through the evidence in that issue. If
> you
> > don't
> > > provide any evidence for why the methodology/conclusion in #14979 is
> > flawed, I
> > > will assume your previous veto is void based on Apache Voting rule as
> it
> > lacks
> > > technical justification and in any case was motivated by the assertion
> > issue,
> > > which I agree with you, is likely not due to gomp / omp interaction.
> > >
> > > Thank you
> > > Leonard
> > >
> > >
> > > On Sat, 2019-12-0

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Chris Olivier
BTW, the call stack I am referring to below is the one where I explained
this problem before; after I got a hostile response, I locked the issue.
On Sun, Dec 8, 2019 at 7:24 AM Chris Olivier  wrote:

> Again, here is what I suspect the bug is in mxnet:
>
> The way that advanced openmp libraries handle a fork is that they hook an
> atfork() callback in which, in the new process, it creates a new “team” of
> threads to use for its thread pool (since all of the thread handles in its
> data structure belong to the previous process). atfork() callback order is
> the order at which the callbacks are registered, which will tend to be the
> first call to the openmp library.  For this reason, the fork order will
> vary depending upon what other libraries might be linked in and whether
> they make omp calls before mxnet starts its static init.
>
> What the assert in question is trying to say is that mxnet code is calling
> into the omp library after a fork, but before the omp library’s atfork()
> handler is called, so the omp library has not yet initialized a new team
> of threads.  This looks to be the case in one of the call stacks on that
> issue. This is problematic for any openmp library which supports omp after
> a fork, and may not be deterministic from build to build, since the order
> of static init calls for a given module is undefined (I think mxnet is
> initializing omp during static init, but this may not matter).
>
> So if mxnet is doing that, it is a bug and remains a problem regardless of
> the omp library and probably should be fixed.  llvm omp happens to be nice
> enough to tell you you’re doing something wrong, at least when built in
> debug mode.
>
> Once this issue is resolved, we can discuss the library inclusion itself.
> My objection is “fixing” what appears to be a bug by effectively
> “commenting out the assert”, which is what I stated in the very beginning.
>
> It stands to reason that linking this or that library may affect the
> assert occurring because it’s not known at what time one of the dependent
> libraries initializes omp (thus causing it to hook its atfork handler), so
> it is not surprising that mucking with dependencies may cause the assert to
> occur or not occur.
>
> Is there another explanation for the call stack with the assert?  Can this
> bug be ruled out?
>
>
> Here is an example of the atfork team concept with libgomp as well.
> Probably you can check the current libgomp code itself but this explains
> the code:
> https://patchwork.ozlabs.org/patch/319827/
>
>
>
>
>
>
>
>
> On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard 
> wrote:
>
>> Thanks Pedro and Chris for your responses.
>>
>> After further investigation I find:
>>
>> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979
>> is
>> caused by any incompatibility between gomp and llvm / intel omp. Rather
>> it's
>> simply a problem of llvm / intel omp. See my comment to the issue for the
>> methodology to arrive at this claim.
>>
>> 2) Regarding the assertion failure when compiling with (llvm)
>> 3rdparty/openmp,
>> it can be fixed by updating the by now 2 years old llvm openmp code to the
>> newest released version. I went ahead and opened a PR
>> https://github.com/apache/incubator-mxnet/pull/17012
>>
>> Based on the investigation described in 1), I think Chris is right that
>> the
>> assertion failure is not due to some interaction between gomp and llvm
>> omp.
>> However, I'm not sure about Chris's suggestion that the assertion failure
>> is due
>> to a bug in MXNet. In fact, the failure goes away when updating the llvm
>> openmp
>> code. So I think it's just due to a bug in the 2 years old code.
>>
>> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
>> contentious. Thus let's do it via lazy consensus (72 hours) or just
>> approve the
>> PR and merge it.
>>
>> Please also take a look at my comment at #14979 and let everyone know if
>> you see
>> any option to fix the bug while keeping 3rdparty/openmp. As this bug
>> affects an
>> important use-case, I believe we need to remove 3rdparty/openmp from the
>> CMake
>> build as long as we don't find a solution for making #14979 work with
>> 3rdparty/openmp.
>>
>> In fact, removing 3rdparty/openmp will then match the current Makefile
>> setup
>> that according to my understanding is used to build the nightly releases
>> used by
>> the majority of developers. Ie. most users actually don't use the CMake
>> build
>> with 3rdparty/openmp. You can consider rescinding your veto on removing
>

Re: Please remove conflicting Open MP version from CMake builds

2019-12-08 Thread Chris Olivier
Again, here is what I suspect the bug is in mxnet:

The way that advanced openmp libraries handle a fork is that they hook an
atfork() callback in which, in the new process, it creates a new “team” of
threads to use for its thread pool (since all of the thread handles in its
data structure belong to the previous process). atfork() callback order is
the order at which the callbacks are registered, which will tend to be the
first call to the openmp library.  For this reason, the fork order will
vary depending upon what other libraries might be linked in and whether
they make omp calls before mxnet starts its static init.

What the assert in question is trying to say is that mxnet code is calling
into the omp library after a fork, but before the omp library’s atfork()
handler is called, so the omp library has not yet initialized a new team of
threads.  This looks to be the case in one of the call stacks on that
issue. This is problematic for any openmp library which supports omp after
a fork, and may not be deterministic from build to build, since the order
of static init calls for a given module is undefined (I think mxnet is
initializing omp during static init, but this may not matter).

So if mxnet is doing that, it is a bug and remains a problem regardless of
the omp library and probably should be fixed.  llvm omp happens to be nice
enough to tell you you’re doing something wrong, at least when built in
debug mode.

Once this issue is resolved, we can discuss the library inclusion itself.
My objection is “fixing” what appears to be a bug by effectively
“commenting out the assert”, which is what I stated in the very beginning.

It stands to reason that linking this or that library may affect the assert
occurring because it’s not known at what time one of the dependent
libraries initializes omp (thus causing it to hook its atfork handler), so
it is not surprising that mucking with dependencies may cause the assert to
occur or not occur.

Is there another explanation for the call stack with the assert?  Can this
bug be ruled out?


Here is an example of the atfork team concept with libgomp as well.
Probably you can check the current libgomp code itself but this explains
the code:
https://patchwork.ozlabs.org/patch/319827/
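To make that ordering concrete, here is a toy sketch of the mechanism -- my
own illustration in plain POSIX, not MXNet code or any OpenMP runtime's
actual implementation. Child-side atfork handlers run in registration
order, so a handler registered before the "runtime's" handler observes the
stale team (build with g++ sketch.cpp -lpthread):

// Toy illustration only: atfork handler ordering, not real runtime code.
#include <cstdio>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

// Stand-in for the runtime's bookkeeping: the pid that owns the current
// "team" of worker threads.
static pid_t team_owner = 0;

static void early_child() {
  // Runs first in the child because it was registered first. The "omp"
  // handler below has not run yet, so the team still belongs to the
  // parent's (dead) process image -- calling into the OpenMP runtime from
  // here is the window in which the debug assertion would fire.
  std::printf("early handler (pid %d): team owner %d -> stale\n",
              (int)getpid(), (int)team_owner);
}

static void omp_child() {
  // The "runtime" rebuilds its thread team for the new process.
  team_owner = getpid();
  std::printf("omp handler (pid %d): team rebuilt\n", (int)getpid());
}

int main() {
  team_owner = getpid();  // first "OpenMP call" creates the team, and...
  // ...whichever library initializes first registers its handler first:
  pthread_atfork(nullptr, nullptr, early_child);
  pthread_atfork(nullptr, nullptr, omp_child);

  if (fork() == 0) _exit(0);  // both child handlers run inside fork()
  wait(nullptr);
  return 0;
}

Swapping the two pthread_atfork() calls removes the staleness window, which
is why the assert's appearance can depend on which library touches OpenMP
first.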








On Sun, Dec 8, 2019 at 2:21 AM Lausen, Leonard 
wrote:

> Thanks Pedro and Chris for your responses.
>
> After further investigation I find:
>
> 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> caused by any incompatibility between gomp and llvm / intel omp. Rather
> it's
> simply a problem of llvm / intel omp. See my comment to the issue for the
> methodology to arrive at this claim.
>
> 2) Regarding the assertion failure when compiling with (llvm)
> 3rdparty/openmp,
> it can be fixed by updating the by now 2 years old llvm openmp code to the
> newest released version. I went ahead and opened a PR
> https://github.com/apache/incubator-mxnet/pull/17012
>
> Based on the investigation described in 1), I think Chris is right that the
> assertion failure is not due to some interaction between gomp and llvm omp.
> However, I'm not sure about Chris's suggestion that the assertion failure
> is due
> to a bug in MXNet. In fact, the failure goes away when updating the llvm
> openmp
> code. So I think it's just due to a bug in the 2 years old code.
>
> @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> contentious. Thus let's do it via lazy consensus (72 hours) or just
> approve the
> PR and merge it.
>
> Please also take a look at my comment at #14979 and let everyone know if
> you see
> any option to fix the bug while keeping 3rdparty/openmp. As this bug
> affects an
> important use-case, I believe we need to remove 3rdparty/openmp from the
> CMake
> build as long as we don't find a solution for making #14979 work with
> 3rdparty/openmp.
>
> In fact, removing 3rdparty/openmp will then match the current Makefile
> setup
> that according to my understanding is used to build the nightly releases
> used by
> the majority of developers. Ie. most users actually don't use the CMake
> build
> with 3rdparty/openmp. You can consider rescinding your veto on removing
> 3rdparty/openmp after reading through the evidence in that issue. If you
> don't
> provide any evidence for why the methodology/conclusion in #14979 is
> flawed, I
> will assume your previous veto is void based on Apache Voting rule as it
> lacks
> technical justification and in any case was motivated by the assertion
> issue,
> which I agree with you, is likely not due to gomp / omp interaction.
>
> Thank you
> Leonard
>
>
> On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> > Stop disseminating false information:
> >
> > https://github.com/apache/incubator-mxnet/issues/14979

Re: [apache/incubator-mxnet] Failed OpenMP assertion when loading MXNet compiled with DEBUG=1 (#10856)

2019-12-07 Thread Chris Olivier
If it is really a problem, then it would be prioritized. All the necessary
info is in that issue (and I already mentioned just yesterday or today on
that ticket what it was again), and it's like I was talking to no one; as
it has been, simply an immediate revert to “remove the library”. In the
time wasted on all this, it could have been resolved 100 times over.


I can remove just about every bug from mxnet by turning off ALL of the
features in CMakeLists.txt. No features, no bugs. This is roughly
equivalent to the approach that has been taken so far for 1.5 years, which
is not good engineering practice, and a suggestion that I am surprised to
see championed by a committer.


Here’s another example:


Not too long ago (maybe 8 months?) there was a crash at shutdown in debug
mode in tcmalloc (the gperf version of malloc, which is similar to
jemalloc), with an error message about a bad pointer passed to free() or
something like that.  At the time, I didn’t know what caused it and so I
did not block its removal.


Fast-forward to about two months ago, when I saw the same error in a
different code base. Since it was happening to me, I was in a position to
debug it, so I did, and found that the same small static library was
linked into two different shared objects, and occasionally (depending upon
link order, I presume) a global string variable was created and destroyed
twice: when linking, both shared objects’ c-runtime init functions had the
same name, so they mapped to the same startup routine and global data
address, and when both shared objects initialized, they called the same
address. This caused both a memory leak, because the first startup string
allocation was discarded by the second call to the constructor, and, at
shutdown, an assert in tcmalloc, because the second pointer allocated was
freed twice.  When tcmalloc was removed, the assert went away but the bug,
to the best of my knowledge, is still there.  If I knew then what I know
now, I would have asked for the bug to be fixed rather than removing
tcmalloc.  Not because of a love for tcmalloc, but because something was
telling you there is a bug, and the bug should be fixed, because if you
just hide the bug (comment out the assert) then it’s likely to cause other
(harder to track down) problems later. So now that bug is probably still
there, causing who-knows-what random crashes or undefined behavior.


This is the kind of root-causing that should be done, rather than
effectively commenting out the assert. I believe we should insist on the
highest standards. I understand that if a person does CI all day and they
find something they can do via CI (i.e. turn off a feature) which makes the
problem go away, then they might feel compelled to champion that option.
Like the saying goes, “When you have a hammer in your hand, everything
looks like a nail”.


But this is not always the best solution for the project. There is a bug,
and it should be fixed, because commenting out the assert just hides the
bug from plain view; the bug remains.  Or sufficient evidence otherwise
should be provided.


-Chris



On Sat, Dec 7, 2019 at 8:06 AM Leonard Lausen 
wrote:

> It appears to me that this issue only occurs when having multiple openmp
> libraries at runtime. I don't understand why we need to support this
> use-case. MKL-DNN works with whatever openmp runtime is provided by the
> compiler [1].
> If you think this use-case is important, please give some more reasoning.
> If you convince me I'm happy to help to root-cause it.
>
> Otherwise I suggest we follow the simplistic approach of using the
> compiler's openmp runtime. If any specific openmp runtime is needed, then
> we can compile with the associated compiler (GCC, LLVM, Intel Compiler).
>


Re: Please remove conflicting Open MP version from CMake builds

2019-12-07 Thread Chris Olivier
-1

MKLDNN removed iomp5 for licensing issues.
No bugs have actually been traced to the use of llvm openmp -- only an
assert caused by an actual bug in mxnet code. There are suitable
workarounds.

Over time, llvm omp has simply been used as a “catch-all” for random
problems that aren’t related at all (such as a getenv race condition in an
atfork call that isn’t even part of an omp parallel region).

The proposal is now, and has always been, roughly equivalent to the idea of
“comment out an assert rather than fix the bug it’s reporting”.

Up until very recently, the Makefile version of mxnet used libiomp5 for
YEARS, and not libgomp, with no issues reported (omp not being built in
debug mode), so the equivalent configuration from CMake mysteriously
causing myriads of problems has questionable merit and smells more like a
hubris situation.

I use tensorflow as well, and it links to libiomp5 rather than libgomp.

If the assert problem is really a problem, the bug being reported would be
prioritized and fixed. It should be fixed regardless. All the time spent by
some CI people trying to remove this could have simply been spent fixing
the actual bug, in a small fraction of the time.


On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard 
wrote:

> I think it's reasonable to assume that the Intel MKLDNN team is an
> "authoritative" source on compilation with OpenMP and OpenMP runtime
> library related issues. Thus I suggest we follow the recommendation of
> the Intel MKLDNN team within the MXNet project.
>
> Looking through the Intel MKLDNN documentation, I find [1]:
>
> > DNNL uses OpenMP runtime library provided by the compiler.
>
> as well as
>
> > it's important to ensure that only one OpenMP runtime is used throughout
> the
> > application. Having more than one OpenMP runtime linked to an executable
> may
> > lead to undefined behavior including incorrect results or crashes.
>
> To keep our project maintainable and error-free, I thus suggest we follow
> DNNL and use the OpenMP runtime library provided by the compiler.
> We have limited resources, and finding the root cause of any bugs
> resulting from linking multiple OpenMP libraries, as is currently done,
> is, in my opinion, not a good use of time. We know it's due to undefined
> behavior and we know it's best practice to use the OpenMP runtime library
> provided by the compiler. So let's just do that.
>
> I think given that MKL-DNN has also adopted the "OpenMP runtime library
> provided
> by the compiler" approach, this issue is not contentious anymore and
> qualifies
> for lazy consensus.
>
> Thus if there is no objection within 72 hours (lazy consensus), let's drop
> the bundled LLVM OpenMP from master [2]. If we find any issues due to
> dropping the bundled LLVM OpenMP, we can always add it back prior to the
> next release.
>
> Best regards
> Leonard
>
> [1]:
>
> https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> (This is the updated reference from Anton's previous comment, based on the
> changes in MKLDNN done in the meantime
> https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> )
> [2]: Alike https://github.com/apache/incubator-mxnet/pull/12160
>
>
> On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > I will try to stay on the sidelines for now, since previous
> > conversations about OMP have not been productive here and I have spent
> > way too much time on this already; I'm not the first one to give up on
> > trying to help with this topic.
> >
> > I would be glad if you guys can work together and find a solution. I
> > will just lay out my understanding of the big picture, hoping that it
> > helps move things forward.
> >
> >
> > Recently the Intel omp library, which seemed to have the best
> > performance of the 3, was removed from MKL.
> >
> > - There are 3 libraries in play: GNU OpenMP, which is shipped with gcc
> > (gomp); LLVM OpenMP in 3rdparty (llvm-omp); and Intel OMP when using
> > MKL, which was recently removed (iomp).
> >
> > - IOMP seems to have the best performance; there are stability issues
> > producing crashes sometimes, but the impact seems relatively small for
> > users and developers. In general, linking with a different OMP version
> > than the one shipped with the compiler is known to cause stability
> > issues, but it's done anyway.
> >
> > - LLVM-OMP is used when building with CMake, not in the PIP releases or
> > when building with Make. It has stability issues: it hangs when running
> > tests in debug mode and produces tons of assertions in debug mode. It
> > might have some small performance gains, but there is no clear-cut data
> > that showcases significant performance gains.
> >
> > - GOMP is the version shipped with GCC and the PIP wheels without MKL;
> > it has no stability problems.
> >
> > As a ballpark, IOMP might give a 10% performance improvement in some
> > cases.
> >
> > We need to document well how users should tune and configure MXNet when
> > 

Re: Proposal to make MKLDNN as default CPU backend

2019-11-19 Thread Chris Olivier
Thanks, Patric. I was just trying to point out that there is currently no
guarantee of deterministic results without MKL, so there’s not necessarily
an expectation of determinism with MKL (i.e. the requirement isn’t being
relaxed).
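To illustrate the seeding effect described in the quoted message below,
here is a minimal toy sketch -- my own example, not MXNet's actual RNG or
dropout code -- where each thread derives its generator from the global
seed plus its thread id, so the output depends on the thread count even
though the seed is fixed (build with g++ -fopenmp):

// Toy sketch only: per-thread seeding makes output thread-count dependent.
#include <cstdio>
#include <random>
#include <vector>
#include <omp.h>

int main() {
  const int kSeed = 42;  // fixed "global" seed
  const int n = 8;
  std::vector<float> draws(n);

#pragma omp parallel
  {
    // Each thread seeds its own generator independently from the global
    // seed and its thread id.
    std::mt19937 gen(kSeed + omp_get_thread_num());
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
#pragma omp for
    for (int i = 0; i < n; ++i) draws[i] = dist(gen);
  }

  // Run with OMP_NUM_THREADS=1 and then OMP_NUM_THREADS=4: the same seed
  // yields a different assignment of random values to elements, because
  // both the per-thread streams and the loop partitioning change.
  for (int i = 0; i < n; ++i) std::printf("%d: %f\n", i, draws[i]);
  return 0;
}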

On Mon, Nov 18, 2019 at 9:38 PM Zhao, Patric  wrote:

> It may be a concern, but a little noise can't affect the final results if
> the algorithm is numerically stable.
> The MKLDNN backend in mxnet-mkl has been used for 2 years and we didn't
> see convergence issues caused by multi-threading.
> In other words, the GPU programming model works well for training, where
> non-determinism also exists due to multiple threads.
>
> Some training accuracy results were pasted in the first PR when MKLDNN
> was integrated:
> https://github.com/apache/incubator-mxnet/pull/8302#issuecomment-359674818
>
> In conclusion, it may happen, but with very low probability. I believe we
> can come up with a solution in case it happens someday.
>
> Thanks,
>
> --Patric
>
>
> > -----Original Message-----
> > From: Chris Olivier 
> > Sent: Tuesday, November 19, 2019 11:51 AM
> > To: dev@mxnet.incubator.apache.org
> > Cc: Tao Lv 
> > Subject: Re: Proposal to make MKLDNN as default CPU backend
> >
> > (for non mkl dropout, for instance)
> >
> > On Mon, Nov 18, 2019 at 7:50 PM Chris Olivier 
> > wrote:
> >
> > > To address the deterministic item, I know for a fact that training
> > > will not be deterministic in some cases where the “parallel random”
> > > class is utilized in parallel threads, such as OMP, if the number of
> > > cores is different, even with the same seed, because threads are
> > > seeded independently and a different number of threads will end up
> > > generating different random number sequences. The Dropout operator is
> > > an example.
> > >
> > > On Mon, Nov 18, 2019 at 6:39 PM Alfredo Luque
> > >  wrote:
> > >
> > >> For AMD CPUs, you’d want to perform validation because now MKL-DNN
> > >> would be enabled by default. Historically, other intel libraries
> > >> (along with the ICC
> > >> compiler) have had performance issues on AMD CPUs. It’s just worth
> > >> double checking to make sure that’s not the case here. Perhaps some
> > >> MKL-DNN authors can chime in though. It’s not sufficient to double
> > >> check that an
> > >> AVX2 package passes tests.
> > >>
> > >> Agreed in the case we’re not releasing ARM binaries.
> > >>
> > >> The reproducibility argument is around the results being numerically
> > >> reproducible. That is, e.g., if I train a model with some fixed set
> > >> of data, some random seed, etc., and then run inference on it, do I
> > >> get the exact same floating point values for the weights and results?
> > >> Does MxNet already offer this without MKL-DNN?
> > >>
> > >> On November 18, 2019 at 6:32:07 PM, Tao Lv (mutou...@gmail.com)
> > wrote:
> > >>
> > >> Regarding the cases listed by Marco:
> > >> - AMD CPU
> > >> From my architecture knowledge, what works on C4 instances (with AVX2
> > >> support) should also work well on m5a, right? I think mxnet-mkl and
> > >> mxnet-cuxxmkl packages have been fully validated on AVX2 machines.
> > >> Also, we didn't perform any validation on AMD CPUs before; why do we
> > >> need to do that this time?
> > >>
> > >> - ARM CPU
> > >> I don't know we're releasing any convenience binaries for ARM CPU.
> > >> This proposal mainly targets those pypi packages.
> > >>
> > >> - Windows
> > >> Already validated by CI. We're also releasing mxnet-mkl packages for
> Win.
> > >>
> > >> - GPU and MKLDNN enabled
> > >> Already validated by CI and mxnet-cuxxmkl packages have been released
> > >> for several versions.
> > >>
> > >> - Fully reproducible results (medical and financial sector requested
> > >> that and we have some flags for cuda) Not sure I understand this
> > >> case. We already have MKL-DNN backend for a while. Functionality and
> > >> correctness of it have been verified by MXNet users.
> > >>
> > >> -tao
> > >>
> > >> On Tue, Nov 19, 2019 at 4:41 AM Marco de Abreu
> > >> 
> > >> wrote:
> > >>
> > >> > Sorry, my intent with the "non-standard" phrase was not about
> > >> > general
> > >> MXNet
> > >

Re: Proposal to make MKLDNN as default CPU backend

2019-11-18 Thread Chris Olivier
(for non-MKL dropout, for instance)

On Mon, Nov 18, 2019 at 7:50 PM Chris Olivier  wrote:

> To address the deterministic item, I know for a fact that training will
> not be deterministic in some cases where the “parallel random” class is
> utilized in parallel threads, such as OMP, if the number of cores is
> different, even with the same seed, because threads are seeded
> independently and different number of threads will end up generating
> different random number sequences. Dropout operator being an example.
>
> On Mon, Nov 18, 2019 at 6:39 PM Alfredo Luque
>  wrote:
>
>> For AMD CPUs, you’d want to perform validation because now MKL-DNN would
>> be
>> enabled by default. Historically, other intel libraries (along with the
>> ICC
>> compiler) have had performance issues on AMD CPUs. It’s just worth double
>> checking to make sure that’s not the case here. Perhaps some MKL-DNN
>> authors can chime in though. It’s not sufficient to double check that an
>> AVX2 package passes tests.
>>
>> Agreed in the case we’re not releasing ARM binaries.
>>
>> The reproducibility argument is around the results being numerically
>> reproducible. That is, eg; if I train a model with some fixed set of data,
>> some random seed, etc. and then run inference on it do I get the exact
>> same
>> floating point values for the weights and results? Does MxNet already
>> offer
>> this without MKL-DNN?
>>
>> On November 18, 2019 at 6:32:07 PM, Tao Lv (mutou...@gmail.com) wrote:
>>
>> Regarding the cases listed by Marco:
>> - AMD CPU
>> From my architecture knowledge, what works on C4 instances (with AVX2
>> support) should also work well on m5a, right? I think mxnet-mkl and
>> mxnet-cuxxmkl packages have been fully validated on AVX2 machines.
>> Also, we didn't perform any validation on AMD CPUs before; why do we need
>> to do that this time?
>>
>> - ARM CPU
>> I don't think we're releasing any convenience binaries for ARM CPU. This
>> proposal mainly targets those pypi packages.
>>
>> - Windows
>> Already validated by CI. We're also releasing mxnet-mkl packages for Win.
>>
>> - GPU and MKLDNN enabled
>> Already validated by CI and mxnet-cuxxmkl packages have been released for
>> several versions.
>>
>> - Fully reproducible results (medical and financial sector requested that
>> and we have some flags for cuda)
>> Not sure I understand this case. We have had the MKL-DNN backend for a
>> while; its functionality and correctness have been verified by MXNet
>> users.
>>
>> -tao
>>
>> On Tue, Nov 19, 2019 at 4:41 AM Marco de Abreu 
>> wrote:
>>
>> > Sorry, my intent with the "non-standard" phrase was not about general
>> MXNet
>> > but rather from MKLDNNs point of view, considering that it's being
>> > developed by Intel, I assumed that MKLDNN might consider non-intel
>> > use-cases non standard.
>> >
>> > -Marco
>> >
>> > Skalicky, Sam  wrote on Mon., 18 Nov. 2019,
>> > 21:34:
>> >
>> > > Thanks Alfredo, if you can create a GitHub issue with notes/steps we
>> can
>> > > add this to the todo list for integrating with the MXNet CI to test on
>> > m5a
>> > > instances too. Then we can start tracking this on a regular basis. It
>> > would
>> > > be great to actually test on ARM instances now that AWS has A1
>> instances
>> > > too….. I'll add it to the wish list ;-D
>> > >
>> > > Sam
>> > >
>> > > > On Nov 18, 2019, at 12:32 PM, Alfredo Luque <
>> alfredo.lu...@airbnb.com
>> > .INVALID>
>> > > wrote:
>> > > >
>> > > > Happy to run some benchmarks on an AWS m5a instance (Epyc) and first
>> > > > generation AMD Threadripper Gen 1 if someone has something easy to
>> run
>> > > and
>> > > > representative.
>> > > >
>> > > > On November 18, 2019 at 12:29:31 PM, Skalicky, Sam (
>> > > > sska...@amazon.com.invalid) wrote:
>> > > >
>> > > > Thanks, good idea Alfredo. Are you able to help test on AMD CPUs?
>> Or
>> > is
>> > > > there someone else in the mxnet dev@ community who can help?
>> > > >
>> > > > Sam
>> > > >
>> > > >> On Nov 18, 2019, at 12:27 PM, Alfredo Luque
>> > > >  wrote:
>> > > >>
>> > > >> Verifying that there isn’t a slow

Re: Proposal to make MKLDNN as default CPU backend

2019-11-18 Thread Chris Olivier
To address the deterministic item, I know for a fact that training will not
be deterministic in some cases where the “parallel random” class is
utilized in parallel threads, such as OMP, if the number of cores is
different, even with the same seed, because threads are seeded
independently and different number of threads will end up generating
different random number sequences. Dropout operator being an example.
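
To make the effect concrete, here is a minimal sketch in plain Python (not
MXNet's actual C++/OMP code; the base-seed-plus-thread-id scheme and the
static item-to-thread mapping are illustrative assumptions):

import numpy as np

def parallel_random(base_seed, num_threads, num_items):
    # Each "thread" seeds its own generator independently from the shared
    # base seed, mirroring a per-thread parallel RNG.
    rngs = [np.random.RandomState(base_seed + tid) for tid in range(num_threads)]
    out = np.empty(num_items)
    for i in range(num_items):
        tid = i % num_threads  # static assignment of work items to threads
        out[i] = rngs[tid].rand()
    return out

a = parallel_random(42, num_threads=4, num_items=8)
b = parallel_random(42, num_threads=8, num_items=8)
print(np.array_equal(a, b))  # False: same seed, different thread count

Even with the same base seed, changing the thread count changes which
generator serves which work item, so the outputs differ.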

On Mon, Nov 18, 2019 at 6:39 PM Alfredo Luque
 wrote:

> For AMD CPUs, you’d want to perform validation because now MKL-DNN would be
> enabled by default. Historically, other intel libraries (along with the ICC
> compiler) have had performance issues on AMD CPUs. It’s just worth double
> checking to make sure that’s not the case here. Perhaps some MKL-DNN
> authors can chime in though. It’s not sufficient to double check that an
> AVX2 package passes tests.
>
> Agreed in the case we’re not releasing ARM binaries.
>
> The reproducibility argument is around the results being numerically
> reproducible. That is, eg; if I train a model with some fixed set of data,
> some random seed, etc. and then run inference on it do I get the exact same
> floating point values for the weights and results? Does MxNet already offer
> this without MKL-DNN?
>
> On November 18, 2019 at 6:32:07 PM, Tao Lv (mutou...@gmail.com) wrote:
>
> Regarding the cases listed by Marco:
> - AMD CPU
> From my architecture knowledge, what works on C4 instances (with AVX2
> support) should also work well on m5a, right? I think mxnet-mkl and
> mxnet-cuxxmkl packages have been fully validated on AVX2 machines.
> Also, we didn't perform any validation on AMD CPUs before; why do we need
> to do that this time?
>
> - ARM CPU
> I don't think we're releasing any convenience binaries for ARM CPU. This
> proposal mainly targets those pypi packages.
>
> - Windows
> Already validated by CI. We're also releasing mxnet-mkl packages for Win.
>
> - GPU and MKLDNN enabled
> Already validated by CI and mxnet-cuxxmkl packages have been released for
> several versions.
>
> - Fully reproducible results (medical and financial sector requested that
> and we have some flags for cuda)
> Not sure I understand this case. We have had the MKL-DNN backend for a
> while; its functionality and correctness have been verified by MXNet
> users.
>
> -tao
>
> On Tue, Nov 19, 2019 at 4:41 AM Marco de Abreu 
> wrote:
>
> > Sorry, my intent with the "non-standard" phrase was not about general
> MXNet
> > but rather from MKLDNNs point of view, considering that it's being
> > developed by Intel, I assumed that MKLDNN might consider non-intel
> > use-cases non standard.
> >
> > -Marco
> >
> > Skalicky, Sam  wrote on Mon., 18 Nov. 2019,
> > 21:34:
> >
> > > Thanks Alfredo, if you can create a GitHub issue with notes/steps we
> can
> > > add this to the todo list for integrating with the MXNet CI to test on
> > m5a
> > > instances too. Then we can start tracking this on a regular basis. It
> > would
> > > be great to actually test on ARM instances now that AWS has A1
> instances
> > > too….. I'll add it to the wish list ;-D
> > >
> > > Sam
> > >
> > > > On Nov 18, 2019, at 12:32 PM, Alfredo Luque <
> alfredo.lu...@airbnb.com
> > .INVALID>
> > > wrote:
> > > >
> > > > Happy to run some benchmarks on an AWS m5a instance (Epyc) and first
> > > > generation AMD Threadripper Gen 1 if someone has something easy to
> run
> > > and
> > > > representative.
> > > >
> > > > On November 18, 2019 at 12:29:31 PM, Skalicky, Sam (
> > > > sska...@amazon.com.invalid) wrote:
> > > >
> > > > Thanks, good idea Alfredo. Are you able to help test on AMD CPUs? Or
> > is
> > > > there someone else in the mxnet dev@ community who can help?
> > > >
> > > > Sam
> > > >
> > > >> On Nov 18, 2019, at 12:27 PM, Alfredo Luque
> > > >  wrote:
> > > >>
> > > >> Verifying that there isn’t a slowdown on AMD CPUs (eg; Ryzen / Epyc)
> > > > would
> > > >> definitely make sense as a requirement. It seems odd to classify
> that
> > as
> > > > a
> > > >> “nonstandard” use case.
> > > >>
> > > >> On November 18, 2019 at 12:20:33 PM, Skalicky, Sam (
> > > >> sska...@amazon.com.invalid) wrote:
> > > >>
> > > >> Thanks Patric & team for your work over the years to make MXNet fast
> > > with
> > > >> MKLDNN!
> > > >>
> > > >> I think it would be great to make MKLDNN enabled by default. We will
> > > need
> > > >> to continue producing variants without MKLDNN for those who don’t
> want
> > > it
> > > >> (Marco enumerated some use cases). How do you propose to identify
> the
> > > pip
> > > >> wheels with/without MKLDNN? Previously we had: mxnet-mkl and
> > > > mxnet-cu101mkl
> > > >> with MKLDNN. If the plain “mxnet” pip wheel now contains MKLDNN what
> > do
> > > > you
> > > >> propose we call the build without MKLDNN? mxnet-nomkl?
> > > >>
> > > >> Thanks!
> > > >> Sam
> > > >>
> > > >>> On Nov 18, 2019, at 11:08 AM, Marco de Abreu <
> > marco.g.ab...@gmail.com>
> > > >> wrote:
> > > >>>
> > > >>> Hi Patric,
> > > >>>
> > 

Re: [DISCUSS] Remove amalgamation

2019-09-28 Thread Chris Olivier
> > > >
> >> > > > > > >
> >> > > > > > > On Tuesday, September 10, 2019, Anirudh Subramanian
> >> > > > > > >  >> > > > > > > >
> >> > > > > > > wrote:
> >> > > > > > > > Hi Pedro,
> >> > > > > > > >
> >> > > > > > > > I don't see anything "destructive" with Chris asking for
> >> > > > > > > > justification for you calling something "hacky". The only
> >> > > > > > > > email in this thread where I see ad hominems and
> >> > > > > > > > disrespectful comments is your email.
> >> > > > > > > >
> >> > > > > > > > On Sat, Sep 7, 2019, 10:18 PM Pedro Larroy
> >> > > > > > > >  >> > > > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > >> Apache mentors should have a look at this recurring
> >> > > > > > > >> harassment and destructive behavior, which demotivates
> >> > > > > > > >> contributions, and take action. It takes only one bad
> >> > > > > > > >> apple to ruin a community.
> >> > > > > > > >>
> >> > > > > > > >> The mobile solution that is known to work as of now is
> >> > > > > > > >> cross compiling with "ci/build.py -p build.android_armv8"
> >> > > > > > > >> or "build.android_armv7". The only advantage of
> >> > > > > > > >> amalgamation is to provide a smaller binary, which we
> >> > > > > > > >> could accomplish with the C preprocessor.
> >> > > > > > > >>
> >> > > > > > > >> My technical contributions speak for themselves,
> >> > > > > > > >> including porting MXNet to Android and ARM and helping
> >> > > > > > > >> many users run MXNet on Jetson, Raspberry Pi and Android,
> >> > > > > > > >> amongst many other topics. I have never been
> >> > > > > > > >> disrespectful to anyone. I'm entitled to my own technical
> >> > > > > > > >> opinions about amalgamation or any other piece of code
> >> > > > > > > >> whatsoever; that's no personal disrespect to anyone and
> >> > > > > > > >> perfectly valid. If you are not interested in this
> >> > > > > > > >> project anymore, do us all a favor and stop trolling and
> >> > > > > > > >> being toxic. If you want my respect, step up your
> >> > > > > > > >> technical contributions, be positive and encourage
> >> > > > > > > >> others; this includes commits, which I haven't seen for
> >> > > > > > > >> many months. Please be positive and constructive. This
> >> > > > > > > >> scorched-earth attitude only reflects badly on you. I'm
> >> > > > > > > >> certainly not interested in your ad hominems or
> >> > > > > > > >> unasked-for technical advice, which, to be honest, shows
> >> > > > > > > >> poor judgment and ignorance. Myself and others have come
> >> > > > > > > >> up with numbers, graphs, metrics and arguments and have
> >> > > > > > > >> been met with dismissal, trolling and sea-lioning. I have
> >> > > > > > > >> received your insults via public and private channels
> >> > > > > > > >> (such as LinkedIn), as have others. This is not ok and
> >> > > > > > > >> has to stop. If you have something personal against me or
> >> > > > > > > >> against your former employer, this is not the right place
> >> > > > > > > >> or forum.
> >> > > > > > > >>
> >> > > > > > > >> On Fri, Sep 6, 2019 at 3:56 PM Chris Olivier
> >> > > > > > > >> 
> >> > > > > > > >> wrote:
> >> > > > > > > >>
> >> > > > > > > >> > Hi Pedro,
> >> > > > > > > >> >
> >> > > > > > > >> > While I was not involved with amalgamation or its
> >> > > > > > > >> > development in any way, can you please refrain from
> >> > > > > > > >> > referring to the work of others as a "hacky solution"?
> >> > > > > > > >> > This is derogatory slang and the statement was not
> >> > > > > > > >> > supported with any justification for such name-calling.
> >> > > > > > > >> > Someone spent a good deal of time on this solution at
> >> > > > > > > >> > some point in time and I am sure it worked for its
> >> > > > > > > >> > purpose at that time -- I think it was used in the
> >> > > > > > > >> > original javascript port as well, actually -- and it is
> >> > > > > > > >> > disrespectful to call their efforts "hacky".  Please
> >> > > > > > > >> > respect what came before.
> >> > > > > > > >> >
> >> > > > > > > >> > Thanks for understanding,
> >> > > > > > > >> >
> >> > > > > > > >> > -Chris
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > On Fri, Sep 6, 2019 at 3:07 PM Pedro Larroy <
> >> > > > > > > >> pedro.larroy.li...@gmail.com>
> >> > > > > > > >> > wrote:
> >> > > > > > > >> >
> >> > > > > > > >> > > Hi
> >> > > > > > > >> > >
> >> > > > > > > >> > > I would like to propose to remove amalgamation from
> >> > > > > > > >> > > MXNet and CI; users have reported that they couldn't
> >> > > > > > > >> > > use it successfully on Android, and instead they were
> >> > > > > > > >> > > able to use the cross-compiled docker build
> >> > > > > > > >> > > successfully.
> >> > > > > > > >> > >
> >> > > > > > > >> > > Any reason why we shouldn't remove this hacky
> >> > > > > > > >> > > solution?
> >> > > > > > > >> > >
> >> > > > > > > >> > > Pedro.
> >> > > > > > > >> > >
> >> > > > > > > >> >
> >> > > > > > > >>
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>


[Announcement] New Committer - Anirudh Acharya

2019-09-26 Thread Chris Olivier
Hi all,

Please join me in welcoming Anirudh Acharya as a new committer of Apache
MXNet (incubating)!

Anirudh Acharya has been contributing to the MXNet project for a year and a
half now and has made several improvements to the MXNet R package. He
continues to contribute by adding optimizers, fixing tests and actively
providing feedback on PRs, and has a good understanding of building
operators in MXNet and of the architecture in general.

Welcome, Anirudh!


Re: mxnet ctrl-c

2019-09-23 Thread Chris Olivier
thanks for the response. I'm trying to write to a “gpu” (sort of) with
mxnet, and it sometimes takes a long time; having no way to interrupt it
gracefully is “bad”. I will try to experiment with chaining back to the
python side through normal signal channels. if i can get it to work i’ll
post a PR.
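
As a rough sketch of that experiment (plain Python signal handling, not an
existing MXNet API; do_work is a hypothetical stand-in for one bounded
chunk of the long device operation):

import signal

interrupted = False

def on_sigint(signum, frame):
    # Python only runs this handler between bytecodes, so a single long
    # C/GPU call must return (or be split into chunks) before the
    # interrupt can be observed.
    global interrupted
    interrupted = True

signal.signal(signal.SIGINT, on_sigint)

def long_operation(chunks):
    for chunk in chunks:
        if interrupted:
            raise KeyboardInterrupt("stopped between chunks")
        do_work(chunk)  # hypothetical bounded unit of device work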

On Mon, Sep 23, 2019 at 12:00 PM Anirudh Subramanian 
wrote:

> Currently I don't see any special handling in the code base for this. We
> have atexit.register which invokes MXInvokeShutdown from python but that
> doesn't work for signals.
>
> Anirudh
>
> On Sun, Sep 22, 2019 at 7:30 PM Chris Olivier 
> wrote:
>
> > question: how does gluon handle ctrl-c during a “long” imperative
> operation
> > where the GIL hasn’t been released yet? is it supposed to be caught in
> c++
> > or python or no special handling for it at the moment?
> >
>


mxnet ctrl-c

2019-09-22 Thread Chris Olivier
question: how does gluon handle ctrl-c during a “long” imperative operation
where the GIL hasn’t been released yet? is it supposed to be caught in c++
or python or no special handling for it at the moment?


Re: [DISCUSS] Assigning Issues

2019-09-12 Thread Chris Olivier
+1

On Thu, Sep 12, 2019 at 1:18 PM Zach Kimberg 
wrote:

> We had a discussion a while back about trying to improve the way we handle
> issues by assigning them to users who are working on them. However, the
> discussion ended because issues could only be assigned to those with write
> access (committers).
>
> I just came across a new Github feature where issues can also be assigned
> to any user who comments on an issue [
> https://github.blog/2019-06-25-assign-issues-to-issue-commenters/].
> Committers can then assign anyone from the community who wants to work on
> the issue so we can track which issues are assigned and which ones are not.
> Assigned community members still have an "Unassign me" button if they no
> longer wish to work on an issue. It is also possible to assign up to 10
> people to an issue (or PR).
>
> Given this, I think we should try to assign issues when possible to those
> working on them. What does everyone think?
>
> Zach
>


Re: [DISCUSS] Remove amalgamation

2019-09-06 Thread Chris Olivier
Hi Pedro,

While I was not involved with amalgamation or its development in any way,
can you please refrain from referring to the work of others as a "hacky
solution"?  This is derogatory slang and the statement was not supported
with any justification for such name-calling.  Someone spent a good deal of
time on this solution at some point in time and I am sure it worked for its
purpose at that time -- I think it was used in the original javascript port
as well, actually -- and it is disrespectful to call their efforts
"hacky".  Please respect what came before.

Thanks for understanding,

-Chris


On Fri, Sep 6, 2019 at 3:07 PM Pedro Larroy 
wrote:

> Hi
>
> I would like to propose to remove amalgamation from MXNet and CI; users
> have reported that they couldn't use it successfully on Android, and
> instead they were able to use the cross-compiled docker build successfully.
>
> Any reason why we shouldn't remove this hacky solution?
>
> Pedro.
>


Re: [DISCUSS] Apache MXNet: Path to graduation

2019-08-30 Thread Chris Olivier
Thank you for the details, Hen.

On Thu, Aug 29, 2019 at 10:18 PM Hen  wrote:

> Amazon. Amazon created the brand. They own the first repository to use the
> term in this context ( https://github.com/gluon-api ). There was some
> involvement from Microsoft, so Microsoft's opinion may also be relevant.
> Gluon is not an Apache Software Foundation nor Apache MXNet brand.
>
> Unless it was very recent, I don't believe there have been any trademark
> registrations. If Amazon would prefer Apache control the Gluon naming, I
> think the simplest 'act' to make that so would be to move the gluon-api
> repository over to ASF control.
>
> Hen
>
> On Thu, Aug 29, 2019 at 8:27 AM Chris Olivier 
> wrote:
>
> > Who is the gluon “Brand Owner”?
> >
> > On Tue, Aug 27, 2019 at 10:43 AM Chris Olivier 
> > wrote:
> >
> > > Who is the gluon "brand owner"?
> > >
> > > On Tue, Aug 27, 2019 at 10:13 AM Qing Lan  wrote:
> > >
> > >> Hi Lieven,
> > >>
> > >> Thanks for your comments. After the discussion with several committers
> > >> and contributors offline, we agreed that there is space for
> > >> improvement.
> > >>
> > >>
> > >>   1.  About the Gluon naming
> > >>
> > >> As we know, Gluon was born with a unique API design pattern and
> > >> gradually became the dominant Python front end for MXNet. I would
> > >> suggest discussing further with the brand owner to see whether there
> > >> could be deeper integration with MXNet. MXNet itself has become more
> > >> popular with this frontend; we lean on the strong community and
> > >> improve our product by consuming its feedback.
> > >>
> > >>  2. Diversity of the PMC
> > >> Currently, we have 40 PMC members from different companies, like
> > >> Amazon, Uber, NVIDIA, ByteDance and a lot more. We are trying to grow
> > >> the number and invite individuals from different companies as well as
> > >> research institutes.
> > >>
> > >> 3. Release rotation
> > >> Historically, most of the releases were done by the Amazon side.
> > >> Currently, we are moving to rotate this responsibility to
> > >> contributors/committers from outside Amazon.
> > >>
> > >> 4. Committers from different firms/institutions should do real
> > >> work on MXNet
> > >> I can tell from the issues/PRs/RFCs they submitted, and indeed we
> > >> should encourage the committers who are less active to get involved
> > >> in MXNet contribution.
> > >>
> > >> Thanks,
> > >> Qing
> > >>
> > >> 
> > >> From: Lieven Govaerts 
> > >> Sent: Saturday, August 10, 2019 5:59
> > >> To: dev@mxnet.incubator.apache.org 
> > >> Cc: d...@mxnet.apache.org 
> > >> Subject: Re: [DISCUSS] Apache MXNet: Path to graduation
> > >>
> > >> Hi Qing,
> > >>
> > >> as a user and ASF member observing this project:
> > >>
> > >> On Sat, 10 Aug 2019 at 01:44, Qing Lan  wrote:
> > >>
> > >> > Hi All,
> > >> >
> > >> > I would like to start a thread to discuss the graduation of Apache
> > >> > MXNet. From my time working in the community, I have seen great
> > >> > improvement in most of the areas where we work to make MXNet a
> > >> > better place. We keep tracking all of the issues users raise and
> > >> > reviewing PRs. We follow the Apache Way to release the package in
> > >> > the official repository.
> > >> >
> > >> >
> > >> In terms of code, documentation and visibility, this project is
> > >> certainly in a healthy state. I see a lot of interest from companies
> > >> and people, and the community is growing... As a user, that gives me
> > >> confidence my time invested in this product is well spent.
> > >>
> > >>
> > >> > In 2017, Apache MXNet joined the Apache Incubator. I think now is a
> > >> > good time to review the path to graduating MXNet and move forward
> > >> > with it.
> > >> > Please feel free to share your thoughts on graduation and space for
> > >> > improvement.
> > >> >
> > >> >
> > >> If I may share one observation: I don't see the community working a
> lot
> > on
> > >> non-code topics. One example that I personally find important is the
> > >> discussion of the Gluon brand. People have expressed confusion about
> > >> how the name is used by multiple non-ASF projects; the MXNet team
> > >> finds the Gluon name very valuable, yet the discussion on how to
> > >> protect the name and decide on acceptable use by other projects has
> > >> stalled [1]. I suggest you make a decision on this topic before you
> > >> go for graduation.
> > >>
> > >> regards,
> > >>
> > >> Lieven
> > >>
> > >> [1]
> > >>
> > >>
> >
> https://mail-archives.apache.org/mod_mbox/mxnet-dev/201903.mbox/%3ccac_cu1gi+3s6ob48kt0x5wta4oxdum8uq9tmnyku2ujyaya...@mail.gmail.com%3e
> > >>
> > >>
> > >>
> > >>
> > >> > You can find more about graduation policy in here:
> > >> > https://incubator.apache.org/guides/graduation.html
> > >> >
> > >> > Thanks,
> > >> > Qing
> > >> >
> > >>
> > >
> >
>


Re: [DISCUSS] Apache MXNet: Path to graduation

2019-08-29 Thread Chris Olivier
Who is the gluon “Brand Owner”?

On Tue, Aug 27, 2019 at 10:43 AM Chris Olivier 
wrote:

> Who is the gluon "brand owner"?
>
> On Tue, Aug 27, 2019 at 10:13 AM Qing Lan  wrote:
>
>> Hi Lieven,
>>
>> Thanks for your comments. After the discussion with several committers
>> and contributors offline, we agreed that there is space for improvement.
>>
>>
>>   1.  About the Gluon naming
>>
>> As we know, Gluon was born with a unique API design pattern and gradually
>> became the dominant Python front end for MXNet. I would suggest discussing
>> further with the brand owner to see whether there could be deeper
>> integration with MXNet. MXNet itself has become more popular with this
>> frontend; we lean on the strong community and improve our product by
>> consuming its feedback.
>>
>>  2. Diversity of the PMC
>> Currently, we have 40 PMC members from different companies, like Amazon,
>> Uber, NVIDIA, ByteDance and a lot more. We are trying to grow the number
>> and invite individuals from different companies as well as research
>> institutes.
>>
>> 3. Release rotation
>> Historically, most of the releases were done by the Amazon side.
>> Currently, we are moving to rotate this responsibility to
>> contributors/committers from outside Amazon.
>>
>> 4. Committers from different firms/institutions should do real work
>> on MXNet
>> I can tell from the issues/PRs/RFCs they submitted, and indeed we should
>> encourage the committers who are less active to get involved in MXNet
>> contribution.
>>
>> Thanks,
>> Qing
>>
>> 
>> From: Lieven Govaerts 
>> Sent: Saturday, August 10, 2019 5:59
>> To: dev@mxnet.incubator.apache.org 
>> Cc: d...@mxnet.apache.org 
>> Subject: Re: [DISCUSS] Apache MXNet: Path to graduation
>>
>> Hi Qing,
>>
>> as a user and ASF member observing this project:
>>
>> On Sat, 10 Aug 2019 at 01:44, Qing Lan  wrote:
>>
>> > Hi All,
>> >
>> > I would like to start a thread to discuss the graduation of Apache
>> > MXNet. From my time working in the community, I have seen great
>> > improvement in most of the areas where we work to make MXNet a better
>> > place. We keep tracking all of the issues users raise and reviewing
>> > PRs. We follow the Apache Way to release the package in the official
>> > repository.
>> >
>> >
>> In terms of code, documentation and visibility, this project is certainly
>> in a healthy state. I see a lot of interest from companies and people, and
>> the community is growing... As a user, that gives me confidence my time
>> invested in this product is well spent.
>>
>>
>> > In 2017, Apache MXNet joined the Apache Incubator. I think now is a
>> > good time to review the path to graduating MXNet and move forward with
>> > it.
>> > Please feel free to share your thoughts on graduation and space for
>> > improvement.
>> >
>> >
>> If I may share one observation: I don't see the community working a lot on
>> non-code topics. One example that I personally find important is the
>> discussion of the Gluon brand. People have expressed confusion about how
>> the name is used by multiple non-ASF projects; the MXNet team finds the
>> Gluon name very valuable, yet the discussion on how to protect the name and
>> decide on acceptable use by other projects has stalled [1]. I suggest you
>> make a decision on this topic before you go for graduation.
>>
>> regards,
>>
>> Lieven
>>
>> [1]
>>
>> https://mail-archives.apache.org/mod_mbox/mxnet-dev/201903.mbox/%3ccac_cu1gi+3s6ob48kt0x5wta4oxdum8uq9tmnyku2ujyaya...@mail.gmail.com%3e
>>
>>
>>
>>
>> > You can find more about graduation policy in here:
>> > https://incubator.apache.org/guides/graduation.html
>> >
>> > Thanks,
>> > Qing
>> >
>>
>


Re: [DISCUSS] Apache MXNet: Path to graduation

2019-08-27 Thread Chris Olivier
Who is the gluon "brand owner"?

On Tue, Aug 27, 2019 at 10:13 AM Qing Lan  wrote:

> Hi Lieven,
>
> Thanks for your comments. After the discussion with several committers and
> contributors offline, we agreed that there is space for improvement.
>
>
>   1.  About the Gluon naming
>
> As we know, Gluon was born with a unique API design pattern and gradually
> became the dominant Python front end for MXNet. I would suggest discussing
> further with the brand owner to see whether there could be deeper
> integration with MXNet. MXNet itself has become more popular with this
> frontend; we lean on the strong community and improve our product by
> consuming its feedback.
>
>  2. Diversity of the PMC
> Currently, we have 40 PMC members from different companies, like Amazon,
> Uber, NVIDIA, ByteDance and a lot more. We are trying to grow the number
> and invite individuals from different companies as well as research
> institutes.
>
> 3. Release rotation
> Historically, most of the releases were done by the Amazon side.
> Currently, we are moving to rotate this responsibility to
> contributors/committers from outside Amazon.
>
> 4. Committers from different firms/institutions should do real work on
> MXNet
> I can tell from the issues/PRs/RFCs they submitted, and indeed we should
> encourage the committers who are less active to get involved in MXNet
> contribution.
>
> Thanks,
> Qing
>
> 
> From: Lieven Govaerts 
> Sent: Saturday, August 10, 2019 5:59
> To: dev@mxnet.incubator.apache.org 
> Cc: d...@mxnet.apache.org 
> Subject: Re: [DISCUSS] Apache MXNet: Path to graduation
>
> Hi Qing,
>
> as a user and ASF member observing this project:
>
> On Sat, 10 Aug 2019 at 01:44, Qing Lan  wrote:
>
> > Hi All,
> >
> > I would like to start a thread to discuss the graduation of Apache
> > MXNet. From my time working in the community, I have seen great
> > improvement in most of the areas where we work to make MXNet a better
> > place. We keep tracking all of the issues users raise and reviewing PRs.
> > We follow the Apache Way to release the package in the official
> > repository.
> >
> >
> In terms of code, documentation and visibility, this project is certainly
> in a healthy state. I see a lot of interest from companies and people, and
> the community is growing... As a user, that gives me confidence my time
> invested in this product is well spent.
>
>
> > In 2017, Apache MXNet joined the Apache Incubator. I think now is a good
> > time to review the path to graduating MXNet and move forward with it.
> > Please feel free to share your thoughts on graduation and space for
> > improvement.
> >
> >
> If I may share one observation: I don't see the community working a lot on
> non-code topics. One example that I personally find important is the
> discussion of the Gluon brand. People have expressed confusion about how
> the name is used by multiple non-ASF projects; the MXNet team finds the
> Gluon name very valuable, yet the discussion on how to protect the name and
> decide on acceptable use by other projects has stalled [1]. I suggest you
> make a decision on this topic before you go for graduation.
>
> regards,
>
> Lieven
>
> [1]
>
> https://mail-archives.apache.org/mod_mbox/mxnet-dev/201903.mbox/%3ccac_cu1gi+3s6ob48kt0x5wta4oxdum8uq9tmnyku2ujyaya...@mail.gmail.com%3e
>
>
>
>
> > You can find more about graduation policy in here:
> > https://incubator.apache.org/guides/graduation.html
> >
> > Thanks,
> > Qing
> >
>


Re: CI and PRs

2019-08-23 Thread Chris Olivier
 apt-key add r.gpg
> > > >> OK
> > > >> + add-apt-repository 'deb [arch=amd64,i386]
> > > >> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
> > > >> + apt-get update
> > > >> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
> > > >>
> > > >> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
> > > >>  wrote:
> > > >> >
> > > >> > Also, I forgot, another workaround is that I added the -R flag to
> > the
> > > >> build
> > > >> > logic (build.py) so the container is not rebuilt for manual use.
> > > >> >
> > > >> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <
> > > >> pedro.larroy.li...@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > >
> > > >> > > Hi Aaron.
> > > >> > >
> > > >> > > > As Marco explained, if you are on master the cache usually
> > > >> > > > works; there are two issues that I have observed:
> > > >> > > >
> > > >> > > > 1 - Docker doesn't automatically pull the base image (e.g.
> > > >> > > > ubuntu:16.04), so if the cached base used in the FROM
> > > >> > > > statement becomes outdated, your caching won't work. Running
> > > >> > > > docker pull ubuntu:16.04, or pulling the base images used by
> > > >> > > > the containers, helps with this.
> > > >> > > >
> > > >> > > > 2 - There's another situation where the above doesn't help,
> > > >> > > > which seems to be an unidentified issue with the docker cache:
> > > >> > > > https://github.com/docker/docker.github.io/issues/8886
> > > >> > > >
> > > >> > > > We can get a short-term workaround for #1 by explicitly
> > > >> > > > pulling bases from the script, but I think docker should do
> > > >> > > > it when using --cache-from, so maybe contributing a patch to
> > > >> > > > docker would be the best approach.
> > > >> > >
> > > >> > > Pedro
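
For illustration, a sketch of the short-term workaround Pedro describes,
assuming nothing about build.py's internals: scan the Dockerfile for FROM
lines and docker pull each base before building, so cached layers keyed on
the base image stay valid.

import re
import subprocess

def pull_base_images(dockerfile_path):
    # Pull every image named in a FROM line so the local copy is current
    # before "docker build --cache-from ..." runs. Multi-stage aliases
    # referencing earlier stages are ignored for brevity.
    with open(dockerfile_path) as f:
        for line in f:
            match = re.match(r"\s*FROM\s+(\S+)", line, re.IGNORECASE)
            if match and match.group(1).lower() != "scratch":
                subprocess.run(["docker", "pull", match.group(1)], check=True)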
> > > >> > >
> > > >> > > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <
> > > >> aaron.s.mark...@gmail.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > >> When you create a new Dockerfile and use that on CI, it doesn't
> > > seem
> > > >> > >> to cache some of the steps... like this:
> > > >> > >>
> > > >> > >> Step 13/15 : RUN /work/ubuntu_docs.sh
> > > >> > >>  ---> Running in a1e522f3283b
> > > >> > >>  [91m+ echo 'Installing dependencies...'
> > > >> > >> + apt-get update
> > > >> > >>  [0mInstalling dependencies.
> > > >> > >>
> > > >> > >> Or this
> > > >> > >>
> > > >> > >> Step 4/13 : RUN /work/ubuntu_core.sh
> > > >> > >>  ---> Running in e7882d7aa750
> > > >> > >>  [91m+ apt-get update
> > > >> > >>
> > > >> > >> I get it if I was changing those scripts, but then I'd think
> > > >> > >> it should cache after running it once... but, no.
> > > >> > >>
> > > >> > >>
> > > >> > >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <
> > > >> marco.g.ab...@gmail.com>
> > > >> > >> wrote:
> > > >> > >> >
> > > >> > >> > Do I understand it correctly that you are saying that the
> > Docker
> > > >> cache
> > > >> > >> > doesn't work properly and regularly reinstalls dependencies?
> Or
> > > do
> > > >> you
> > > >> > >> mean
> > > >> > >> > that you only have cache misses when you modify the
> > dependencies
> > > -
> > > >> which
> > > >> > >> > would be expected?
> > > >> > >> >
> > > >> > >> > -Marco
> > > >> > >> >

Re: MxNet/XLA

2019-08-18 Thread Chris Olivier
I will take the silence as a “no”.  Well, that’s a shame, then.

On Thu, Aug 15, 2019 at 4:32 PM Chris Olivier 
wrote:

> Tensorflow and pytorch seem to have XLA compatibility (pytorch probably is
> not as stable as tensorflow in this respect, I imagine), and maybe others
> that I don’t know about directly. Is anyone currently working on XLA
> support for mxnet?
>
>
> -Chris
>


MxNet/XLA

2019-08-15 Thread Chris Olivier
Tensorflow and pytorch seem to have XLA compatibility (pytorch probably is
not as stable as tensorflow in this respect, I imagine), and maybe others
that I don’t know about directly. Is anyone currently working on XLA
support for mxnet?


-Chris


Re: CI and PRs

2019-08-14 Thread Chris Olivier
I see it done daily now, and while I can’t share all the details, it’s not
an incredibly complex thing, and involves not much more than nfs/efs
sharing and remote ssh commands.  All it takes is a little ingenuity and
some imagination.
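
For concreteness, a sketch of that shape of solution: a pool of hosts
sharing the build tree over an NFS/EFS mount, each running a slice of the
test list over ssh. The host names, the mount point and the pytest
invocation are all assumptions for illustration.

import subprocess

HOSTS = ["worker-1", "worker-2", "worker-3"]  # pool sharing one NFS/EFS mount

def run_sharded(test_files):
    # Round-robin the test files across hosts and run each shard remotely;
    # the build tree lives on the shared mount, so nothing is copied.
    shards = {host: test_files[i::len(HOSTS)] for i, host in enumerate(HOSTS)}
    procs = [
        subprocess.Popen(["ssh", host,
                          "cd /mnt/shared/mxnet && python -m pytest "
                          + " ".join(files)])
        for host, files in shards.items() if files
    ]
    return all(p.wait() == 0 for p in procs)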

On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy 
wrote:

> Sounds good in theory. I think there are complex details with regard to
> resource sharing during parallel execution. Still I think both ways can be
> explored. I think some tests run for unreasonably long times for what they
> are doing. We already scale parts of the pipeline horizontally across
> workers.
>
>
> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier 
> wrote:
>
> > +1
> >
> > Rather than remove tests (which doesn’t scale as a solution), why not
> scale
> > them horizontally so that they finish more quickly? Across processes or
> > even on a pool of machines that aren’t necessarily the build machine?
> >
> > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu  >
> > wrote:
> >
> > > With regards to time, I'd rather have us spend a bit more time on
> > > maintenance than have somebody run into an error that could've been
> > > caught with a test.
> > >
> > > I mean, our Publishing pipeline for Scala GPU has been broken for quite
> > > some time now, but nobody noticed that. Basically my stance on that
> > matter
> > > is that as soon as something is not blocking, you can also just
> > deactivate
> > > it since you don't have a forcing function in an open source project.
> > > People will rarely come back and fix the errors of some nightly test
> that
> > > they introduced.
> > >
> > > -Marco
> > >
> > > Carin Meier  wrote on Wed., 14 Aug. 2019, 21:59:
> > >
> > > > If a language binding test is failing for an unimportant reason,
> > > > then it is too brittle and needs to be fixed (we have fixed some of
> > > > these with the Clojure package [1]).
> > > > But in general, if we think of the MXNet project as one project that
> > > > spans all the language bindings, then we want to know if some
> > > > fundamental code change is going to break a downstream package.
> > > > I can't speak for all the high level package binding maintainers, but
> > I'm
> > > > always happy to pitch in to provide code fixes to help the base PR
> get
> > > > green.
> > > >
> > > > The time costs to maintain such a large CI project obviously needs to
> > be
> > > > considered as well.
> > > >
> > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > >
> > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > pedro.larroy.li...@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > reasonable.
> > > > > The only question is that when a binding such as R, Perl or Clojure
> > > > fails,
> > > > > some devs are a bit confused about how to fix it since they are
> > > > > not familiar with the testing tools and the language.
> > > > >
> > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier  >
> > > > wrote:
> > > > >
> > > > > > Great idea Marco! Anything that you think would be valuable to
> > share
> > > > > would
> > > > > > be good. The duration of each node in the test stage sounds like
> a
> > > good
> > > > > > start.
> > > > > >
> > > > > > - Carin
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > marco.g.ab...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > we record a bunch of metrics about run statistics (down to the
> > > > duration
> > > > > > of
> > > > > > > every individual step). If you tell me which ones you're
> > > particularly
> > > > > > > interested in (probably total duration of each node in the test
> > > > stage),
> > > > > > I'm
> > > > > > > happy to provide them.
> > > > > > >
> > > > > > > Dimension

Re: CI and PRs

2019-08-14 Thread Chris Olivier
+1

Rather than remove tests (which doesn’t scale as a solution), why not scale
them horizontally so that they finish more quickly? Across processes or
even on a pool of machines that aren’t necessarily the build machine?

On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu 
wrote:

> With regards to time, I'd rather have us spend a bit more time on
> maintenance than have somebody run into an error that could've been caught
> with a test.
>
> I mean, our Publishing pipeline for Scala GPU has been broken for quite
> some time now, but nobody noticed that. Basically my stance on that matter
> is that as soon as something is not blocking, you can also just deactivate
> it since you don't have a forcing function in an open source project.
> People will rarely come back and fix the errors of some nightly test that
> they introduced.
>
> -Marco
>
> Carin Meier  wrote on Wed., 14 Aug. 2019, 21:59:
>
> > If a language binding test is failing for an unimportant reason, then it
> > is too brittle and needs to be fixed (we have fixed some of these with
> > the Clojure package [1]).
> > But in general, if we think of the MXNet project as one project that
> > spans all the language bindings, then we want to know if some
> > fundamental code change is going to break a downstream package.
> > I can't speak for all the high level package binding maintainers, but I'm
> > always happy to pitch in to provide code fixes to help the base PR get
> > green.
> >
> > The time costs to maintain such a large CI project obviously needs to be
> > considered as well.
> >
> > [1] https://github.com/apache/incubator-mxnet/pull/15579
> >
> > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > From what I have seen Clojure is 15 minutes, which I think is
> reasonable.
> > > The only question is that when a binding such as R, Perl or Clojure
> > fails,
> > > some devs are a bit confused about how to fix it since they are not
> > > familiar with the testing tools and the language.
> > >
> > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier 
> > wrote:
> > >
> > > > Great idea Marco! Anything that you think would be valuable to share
> > > would
> > > > be good. The duration of each node in the test stage sounds like a
> good
> > > > start.
> > > >
> > > > - Carin
> > > >
> > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > marco.g.ab...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > we record a bunch of metrics about run statistics (down to the
> > duration
> > > > of
> > > > > every individual step). If you tell me which ones you're
> particularly
> > > > > interested in (probably total duration of each node in the test
> > stage),
> > > > I'm
> > > > > happy to provide them.
> > > > >
> > > > > Dimensions are (in hierarchical order):
> > > > > - job
> > > > > - branch
> > > > > - stage
> > > > > - node
> > > > > - step
> > > > >
> > > > > Unfortunately I don't have the possibility to export them since we
> > > store
> > > > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > > >
> > > > > Best regards,
> > > > > Marco
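
For reference, a sketch of how a step duration with those dimensions might
be recorded through the standard boto3 CloudWatch client; the namespace and
metric name here are assumptions, not the project's actual values.

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_step_duration(job, branch, stage, node, step, seconds):
    # One datapoint per step, keyed by the hierarchical dimensions above.
    cloudwatch.put_metric_data(
        Namespace="MXNetCI",  # assumed namespace
        MetricData=[{
            "MetricName": "StepDuration",
            "Dimensions": [
                {"Name": "Job", "Value": job},
                {"Name": "Branch", "Value": branch},
                {"Name": "Stage", "Value": stage},
                {"Name": "Node", "Value": node},
                {"Name": "Step", "Value": step},
            ],
            "Value": seconds,
            "Unit": "Seconds",
        }],
    )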
> > > > >
> > > > > Carin Meier  wrote on Wed., 14 Aug. 2019,
> > > > > 19:43:
> > > > >
> > > > > > I would prefer to keep the language binding in the PR process.
> > > Perhaps
> > > > we
> > > > > > could do some analytics to see how much each of the language
> > bindings
> > > > is
> > > > > > contributing to overall run time.
> > > > > > If we have some metrics on that, maybe we can come up with a
> > > guideline
> > > > of
> > > > > > how much time each should take. Another possibility is leverage
> the
> > > > > > parallel builds more.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > pedro.larroy.li...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Carin.
> > > > > > >
> > > > > > > That's a good point, all things considered would your
> preference
> > be
> > > > to
> > > > > > keep
> > > > > > > the Clojure tests as part of the PR process or in Nightly?
> > > > > > > Some options are having notifications here or in slack. But if
> we
> > > > think
> > > > > > > breakages would go unnoticed maybe is not a good idea to fully
> > > remove
> > > > > > > bindings from the PR process and just streamline the process.
> > > > > > >
> > > > > > > Pedro.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > carinme...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Before any binding tests are moved to nightly, I think we
> need
> > to
> > > > > > figure
> > > > > > > > out how the community can get proper notifications of failure
> > and
> > > > > > success
> > > > > > > > on those nightly runs. Otherwise, I think that breakages
> would
> > go
> > > > > > > > unnoticed.
> > > > > > > >
> > > > > > > > -Carin
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > 

Re: Turning CUDNN on/off in anaconda distribution

2019-07-11 Thread Chris Olivier
the problem is that additional training of a pretrained, relatively simple
vision-type model (not a complex model like resnet, but it has some
convolutions) is converging on CPU but not on GPU — validation does not
converge, anyway.

is there an anaconda or pip package without cudnn, to try without a
rebuild? i don’t think a rebuild is an option.
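
As far as I know there is no runtime switch once cuDNN is compiled in; the
closest thing is checking what a given binary was built with. A sketch,
assuming an MXNet recent enough (1.5+) to ship the runtime
feature-detection API:

import mxnet as mx

features = mx.runtime.Features()
print(features.is_enabled("CUDNN"))  # True if this binary was built with cuDNN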

On Thu, Jul 11, 2019 at 10:22 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Having runtime loadable / plugable operators might help with this.
>
> On Thu, Jul 11, 2019 at 10:20 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Once it's compiled, the forward/backward etc. kernel implementations are
> > hard-coded to use cuDNN.  In theory we could support raw CUDA in addition
> > to cuDNN but the additional CUDA kernel code would bloat the binary (it
> > targets several GPU types).
> >
> > On Thu, Jul 11, 2019 at 9:36 AM Chris Olivier 
> > wrote:
> >
> >> Is there an environment variable or some other way to not use CUDNN in
> the
> >> anaconda distribution of mxnet?
> >>
> >
>


Turning CUDNN on/off in anaconda distribution

2019-07-11 Thread Chris Olivier
Is there an environment variable or some other way to not use CUDNN in the
anaconda distribution of mxnet?


Re: [DISCUSS] Make MXNet deploy it's own distribution

2019-07-03 Thread Chris Olivier
Will this be another repo under Apache repo? Is tensorflow java package in
a separate repo?

On Wed, Jul 3, 2019 at 12:46 AM Per da Silva  wrote:

> Hi,
>
> We've started working on something along these lines as part of the CD
> pipeline framework. The idea is to compile and test the libmxnet.so  (both
> statically and dynamically linked) for the different variants (cpu, gpu,
> mkl, etc.) then have the different mxnet frontends (python, Julia, scala,
> etc) just wrap around the library.
>
> I've been on long term sick leave and haven't been able to move forward
> with this, although I have an open PR that kicks off this work:
> https://github.com/apache/incubator-mxnet/pull/15051 - I welcome everyone
> to take a look. It's the first of a series of PRs to automate the
> distribution of the python (pip and docker) packages. Instead of using
> maven, we have opted to use S3. But this decision can be revisited.
>
> We also want to distribute what we termed "runtime" docker images. Docker
> images containing the dynamically linked mxnet library and all of the
> runtime dependencies (examples: https://hub.docker.com/r/mxnet/runtime).
> This could facilitate the packaging and distribution of docker images for
> the different frontends.
>
> Cheers,
>
> Per
>
> On Wed., 3 Jul. 2019, 8:47 am Qing Lan,  wrote:
>
> > In that case, the answer is yes. The Scala package will be published in
> > one version with a variety of backend package choices. Users can easily
> > attach and detach different MXNet versions. However, the Scala package
> > cannot run without a backend.
> >
> > Another key advantage of this design will be broader support for
> > different implementations such as JavaCPP. Users will be able to
> > implement their customized MXNet frontends on top of the native library.
> >
> > Thanks,
> > Qing
> >
> > 
> > From: Sheng Zha 
> > Sent: Tuesday, July 2, 2019 22:14
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: [DISCUSS] Make MXNet deploy it's own distribution
> >
> > Does it mean that the scala binding of mxnet will be an independent
> > package that doesn’t directly depend on the native package, and user
> > projects need to declare dependency on both the scala binding and one of
> > the native packages?
> >
> > -sz
> >
> > > On Jul 2, 2019, at 5:50 PM, Frank Liu  wrote:
> > >
> > > Currently, MXNet is built along with different language bindings such
> > > as Scala.
> > >
> > > The libmxnet.so files are bundled within the Scala jar package.
> > >
> > > It would be nice to distribute the libmxnet.so library independently
> > > on Maven, so the Scala package can choose which native library to use.
> > >
> > > Here is the design document on cwiki:
> > >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Make+MXNet+deploy+it%27s+own+distribution
> > >
> > > Thanks,
> > >
> > > Frank
> >
>


Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-29 Thread Chris Olivier
for batch norm, I mean. max*

On Sat, Jun 29, 2019 at 12:34 PM Chris Olivier 
wrote:

> what’s with the mac memory usage being 2x in 1.4? I am not sure where
> the number is coming from (if it’s my profiler code, I wouldn’t consider it
> terribly meaningful), but it is the same everywhere else, so it kind of
> sticks out.
>
> On Thu, Jun 27, 2019 at 3:36 PM sandeep krishnamurthy <
> sandeep.krishn...@gmail.com> wrote:
>
>> Hello Ciyong/Pedro,
>>
>> Ran operator benchmarks on 1.4.1 and 1.5.0.rc2. (Not complete, doesn’t
>> cover all MXNet operators, not presented in best possible way, still WIP)
>>
>> https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50
>>
>> The following operators look slower in 1.5 compared to 1.4.1:
>> - BatchNorm
>> - Pooling
>> - FullyConnected
>> - batch_dot
>> - Dot
>> - broadcast_mul
>> - log_softmax
>> and few other operators
>>
>> Also, several operators run a lot faster on 1.5 compared to 1.4.1, for
>> example Convolution, flatten, elementwise operators etc. So it looks like
>> a few operators have regressed noticeably; however, due to other
>> operators' performance improvements, the end effect is not that
>> significant, which hides a lot of the regression. We need a more detailed
>> per-operator performance analysis. We will not be able to do this for the
>> current release; we should have a more concrete way of determining such
>> performance regressions before the next release.
>>
>> Setup:
>> 1.5 => Build from source (head of 1.5.rc2 tag), built with MKLDNN
>> 1.4.1 => PyPi mxnet-mkl==1.4.1
>> Machine: C5.18X
>> No explicit environment variable were set
>> Operator benchmark code -
>> https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
>>
>> Best,
>> Sandeep
>>
>>
>> On Thu, Jun 27, 2019 at 10:42 AM Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> wrote:
>>
>> > I will try to run a few benchmarks in a bare metal instance tonight to
>> > remove virtualization variance for the measurements and provide some
>> > numbers.
>> >
>> > Please propose a set of models / examples that would be desirable to
>> > run before the release and provide a link to an easy to run script
>> > with instructions so we can validate the release better.
>> >
>> > Thank you.
>> >
>> > On Thu, Jun 27, 2019 at 10:01 AM Lai Wei  wrote:
>> > >
>> > > Dear @dev,
>> > >
>> > > I'm cancelling the vote for the cached op fix:
>> > >
>> > > https://github.com/apache/incubator-mxnet/pull/15298
>> > >
>> > > As for the possible cpu training regression, it looks like it is not
>> > > a blocker for now.
>> > >
>> > > I will start a new rc2 vote, please help to validate.
>> > >
>> > > Thanks!
>> > >
>> > >
>> > > On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong 
>> > wrote:
>> > >
>> > > > Hi Pedro,
>> > > >
>> > > > I was able to reproduce a similar result (v1.5 is ~5.6% slower
>> > > > than v1.4; I was using 18 cores for computing) with your script on
>> > > > C5.18xlarge. But I needed to bind the cores with the command below
>> > > > when running the script (without setting the env variables, I got
>> > > > a close time (<1%) with v1.5 and v1.4):
>> > > > export
>> KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
>> > > > export OMP_NUM_THREADS=18
>> > > >
>> > > > Did you set any env variables during running?
>> > > >
>> > > > The performance result I got as below:
>> > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
>> > > > real12m10.856s
>> > > > user234m49.576s
>> > > > sys 4m38.044s
>> > > >
>> > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
>> > > > real12m52.140s
>> > > > user246m30.740s
>> > > > sys 5m8.188s
>> > > >
>> > > > As I looked at the profiling data, most of the ops have the same
>> > > > perf between v1.4 and v1.5. But some ops like "_backward_BatchNorm"
>> > > > and "Pooling" are ~1.37x slower o

Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc1

2019-06-29 Thread Chris Olivier
what’s with the mac memory usage being 2x in 1.4? I am not sure where
the number is coming from (if it’s my profiler code, I wouldn’t consider it
terribly meaningful), but it is the same everywhere else, so it kind of
sticks out.

On Thu, Jun 27, 2019 at 3:36 PM sandeep krishnamurthy <
sandeep.krishn...@gmail.com> wrote:

> Hello Ciyong/Pedro,
>
> Ran operator benchmarks on 1.4.1 and 1.5.0.rc2. (Not complete, doesn’t
> cover all MXNet operators, not presented in best possible way, still WIP)
>
> https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50
>
> The following operators look slower in 1.5 compared to 1.4.1:
> - BatchNorm
> - Pooling
> - FullyConnected
> - batch_dot
> - Dot
> - broadcast_mul
> - log_softmax
> and few other operators
>
> Also, several operators run a lot faster on 1.5 compared to 1.4.1, for
> example Convolution, flatten, elementwise operators etc. So it looks like
> a few operators have regressed noticeably; however, due to other
> operators' performance improvements, the end effect is not that
> significant, which hides a lot of the regression. We need a more detailed
> per-operator performance analysis. We will not be able to do this for the
> current release; we should have a more concrete way of determining such
> performance regressions before the next release.
>
> Setup:
> 1.5 => Build from source (head of 1.5.rc2 tag), built with MKLDNN
> 1.4.1 => PyPi mxnet-mkl==1.4.1
> Machine: C5.18X
> No explicit environment variable were set
> Operator benchmark code -
> https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
>
> Best,
> Sandeep
>
>
> On Thu, Jun 27, 2019 at 10:42 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > I will try to run a few benchmarks in a bare metal instance tonight to
> > remove virtualization variance for the measurements and provide some
> > numbers.
> >
> > Please propose a set of models / examples that would be desirable to
> > run before the release and provide a link to an easy to run script
> > with instructions so we can validate the release better.
> >
> > Thank you.
> >
> > On Thu, Jun 27, 2019 at 10:01 AM Lai Wei  wrote:
> > >
> > > Dear @dev,
> > >
> > > I'm cancelling the vote for the cached op fix:
> > >
> > > https://github.com/apache/incubator-mxnet/pull/15298
> > >
> > > As for the possible cpu training regression, it looks like it is not
> > > a blocker for now.
> > >
> > > I will start a new rc2 vote, please help to validate.
> > >
> > > Thanks!
> > >
> > >
> > > On Thu, Jun 27, 2019 at 10:06 PM Chen, Ciyong 
> > wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > I was able to reproduce a similar result (v1.5 is ~5.6% slower than
> > > > v1.4; I was using 18 cores for computing) with your script on
> > > > C5.18xlarge. But I needed to bind the cores with the command below
> > > > when running the script (without setting the env variables, I got a
> > > > close time (<1%) with v1.5 and v1.4):
> > > > export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
> > > > export OMP_NUM_THREADS=18
> > > >
> > > > Did you set any env variables during running?
> > > >
> > > > The performance result I got as below:
> > > > 1) 1.4.1.rc0 (1a7199691f5cbc6012bb53eecbf884bed5ae6590)
> > > > real12m10.856s
> > > > user234m49.576s
> > > > sys 4m38.044s
> > > >
> > > > 2) 1.5.0.rc1 (4d9667121ae6fb643f2a02ab15e25231ed756cde)
> > > > real12m52.140s
> > > > user246m30.740s
> > > > sys 5m8.188s
> > > >
> > > > As I looked at the profiling data, most of the ops have the same
> > > > perf between v1.4 and v1.5. But some ops like "_backward_BatchNorm"
> > > > and "Pooling" are ~1.37x slower on v1.5 compared with v1.4.
> > > > Will do further analysis on these ops.
> > > >
> > > > Here's the hardware/OS info from my side:
> > > > --Python Info--
> > > > Version  : 3.6.8
> > > > Compiler : GCC 7.3.0
> > > > Build: ('default', 'Dec 30 2018 01:22:34')
> > > > Arch : ('64bit', '')
> > > > Pip Info---
> > > > Version  : 19.0.3
> > > > Directory:
> > > >
> /home/ubuntu/anaconda3/envs/perf-mxnet/lib/python3.6/site-packages/pip
> > > > --MXNet Info---
> > > > Version  : 1.5.0
> > > > Directory: /home/ubuntu/ws/incubator-mxnet/python/mxnet
> > > > Hashtag not found. Not installed from pre-built package.
> > > > --System Info--
> > > > Platform : Linux-4.4.0-1085-aws-x86_64-with-debian-stretch-sid
> > > > system   : Linux
> > > > node : ip-172-31-32-129
> > > > release  : 4.4.0-1085-aws
> > > > version  : #96-Ubuntu SMP Tue Jun 11 09:08:32 UTC 2019
> > > > --Hardware Info--
> > > > machine  : x86_64
> > > > processor: x86_64
> > > > Architecture:  x86_64
> > > > CPU op-mode(s):32-bit, 64-bit
> > > > Byte Order:Little Endian
> > > > CPU(s):72
> > > > On-line CPU(s) list:   0-71
> > > > Thread(s) per 

Re: OMP

2019-06-25 Thread Chris Olivier
1) I don't see how that code could cause reentrancy problems in omp. It
doesn't make any OMP calls at all.  Still doesn't look related to me.
Setting an environment variable probably doesn't even do anything, because:
  a) It probably doesn't check the environment variable except at initial
startup
  b) Even if it did, whether this code ran before or after the OMP init
code would be nondeterministic
  c) It for sure doesn't check the environment variable every time it hits
an omp region.  That would be ridiculously expensive, and checking the OMP
source code confirms it doesn't.  You can't affect the OMP behavior at
arbitrary points in time by setting the "OMP_NUM_THREADS" environment variable.
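A minimal standalone sketch of points (a) and (c) for anyone who wants to
verify them: on mainstream runtimes OMP_NUM_THREADS is read once at startup,
so a later setenv() is ignored while the documented API call is honored
(build with e.g. g++ -fopenmp; treat the exact behavior as runtime-dependent):

#include <omp.h>
#include <cstdio>
#include <cstdlib>

static void report(const char *label) {
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      std::printf("%s: %d threads\n", label, omp_get_num_threads());
  }
}

int main() {
  report("initial team");               // sized from OMP_NUM_THREADS at startup
  setenv("OMP_NUM_THREADS", "2", 1);    // too late: the runtime is already up
  report("after setenv");               // typically unchanged
  omp_set_num_threads(2);               // the supported runtime mechanism
  report("after omp_set_num_threads");  // now 2
  return 0;
}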




On Tue, Jun 25, 2019 at 1:20 PM Pedro Larroy 
wrote:

> Nobody claimed that the original lockup has to do with OMP, but the
> fix caused re-entrancy into OMP initialization as explained below. So
> I agree with your statement that the bug that using pthread_atfork was
> fixing is not related with OMP, but the fix is causing interactions
> with OMP as described above.
>
> Pedro.
>
> On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier 
> wrote:
> >
> > The call stacks there are mostly associated with the execution engine
> > threads, which are not OMP threads.  That lockup doesn't look to me to be
> > related to OMP   -- the execution engine uses its own thread pool logic
> --
> > I'm pretty familiar with that part of the code.  Unless I am missing one
> --
> > can you point to the one that looks OMP-related?
> >
> >
> > On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > Thanks for digging that out Kellen. That's good info so maybe it would
> > > be good to rework the fix with the info you provided and remove the
> > > pthread_atfork handlers.
> > > Do you think setting the device would avoid the problem seen on the
> > > backtrace you provided?  specifically here:
> > >
> > >
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24
> > >
> > > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
> > >  wrote:
> > > >
> > > > I remember at the time we also had a read through of this blog post,
> > > > but to us the code looked like it was following the advice:
> > > >
> > >
> https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
> > > >
> > > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > I remember this hang as well, it was pretty hard to reproduce
> IIRC.  I
> > > > > believe the stacks for the hang are here:
> > > > >
> > >
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
> > > and
> > > > > the trick was we could only debug it up to the point that we hit:
> > > > >
> > > > > #0  0x7fec6df1ba4f in futex_wait (private=0, expected=1,
> > > > > futex_word=0x7fec60843758)
> > > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> > > > > #1  futex_wait_simple (private=0, expected=1,
> > > futex_word=0x7fec60843758)
> > > > > at ../sysdeps/nptl/futex-internal.h:135
> > > > > #2  __pthread_once_slow (once_control=0x7fec60843758,
> > > > > init_routine=0x7fec605f38f0)
> > > > > at pthread_once.c:105
> > > > > ...
> > > > > #6  0x7fec6061c577 in cudaSetDevice () from
> > > > > /usr/local/cuda/lib64/libcudart.so.9.0
> > > > >
> > > > > because the code in libcudart is obviously closed source we
> couldn't
> > > dig
> > > > > into what threading work was going on when we called cudaSetDevice.
> > > > >
> > > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <
> > > pedro.larroy.li...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> If you check initialize.cc we seem to be explicitly disabling that
> > > > >> behaviour in pthread_atfork, which seems to cause thread
> contention
> > > > >> during multiprocessing. Why do we need this major advantage for
> the
> > > > >> library if that's the case?
> > > > >>
> > > > >> Related PRs:
> > > > >>
> > > > >> https://github.com/apache/incubator-mxnet/pull/10820
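For reference, the pattern recommended by the NVIDIA blog post linked in the
quoted discussion above is that every host thread calls cudaSetDevice()
before any other CUDA call. A minimal sketch (illustrative only; error
handling trimmed):

#include <cuda_runtime.h>
#include <thread>
#include <vector>

static void worker(int dev) {
  cudaSetDevice(dev);         // first CUDA call made by this thread
  void *buf = nullptr;
  cudaMalloc(&buf, 1 << 20);  // subsequent calls target the chosen device
  cudaFree(buf);
}

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  std::vector<std::thread> pool;
  for (int d = 0; d < n; ++d) pool.emplace_back(worker, d);
  for (auto &t : pool) t.join();
  return 0;
}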

Re: OMP

2019-06-25 Thread Chris Olivier
The call stacks there are mostly associated with the execution engine
threads, which are not OMP threads.  That lockup doesn't look to me to be
related to OMP   -- the execution engine uses its own thread pool logic --
I'm pretty familiar with that part of the code.  Unless I am missing one --
can you point to the one that looks OMP-related?


On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy 
wrote:

> Thanks for digging that out Kellen. That's good info so maybe it would
> be good to rework the fix with the info you provided and remove the
> pthread_atfork handlers.
> Do you think setting the device would avoid the problem seen on the
> backtrace you provided?  specifically here:
>
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24
>
> On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland
>  wrote:
> >
> > I remember at the time we also had a read through of this blog post, but
> > to us the code looked like it was following the advice:
> >
> https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
> >
> > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> > > believe the stacks for the hang are here:
> > >
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
> and
> > > the trick was we could only debug it up to the point that we hit:
> > >
> > > #0  0x7fec6df1ba4f in futex_wait (private=0, expected=1,
> > > futex_word=0x7fec60843758)
> > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> > > #1  futex_wait_simple (private=0, expected=1,
> futex_word=0x7fec60843758)
> > > at ../sysdeps/nptl/futex-internal.h:135
> > > #2  __pthread_once_slow (once_control=0x7fec60843758,
> > > init_routine=0x7fec605f38f0)
> > > at pthread_once.c:105
> > > ...
> > > #6  0x7fec6061c577 in cudaSetDevice () from
> > > /usr/local/cuda/lib64/libcudart.so.9.0
> > >
> > > because the code in libcudart is obviously closed source we couldn't
> dig
> > > into what threading work was going on when we called cudaSetDevice.
> > >
> > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > >> If you check initialize.cc we seem to be explicitly disabling that
> > >> behaviour in pthread_atfork, which seems to cause thread contention
> > >> during multiprocessing. Why do we need this major advantage for the
> > >> library if that's the case?
> > >>
> > >> Related PRs:
> > >>
> > >> https://github.com/apache/incubator-mxnet/pull/10820
> > >> https://github.com/apache/incubator-mxnet/issues/14396
> > >>
> > >> The original code was authored in this PR:
> > >>
> > >> https://github.com/apache/incubator-mxnet/pull/8677
> > >>
> > >> I actually remember this fix; it was done during a release as the cuda
> > >> runtime was forking and the engine was being re-entered. If that
> > >> situation is not happening anymore, the fix might not be needed any longer.
> > >> I don't think we know why there was a fork inside cuda, so
> > >> the code has grown around a fix for an issue whose root cause was
> > >> not understood, plus the side effects this fix caused afterwards.
> > >>
> > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
> > >> the link above, no libgomp.
> > >>
> > >> I didn't try the Make build.
> > >>
> > >> I would refactor the code linked above and stop using pthread_atfork,
> > >> since OMP assumes it won't be initialized twice, but needs to be very
> > >> well tested to make sure it doesn't cause bugs or affect the fixes
> > >> done on the linked PRs above.
> > >>
> > >> Pedro.
> > >>
> > >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier 
> > >> wrote:
> > >> >
> > >> > one major advantage of intel/llvm omp is that it spawns a new thread
> > >> pool
> > >> > after fork if a thread pool was already created. this is so that omp
> > >> can be
> > >> > used in the forked processes. libgomp doesn’t do this so it’ll just
> > >> lock up
> > >> > if you try to do omp in the forked process.

Re: OMP

2019-06-25 Thread Chris Olivier
That doesn't look like it has anything to do with omp

On Mon, Jun 24, 2019 at 6:40 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> believe the stacks for the hang are here:
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600
> and
> the trick was we could only debug it up to the point that we hit:
>
> #0  0x7fec6df1ba4f in futex_wait (private=0, expected=1,
> futex_word=0x7fec60843758)
> at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> #1  futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758)
> at ../sysdeps/nptl/futex-internal.h:135
> #2  __pthread_once_slow (once_control=0x7fec60843758,
> init_routine=0x7fec605f38f0)
> at pthread_once.c:105
> ...
> #6  0x7fec6061c577 in cudaSetDevice () from
> /usr/local/cuda/lib64/libcudart.so.9.0
>
> because the code in libcudart is obviously closed source we couldn't dig
> into what threading work was going on when we called cudaSetDevice.
>
> On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy  >
> wrote:
>
> > If you check initialize.cc we seem to be explicitly disabling that
> > behaviour in pthread_atfork, which seems to cause thread contention
> > during multiprocessing. Why do we need this major advantage for the
> > library if that's the case?
> >
> > Related PRs:
> >
> > https://github.com/apache/incubator-mxnet/pull/10820
> > https://github.com/apache/incubator-mxnet/issues/14396
> >
> > The original code was authored in this PR:
> >
> > https://github.com/apache/incubator-mxnet/pull/8677
> >
> > I actually remember this fix; it was done during a release as the cuda
> > runtime was forking and the engine was being re-entered. If that
> > situation is not happening anymore, the fix might not be needed any longer.
> > I don't think we know why there was a fork inside cuda, so
> > the code has grown around a fix for an issue whose root cause was
> > not understood, plus the side effects this fix caused afterwards.
> >
> > My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
> > the link above, no libgomp.
> >
> > I didn't try the Make build.
> >
> > I would refactor the code linked above and stop using pthread_atfork,
> > since OMP assumes it won't be initialized twice, but needs to be very
> > well tested to make sure it doesn't cause bugs or affect the fixes
> > done on the linked PRs above.
> >
> > Pedro.
> >
> > On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier 
> > wrote:
> > >
> > > one major advantage of intel/llvm omp is that it spawns a new thread
> pool
> > > after fork if a thread pool was already created. this is so that omp
> can
> > be
> > > used in the forked processes. libgomp doesn’t do this so it’ll just
> lock
> > up
> > > if you try to do omp in the forked process.
> > >
> > > is your build linking libgomp as well?
> > >
> > > standard mkl build (from Makefile) uses same omp library. are there
> > > problems with that build?
> > >
> > > what changes need to be made to make the assertion not fire?
> > >
> > > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > There's an assertion which is easily reproducible, and also there's a
> > > > crash including a core dump; the latter is not easy to reproduce for me
> > > > in different environments. I have also seen mxnet getting stuck
> > > > without progressing with this build configuration and using no CPU at
> > > > all when running unit tests.
> > > >
> > > > In my view, the root cause of the assertion is that we are
> re-entering
> > > > OMP initialization when spawning threads on the following code
> through
> > > > pthread_atfork
> > > >
> > > >
> >
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> > > >
> > > > This causes double initialization of the OMP engine, including the
> > > > assertion which you are asking about,  and I suspect some additional
> > > > overhead. That's the shady forking part you are asking for.
> > > >
> > > > A question for you: What is the cause of runtime differences between
> > > > OMP runtimes? Shouldn't the implementation overhead diminish as
> > > > threads run longer?
> > > >
> > > > Pedro.
> 
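To make the mechanism Pedro describes concrete, here is a minimal sketch of
an atfork child handler that touches the OpenMP runtime and thereby forces
it to re-initialize inside the forked process. The handler body is an
illustrative assumption, not the actual initialize.cc code:

#include <omp.h>
#include <pthread.h>
#include <unistd.h>

static void on_fork_child() {
  // Touching the OMP runtime here makes it rebuild its thread pool in the
  // freshly forked child -- the re-initialization path discussed above.
  omp_set_num_threads(omp_get_num_procs());
}

int main() {
  pthread_atfork(nullptr, nullptr, on_fork_child);  // prepare/parent unused
  #pragma omp parallel
  { }                                // parent creates its thread pool
  if (fork() == 0) {                 // child handler ran during fork()
    #pragma omp parallel
    { }                              // child uses the re-initialized pool
    _exit(0);
  }
  return 0;
}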

Re: OMP

2019-06-24 Thread Chris Olivier
one major advantage of intel/llvm omp is that it spawns a new thread pool
after fork if a thread pool was already created. this is so that omp can be
used in the forked processes. libgomp doesn’t do this so it’ll just lock up
if you try to do omp in the forked process.

is your build linking libgomp as well?

standard mkl build (from Makefile) uses same omp library. are there
problems with that build?

what changes need to be made to make the assertion not fire?
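A minimal reproduction sketch of the fork behavior described above; whether
the child proceeds or locks up depends on which OMP runtime the binary links,
so treat the outcome as platform-dependent rather than guaranteed:

#include <omp.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  #pragma omp parallel
  { }                                  // parent spins up an OMP thread pool

  pid_t pid = fork();
  if (pid == 0) {
    // The child inherits bookkeeping for worker threads that no longer
    // exist; libgomp tends to lock up here, while LLVM/Intel OMP respawns
    // a fresh pool and proceeds.
    #pragma omp parallel
    { std::printf("child worker %d\n", omp_get_thread_num()); }
    _exit(0);
  }
  waitpid(pid, nullptr, 0);            // hangs here if the child locked up
  return 0;
}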

On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy 
wrote:

> There's an assertion which is easily reproducible, and also there's a
> crash including a core dump; the latter is not easy to reproduce for me
> in different environments. I have also seen mxnet getting stuck
> without progressing with this build configuration and using no CPU at
> all when running unit tests.
>
> In my view, the root cause of the assertion is that we are re-entering
> OMP initialization when spawning threads on the following code through
> pthread_atfork
>
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
>
> This causes double initialization of the OMP engine, including the
> assertion which you are asking about,  and I suspect some additional
> overhead. That's the shady forking part you are asking for.
>
> A question for you: What is the cause of runtime differences between
> OMP runtimes? Shouldn't the implementation overhead diminish as
> threads run longer?
>
> Pedro.
>
> On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier 
> wrote:
> >
> > What’s the reason for the assertion failure? btw classifying an assertion
> > failure a “crash” is debatable. As I stated in the original issue a long
> > time ago, it's possible something shady is being done when forking
> > that should be fixed.  The assertion should be root caused.
> >
> >
> >
> > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > Added a dockerfile, and reports of a crash in my local machine when
> > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well.
> > > I couldn't reproduce the crash on my EC2 machine:
> > > Added the backtrace of the crash as well.
> > >
> > > https://github.com/apache/incubator-mxnet/issues/10856
> > >
> > > Dockerfile here:
> > >
> > > https://github.com/larroy/mxnet_omp
> > >
> > > Kind regards.
> > >
> > > Pedro.
> > >
> > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> marco.g.ab...@gmail.com>
> > > wrote:
> > > >
> > > > As already proposed, I think the easiest way to get a common
> > > understanding
> > > > is if we start with a few docker containers. Pedro, would it be
> possible
> > > > for you to wrap your benchmarks into a few containers that will
> produce
> > > > your shown results? That way, we can avoid possible
> misunderstandings and
> > > > also pinpoint the exact parts where people disagree or misunderstood
> each
> > > > other.
> > > >
> > > > -Marco
> > > >
> > > > Pedro Larroy  schrieb am Do., 20. Juni
> > > 2019,
> > > > 21:47:
> > > >
> > > > > I can confirm that we are linking with two versions of omp, I'm
> > > > > gaining more clarity into this topic, but I have still questions,
> the
> > > > > facts that I got so far are the folllowing:
> > > > >
> > > > > * #1: We are linking with two versions of omp, intel's omp and llvm
> > > > > openmp when building with MKL enabled.
> > > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes
> with
> > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > > > > one is used on the PR proposed by Anton).
> > > > >
> > > > > Questions:
> > > > >
> > > > >  * #1 Is it ok to have two versions of openmp linked at the same
> time?
> > > > >  * #2 Which implementation of OMP gives the best performance?  (See
> > > > > total training time of my measurement for a partial answer)
> > > > >  * #3 Should we have a build flag so we can choose the OMP version
> at
> > > > > runtime?
> > > > >  * #4 Which Compiler and build flags did Chris use to get 10x
> slowdown?
> > > > >  * #5 @Stas: is there a script to replicate your benchmarks
> easily? If
> > > > > so could you provide a link?  I th

Re: OMP

2019-06-24 Thread Chris Olivier
 > >  accuracy=0.999687
> > > INFO:root:Epoch[19] Batch [300-400] Speed: 51518.77 samples/sec
> > >  accuracy=0.999844
> > > INFO:root:Epoch[19] Batch [400-500] Speed: 51551.62 samples/sec
> > >  accuracy=0.999844
> > > INFO:root:Epoch[19] Batch [500-600] Speed: 49026.35 samples/sec
> > >  accuracy=0.999844
> > > INFO:root:Epoch[19] Batch [600-700] Speed: 49002.46 samples/sec
> > >  accuracy=0.999375
> > > INFO:root:Epoch[19] Batch [700-800] Speed: 48980.55 samples/sec
> > >  accuracy=0.999687
> > > INFO:root:Epoch[19] Batch [800-900] Speed: 47402.56 samples/sec
> > >  accuracy=0.999844
> > > INFO:root:Epoch[19] Train-accuracy=0.999767
> > > INFO:root:Epoch[19] Time cost=1.259
> > > INFO:root:Epoch[19] Validation-accuracy=0.983181
> > > 755.36user 754.94system 0:35.89elapsed 4207%CPU (0avgtext+0avgdata
> > > 1147008maxresident)k
> > > 0inputs+3112outputs (0major+3568826minor)pagefaults 0swaps
> > >
> > > Let me know what you think.
> > >
> > > Link to the original PR:
> > > https://github.com/apache/incubator-mxnet/pull/12160
> > >
> > > Thanks.
> > >
> > > On Wed, Jun 19, 2019 at 5:35 PM kellen sunderland
> > >  wrote:
> > > >
> > > > "if you’re linking in two then you’re doing something wrong."
> Correct,
> > > > that's one thing I believe we've got consensus on.  So let's call
> that
> > > out
> > > > as a bug to be fixed.
> > > >
> > > > Let's move forward with some reproducible numbers and then discuss
> the
> > > pros
> > > > / cons of which particular OMP implementation we should use.
> > > >
> > > > On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy <
> > > pedro.larroy.li...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Chris
> > > > >
> > > > > I would ask you to have a bit of patience and help us with your
> > > > > experience in this matter. Nobody is ignoring anything, I think we
> are
> > > > > individually gathering feedbacks and trying to understand the
> multiple
> > > > > contributions done to this topic including yours, then go step by
> > > > > step, understand what is going on and run experiments and report
> back
> > > > > to the list or the corresponding github item. It was suggested by
> > > > > Kellen to prepare some containers, this takes effort.
> > > > >
> > > > > Regarding your final comment, most of us also have many other
> things
> > > > > to do and responsibilities even if our daytime jobs might involve
> > > > > MXNet in some form or another. I think that's part of the privilege
> > > > and responsibility of working closely with an open source project and
> > > > > the magic of collaboration across organizations. Let's all be
> patient
> > > > > and take some time to understand and reason about this topic which
> is
> > > > > not simple. Since we decided to step back and gather more data
> let's
> > > > > take time and do it properly.
> > > > >
> > > > > Personally I hope to find time to look again into this issue before
> > > > > the end of the week.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier <
> cjolivie...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > if you’re linking in two then you’re doing something wrong. You
> can
> > > see
> > > > > by
> > > > > > my email yesterday that only one is linked in. This is also the
> case
> > > with
> > > > > > the mkl version built by the Makefile — only the Intel OMP
> library is
> > > > > used
> > > > > > (no libgomp).
> > > > > >
> > > > > > That being said, Do you have clear evidence that using Intel OMP
> is
> > > both
> > > > > > problematic and the situation isn’t fixable?  The burden of
> proof is
> > > on
> > > > > the
> > > > > > ones requesting the change — it is not my responsibility to
> justify
> > > the
> > > > > > current state.  There must be somethi

Re: OMP

2019-06-19 Thread Chris Olivier
if you’re linking in two then you’re doing something wrong. You can see by
my email yesterday that only one is linked in. This is also the case with
the mkl version built by the Makefile — only the Intel OMP library is used
(no libgomp).

That being said, Do you have clear evidence that using Intel OMP is both
problematic and the situation isn’t fixable?  The burden of proof is on the
ones requesting the change — it is not my responsibility to justify the
current state.  There must be something “terrible” and unfixable to justify
a change.  I have seen no proof of this in all this time.

On a side note, I mentioned a couple of things in my email yesterday that
still are not being responded to (they were also ignored in the last
incarnation of this “discussion” — I have enough experience in this matter to
assume “discussion” is a waste of my time, seeing as I am not paid to
“work on” mxnet like y’all are).

-C
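(For anyone who wants to settle the "how many OMP runtimes are linked in"
question on their own build: besides running ldd on libmxnet.so, the loaded
runtimes can be listed from inside a process. A glibc-specific sketch; the
"omp" substring filter is a rough heuristic:)

#include <link.h>
#include <cstdio>
#include <cstring>

// Called once per loaded shared object; print anything that looks like an
// OpenMP runtime (libomp, libiomp5, libgomp).
static int visit(struct dl_phdr_info *info, size_t, void *) {
  if (info->dlpi_name && std::strstr(info->dlpi_name, "omp"))
    std::printf("OpenMP runtime loaded: %s\n", info->dlpi_name);
  return 0;
}

int main() {
  // A real check would dlopen("libmxnet.so", RTLD_NOW) first so that its
  // dependencies are loaded; omitted to keep the sketch short.
  dl_iterate_phdr(visit, nullptr);
  return 0;
}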






On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> I've also quite often seen two versions of OpenMP linked.  I think we can
> all agree we probably want to avoid linking in two libraries that do
> effectively the same thing.
>
> The performance questions should be fairly straightforward to demonstrate
> right?  Could we just collaborate on a few minimal Dockerfiles that show
> (or don't show) Intel OpenMP performance speedups with the workloads Chris
> is referencing?
>
> On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> stanislav.tsuk...@gmail.com> wrote:
>
> > Hi, Chris!
> >
> > Stas here - I've gathered that performance data.
> > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > missing.
> > Be assured, intentional misdirection was never a case.
> >
> > Thanks a lot for being constructive.
> >
> > > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> in
> > omp, depending which one is linked in).
> >
> > We never ever considered turning MKL off. We are on the same page here -
> > MKL is crucial for the performance.
> > Why should we? There's a GOMP-linked version of MKL, that we can use.
> >
> > What we did was measure whether using the compiler's default OpenMP
> > implementation instead of the referenced source-code distribution of OpenMP
> > makes anything slower.
> > We have found the impact to be hardly measurable.
> > The difference between GOMP and iOMP is <5% on our benchmarks, most of
> the
> > time less than that.
> >
> > We just suggest simplifying the build of mxnet by removing the
> > unnecessary dependency.
> >
> > During that we discovered for example the following amazing issue:
> > https://github.com/apache/incubator-mxnet/issues/14087
> >
> > Best Regards
> >
> > Stas
> >
> > On 18.06.19, 18:24, "Chris Olivier"  wrote:
> >
> > I am very reluctant to feed the trolls again, and this will be the
> last
> > time I address Pedro or Anton on the subject, but since I think the
> > numbers
> > being presented are incorrect (either by the builders not really
> > understanding what they are building, or possibly intentional
> > misdirection):
> >
> > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> > in
> > omp, depending which one is linked in).
> > There is a HUGE difference.  This is consistent with my experience
> > before
> > when it was added.
> >
> >
> > default mnist:
> >
> > python ../example/image-classification/train_mnist.py
> > INFO:root:start with arguments Namespace(add_stn=False,
> batch_size=64,
> > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> > gpus=None, image_shape='1, 28, 28', initializer='default',
> > kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
> > lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
> > monitor=0, network='mlp', num_classes=10, num_epochs=20,
> > num_examples=6, num_layers=None, optimizer='sgd',
> > profile_server_suffix='', profile_worker_suffix='', save_period=1,
> > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear',
> > wd=0.0001)
> >
> > INTEL OMP:
> >
> > ldd libmxnet.so | grep omp
> > libomp.so =>
> > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > (0x7f978fde7000)
> >
> > :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec
> > accuracy=0.780012
> > INFO:root:Epoch[0] Batch [100-20

OMP

2019-06-18 Thread Chris Olivier
I am very reluctant to feed the trolls again, and this will be the last
time I address Pedro or Anton on the subject, but since I think the numbers
being presented are incorrect (either by the builders not really
understanding what they are building, or possibly intentional misdirection):

Turning Intel OMP on and off (and MKL as well, since it tends to pull in
omp, depending which one is linked in).
There is a HUGE difference.  This is consistent with my experience before
when it was added.


default mnist:

python ../example/image-classification/train_mnist.py
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
gpus=None, image_shape='1, 28, 28', initializer='default',
kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
monitor=0, network='mlp', num_classes=10, num_epochs=20,
num_examples=6, num_layers=None, optimizer='sgd',
profile_server_suffix='', profile_worker_suffix='', save_period=1,
test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)

INTEL OMP:

ldd libmxnet.so | grep omp
libomp.so =>
/home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
(0x7f978fde7000)

:root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec
accuracy=0.780012
INFO:root:Epoch[0] Batch [100-200]  Speed: 16073.21 samples/sec
accuracy=0.920469
INFO:root:Epoch[0] Batch [200-300]  Speed: 19075.91 samples/sec
accuracy=0.928281
INFO:root:Epoch[0] Batch [300-400]  Speed: 23211.36 samples/sec
accuracy=0.942813
INFO:root:Epoch[0] Batch [400-500]  Speed: 22139.79 samples/sec
accuracy=0.938750
INFO:root:Epoch[0] Batch [500-600]  Speed: 23225.52 samples/sec
accuracy=0.946562
INFO:root:Epoch[0] Batch [600-700]  Speed: 19547.41 samples/sec
accuracy=0.953281
INFO:root:Epoch[0] Batch [700-800]  Speed: 24111.73 samples/sec
accuracy=0.951562
INFO:root:Epoch[0] Batch [800-900]  Speed: 13959.88 samples/sec
accuracy=0.957500
INFO:root:Epoch[0] Train-accuracy=0.925423
INFO:root:Epoch[0] Time cost=3.806
INFO:root:Epoch[0] Validation-accuracy=0.962580
INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec
accuracy=0.968131
INFO:root:Epoch[1] Batch [100-200]  Speed: 23457.03 samples/sec
accuracy=0.966250


LIBGOMP:

ldd libmxnet.so | grep omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x7f25c25dd000)

INFO:root:Epoch[0] Batch [0-100]Speed: 1731.01 samples/sec
 accuracy=0.782488
INFO:root:Epoch[0] Batch [100-200]  Speed: 3551.32 samples/sec
 accuracy=0.907813
INFO:root:Epoch[0] Batch [200-300]  Speed: 1991.00 samples/sec
 accuracy=0.927188
INFO:root:Epoch[0] Batch [300-400]  Speed: 2175.45 samples/sec
 accuracy=0.937969
INFO:root:Epoch[0] Batch [400-500]  Speed: 1644.95 samples/sec
 accuracy=0.942187
INFO:root:Epoch[0] Batch [500-600]  Speed: 6444.58 samples/sec
 accuracy=0.950156
INFO:root:Epoch[0] Batch [600-700]  Speed: 7842.16 samples/sec
 accuracy=0.947969
INFO:root:Epoch[0] Batch [700-800]  Speed: 9412.07 samples/sec
 accuracy=0.953750
INFO:root:Epoch[0] Batch [800-900]  Speed: 12707.58 samples/sec
accuracy=0.953125

That being said, there are other issues beyond speed.  The DEFAULT build from
makefile (not CMake) uses Intel OMP mkl (I showed before) and mysteriously
it has no issues?  This seems highly suspicious.  All I see is a lot of
hand-waving and conjecture and pointing to StackOverflow posts made by
people who may be of questionable pedigree to begin with.  This smells of a
Pedro-ego-fight rather than one of purely technical merit.  Also, if one
knows how OMP works,  they would be very suspicious of the "intermittent
hangs" claim -- that's probably just broken race conditions elsewhere until
proven differently.  It'd tend to freeze on the first use if something is
wrong (try using libgomp after a fork and see), since worker threads
wouldn't be assigned/joined properly.  IntelOMP is faster, but also has
other advantages, such as allowing OMP after a fork.

I actually addressed a lot of issues and ask for clarification in the
original PR's way back when, but they're all just ignored.

-Chris


Re: [VOTE] Remove conflicting OpenMP from CMake builds

2019-06-17 Thread Chris Olivier
I am curious why you're being such a militant troll about this.  libomp is used
in every MKL build (download mxnet-mkl yourself and see).  I don't see any
convincing reason to change it and so far as I can tell, no real issue has
been proven to be related.  Anyway, I am reluctant to feed trolls any more
than this, so I don't really have much else to add.

ldd /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so
linux-vdso.so.1 (0x7ffc989cf000)
libmklml_intel.so =>
/usr/local/lib/python3.6/dist-packages/mxnet/libmklml_intel.so
(0x7f0afb7c1000)
   * libiomp5.so =>
/usr/local/lib/python3.6/dist-packages/mxnet/libiomp5.so
(0x7f0afb3e5000)*
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f0afb1dd000)
libmkldnn.so.0 =>
/usr/local/lib/python3.6/dist-packages/mxnet/libmkldnn.so.0
(0x7f0afa7ba000)
libgfortran.so.3 =>
/usr/local/lib/python3.6/dist-packages/mxnet/libgfortran.so.3
(0x7f0afa493000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f0afa28f000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(0x7f0af9f06000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f0af9b68000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1
(0x7f0af995)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
(0x7f0af9731000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f0af934)
/lib64/ld-linux-x86-64.so.2 (0x7f0b073f4000)
libquadmath.so.0 =>
/usr/local/lib/python3.6/dist-packages/mxnet/libquadmath.so.0
(0x7f0af910)


On Mon, Jun 17, 2019 at 10:58 AM Pedro Larroy 
wrote:

> I had read the "Apache Voting Process" guide here:
> https://www.apache.org/foundation/voting.html  and I thought code
> changes could be discussed on the mailing list in cases where the PR
> is stuck or there's no response for a long time; I also understood
> that -1's have to be justified.  Could you, or someone more familiar
> with the Apache way, enlighten us on how to move this issue forward in a
> constructive way?
>
> Thanks a lot.
>
> Pedro.
>
> On Mon, Jun 17, 2019 at 10:46 AM Pedro Larroy
>  wrote:
> >
> > Thanks.
> >
> > How do we go about advancing this PR then? All the questions have been
> > answered, performance numbers provided, and more. How long can a
> > veto stand, and without replies to contributors?
> >
> > Pedro.
> >
> > On Fri, Jun 14, 2019 at 5:44 PM Sheng Zha  wrote:
> > >
> > > This vote is invalid as the original PR has been vetoed by a
> committer. A vote on dev@ won't help you circumvent a veto.
> > >
> > > -sz
> > >
> > > On 2019/06/14 23:59:33, Pedro Larroy 
> wrote:
> > > > Hi all
> > > >
> > > > This is a 5-day vote to act and wrap up an outstanding PR that
> removes
> > > > linkage with multiple OpenMP from 3rdparty and uses the system
> > > > provided one which might resolve a number of difficult to debug
> issues
> > > > and possible undefined behaviour.
> > > >
> > > > https://github.com/apache/incubator-mxnet/pull/12160
> > > >
> > > > See the comments in the thread for more details but in short, linking
> > > > with multiple openmp versions seems to lead to undefined behaviour,
> > > > plus it would simplify not having to deal with a custom openmp
> version
> > > > and rely on the platform provided one.
> > > >
> > > > This is expected to simplify builds and address a number of problems.
> > > > Seems it doesn't cause any performance degradation, (the Gluon tests
> > > > run almost 4x faster in my 64 core machine).
> > > >
> > > > There has been in depth study of performance implications by
> > > > contributors like Stanislav Tsukrov and Anton Chernov.  All the
> > > > concerns and comments from the reviewers have been addressed and we
> > > > can't keep asking open ended questions to block PRs. Reviewers are
> > > > expected to be proactive and responsive to contributors so we keep
> > > > encouraging active contributors.
> > > >
> > > > please vote to merge this PR accordingly:
> > > >
> > > > +1 = approve
> > > > +0 = no opinion
> > > > -1 = disapprove (provide reason)
> > > >
> > > > If we observe regressions reported by any internal performance
> systems
> > > > or by contributors the PR can be reverted easily. So it's not a one
> > > > way door. But it will be useful to try this in master for a while.
> > > >
> > > > Thank you.
> > > >
> > > > Pedro.
> > > >
>


Re: DGL crashes in the recent master branch

2019-05-21 Thread Chris Olivier
Might be helpful if you wrote a unit test for this and other behaviors that
DGL depends upon to reduce the likelihood that it happens again.  Just a
suggestion.  That would show good ownership, imho.

On Tue, May 21, 2019 at 6:11 PM Chris Olivier  wrote:

> Thanks for clarifying, Da.
>
> On Tue, May 21, 2019 at 5:44 PM Zheng, Da 
> wrote:
>
>> DGL is a framework of deep learning on graphs. https://www.dgl.ai/
>>
>> It's not that MXNet is responsible to be compatible with DGL. The crashes
>> are caused by bugs in MXNet.
>>
>> Best,
>> Da
>>
>> On 5/21/19, 5:39 PM, "Chris Olivier"  wrote:
>>
>> Curious what is DGL and what is Apache/MXNet’s responsibility to it to
>> maintain compatibility rather than the other way around?
>>
>> On Tue, May 21, 2019 at 3:39 PM Zheng, Da 
>> wrote:
>>
>> > Hello all,
>> >
>> > I recently found that DGL doesn't run with the recent MXNet. DGL
>> > crashes with memory errors.
>> > Yesterday we identified a bug in DLPack and Junru has implemented a
>> > fix: https://github.com/apache/incubator-mxnet/pull/15016
>> > However, there are some other bugs that cause DGL to crash with a
>> > memory error. I'm still searching among the PRs to identify the one that
>> > causes the issue. I think we should make sure that the MXNet 1.5 release
>> > works with DGL correctly.
>> >
>> > Best,
>> > Da
>> >
>>
>>
>>


Re: DGL crashes in the recent master branch

2019-05-21 Thread Chris Olivier
Thanks for clarifying, Da.

On Tue, May 21, 2019 at 5:44 PM Zheng, Da  wrote:

> DGL is a framework of deep learning on graphs. https://www.dgl.ai/
>
> It's not that MXNet is responsible to be compatible with DGL. The crashes
> are caused by bugs in MXNet.
>
> Best,
> Da
>
> On 5/21/19, 5:39 PM, "Chris Olivier"  wrote:
>
> Curious what is DGL and what is Apache/MXNet’s responsibility to it to
> maintain compatibility rather than the other way around?
>
> On Tue, May 21, 2019 at 3:39 PM Zheng, Da 
> wrote:
>
> > Hello all,
> >
> > I recently found that DGL doesn't run with the recent MXNet. DGL
> > crashes with memory errors.
> > Yesterday we identified a bug in DLPack and Junru has implemented a
> > fix: https://github.com/apache/incubator-mxnet/pull/15016
> > However, there are some other bugs that cause DGL to crash with a
> > memory error. I'm still searching among the PRs to identify the one that
> > causes the issue. I think we should make sure that the MXNet 1.5 release
> > works with DGL correctly.
> >
> > Best,
> > Da
> >
>
>
>


Re: DGL crashes in the recent master branch

2019-05-21 Thread Chris Olivier
Curious what is DGL and what is Apache/MXNet’s responsibility to it to
maintain compatibility rather than the other way around?

On Tue, May 21, 2019 at 3:39 PM Zheng, Da  wrote:

> Hello all,
>
> I recently found that DGL doesn't run with the recent MXNet. DGL crashes with
> memory errors.
> Yesterday we identified a bug in DLPack and Junru has implemented a
> fix: https://github.com/apache/incubator-mxnet/pull/15016
> However, there are some other bugs that cause DGL to crash with a memory
> error. I'm still searching among the PRs to identify the one that causes
> the issue. I think we should make sure that the MXNet 1.5 release works with
> DGL correctly.
>
> Best,
> Da
>


Re: Running MxNet compiled with Cuda 9 on a machine with Cuda 10

2019-02-06 Thread Chris Olivier
I don’t recall any special reason that libcudart and the rest of the cuda
libs (other than the driver lib libcuda.so.1) were not linked statically.
Is there any reason they are not?
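For what it's worth, a short probe makes it easy to see which runtime a
binary actually loads. Compiled with "nvcc -cudart static" it carries its own
runtime; with "-cudart shared" it needs a matching libcudart.so.X at load
time (flag names per nvcc's documentation; verify against your toolkit
version):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int v = 0;
  cudaError_t err = cudaRuntimeGetVersion(&v);
  if (err != cudaSuccess) {
    std::printf("runtime error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  // Encoded as 1000*major + 10*minor, e.g. 9000 -> 9.0, 10000 -> 10.0.
  std::printf("CUDA runtime version: %d.%d\n", v / 1000, (v % 1000) / 10);
  return 0;
}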

On Wed, Feb 6, 2019 at 3:03 PM Timur Shenkao  wrote:

> From my experience with MxNet (and TF), nothing good comes out of such
> attempts.
> It will look for several CUDA 9 specific dependencies.
> I ended up having several CUDA installations (9.2, 10, etc.) on one server.
>
> On Wed, Feb 6, 2019 at 8:58 PM Seok Hyun Jin  wrote:
>
> > Hi,
> >
> > I’m trying to run MxNet compiled with Cuda 9 on a machine with only Cuda
> > 10.
> > For other Cuda applications, this seems to be possible as long as I
> specify
> > the correct target GPU generation and statically link the runtime into
> the
> > application when I run ‘nvcc’ (at the cost of having a much fatter
> binary).
> >
> > But does this hold true for MxNet as well? Is there anything about MxNet
> > that makes it not runnable on a machine with a more recent Cuda version?
> >
> > I'm using containerized build for Jetson to build the libmxnet.so myself.
> >
> > I tried running MxNet with Cuda 9 on a Cuda 10 machine but it keeps looking
> > for libcudart.so.9.0.
> >
> > Thanks!
> >
> > Best,
> >
> > Jin
> >
>


[Announcement] New Committer -- Steffen Rochel

2019-02-04 Thread Chris Olivier
Dear Community:

Please join me to welcome Steffen Rochel (steffenroc...@gmail.com) as a new
committer of Apache (incubating) MXNet!

Steffen has played a role in nearly every MXNet release in the past 18
months, managed several of the wiki pages and has contributed in expanding
the community by managing and hosting meetups in different parts of the
world.

-Chris


Re: Order of includes in cpplint

2019-01-08 Thread Chris Olivier
Opinions vary.  There are arguments the other way, too, such as that an ideal
precompiled-header compiler would work best with invariant headers first, as
well as template matching order in some cases.

On Tue, Jan 8, 2019 at 2:44 PM Pedro Larroy 
wrote:

> Hi MXNet community
>
> cpplint seems to complain when the order of includes is not  [own,
> system, other]
>
> But the general best practice in C++ is [own, project, 3rd party,
> system] for the reasons explained in this stackoverflow answer:  (
> https://stackoverflow.com/questions/614302/c-header-order )
>
> A contribution to cpplint could be made to make this configurable:
>
> https://github.com/cpplint/cpplint/blob/master/cpplint.py#L605
>
> Thoughts?
>
> Pedro.
>
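For concreteness, the ordering proposed in the quoted message looks like this
in a typical .cc file (the file and header names are illustrative):

// foo/bar.cc -- include order: own header, project, third-party, system.
#include "foo/bar.h"         // 1. the header this file implements

#include "foo/util.h"        // 2. other headers from the same project

#include <dmlc/logging.h>    // 3. third-party headers

#include <string>            // 4. system / standard library headers
#include <vector>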


Re: Julia Package Release Process

2019-01-07 Thread Chris Olivier
Do you not have write permissions on mxnet repo?

On Mon, Jan 7, 2019 at 6:13 AM iblis  wrote:

> just found that I don't have the permission to transfer issues of
> dmlc/MXNet.jl.
> Could anyone help me on this?
>
> On 1/7/19 12:16 PM, iblis wrote:
> > okay.
> > Before disabling the issue tracker, I'm going to transfer the issues
> > from MXNet.jl to the main repo.
> > (via
> https://help.github.com/articles/transferring-an-issue-to-another-repository/
> )
> >
> > On 1/7/19 12:17 AM, Chris Olivier wrote:
> >> +1 for disabling issue tracker and putting a note on original repo (if
> it
> >> isn’t already there) that work/issue tracking has moved to mxnet (using
> >> julia label in github or Jira).
> >>
> >>
> >> On Sun, Jan 6, 2019 at 1:19 AM iblis  wrote:
> >>
> >>> Before PR #10149 got merged (Oct 5, 2018) into main repo,
> >>> julia code is developed and maintained in the separate repo --
> >>> dmlc/MXNet.jl.
> >>>
> >>> After that PR, no further development has happened in
> >>> dmlc/MXNet.jl.
> >>> We work with the main repo now.
> >>> But the original MXNet.jl repo is still there, it just isn't deleted or
> >>> archived yet.
> >>> I receive some issue tickets from this repo occasionally;
> >>> maybe we should just disable the issue tracker of it.
> >>>
> >>>> Does it work with other frameworks other than mxnet?
> >>> hmm, I'm not sure what you mean.
> >>>
> >>> On 1/6/19 1:18 PM, Chris Olivier wrote:
> >>>> Curious:  Why is the julia code maintained in a separate repo? I was
> >>> under
> >>>> the impression that it was donated/permanently merged into the mxnet
> >>> source
> >>>> tree.  Does it work with other frameworks other than mxnet?
> >>>>
> >>>> On Sat, Jan 5, 2019 at 8:32 PM Iblis Lin 
> wrote:
> >>>>
> >>>>> If there is trademark issue, how about this option:
> >>>>>  3) transferring the MXNet.jl repo ownership from DMLC to Apache.
> >>>>>
> >>>>> On 1/6/19 6:45 AM, Justin Mclean wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>>> 1) Reuse the old repo: https://github.com/dmlc/MXNet.jl
> >>>>>>>It's under DMLC. I have the committer bit of this repo.
> >>>>>>
> >>>>>> I'm not 100% sure that would be allowed from a branding/trademark
> >>>>> perspective, any distribution owned by a 3rd party can't be called
> >>> "Apache
> >>>>> MXNet".
> >>>>>>
> >>>>>>> 2) A new repo under ASF's organization?
> >>>>>>
> >>>>>> That seems preferable I think.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Justin
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Iblis Lin
> >>>>> 林峻頤
> >>>>>
> >>>>
> >>>
> >>> --
> >>> Iblis Lin
> >>> 林峻頤
> >>>
> >>
> >
>
> --
> Iblis Lin
> 林峻頤
>


Re: Julia Package Release Process

2019-01-06 Thread Chris Olivier
+1 for disabling issue tracker and putting a note on original repo (if it
isn’t already there) that work/issue tracking has moved to mxnet (using
julia label in github or Jira).


On Sun, Jan 6, 2019 at 1:19 AM iblis  wrote:

> Before PR #10149 got merged (Oct 5, 2018) into main repo,
> julia code is developed and maintained in the separate repo --
> dmlc/MXNet.jl.
>
> After that PR, no further development has happened in dmlc/MXNet.jl.
> We work with the main repo now.
> But the original MXNet.jl repo is still there; it just isn't deleted or
> archived yet.
> I receive some issue tickets from this repo occasionally;
> maybe we should just disable its issue tracker.
>
> > Does it work with other frameworks other than mxnet?
> hmm, I'm not sure what you mean.
>
> On 1/6/19 1:18 PM, Chris Olivier wrote:
> > Curious:  Why is the julia code maintained in a separate repo? I was
> under
> > the impression that it was donated/permanently merged into the mxnet
> source
> > tree.  Does it work with other frameworks other than mxnet?
> >
> > On Sat, Jan 5, 2019 at 8:32 PM Iblis Lin  wrote:
> >
> >> If there is trademark issue, how about this option:
> >> 3) transferring the MXNet.jl repo ownership from DMLC to Apache.
> >>
> >> On 1/6/19 6:45 AM, Justin Mclean wrote:
> >>> Hi,
> >>>
> >>>>1) Reuse the old repo: https://github.com/dmlc/MXNet.jl
> >>>>   It's under DMLC. I have the committer bit of this repo.
> >>>
> >>> I'm not 100% sure that would be allowed from a branding/trademark
> >> perspective, any distribution owned by a 3rd party can't be called
> "Apache
> >> MXNet".
> >>>
> >>>>2) A new repo under ASF's organization?
> >>>
> >>> That seems preferable I think.
> >>>
> >>> Thanks,
> >>> Justin
> >>>
> >>
> >> --
> >> Iblis Lin
> >> 林峻頤
> >>
> >
>
> --
> Iblis Lin
> 林峻頤
>


Re: Julia Package Release Process

2019-01-05 Thread Chris Olivier
Curious:  Why is the julia code maintained in a separate repo? I was under
the impression that it was donated/permanently merged into the mxnet source
tree.  Does it work with frameworks other than mxnet?

On Sat, Jan 5, 2019 at 8:32 PM Iblis Lin  wrote:

> If there is trademark issue, how about this option:
>3) transferring the MXNet.jl repo ownership from DMLC to Apache.
>
> On 1/6/19 6:45 AM, Justin Mclean wrote:
> > Hi,
> >
> >>   1) Reuse the old repo: https://github.com/dmlc/MXNet.jl
> >>  It's under DMLC. I have the committer bit of this repo.
> >
> > I'm not 100% sure that would be allowed from a branding/trademark
> perspective, any distribution owned by a 3rd party can't be called "Apache
> MXNet".
> >
> >>   2) A new repo under ASF's organization?
> >
> > That seems preferable I think.
> >
> > Thanks,
> > Justin
> >
>
> --
> Iblis Lin
> 林峻頤
>


Re: Cambricon MLU support for MXNet.

2018-12-16 Thread Chris Olivier
small point: mshadow is being deprecated. probably you shouldn’t invest too
much time on it. just an FYI

On Sun, Dec 16, 2018 at 6:33 PM 张昊翀  wrote:

> Dear MXNet community,
>
> We are from Cambricon, a leading supplier of artificial intelligence
> chips. We have two product lines, including IP products (e.g., Cambricon
> 1A/1H) and chip products (e.g., MLU100 released in May 2018)
>
> We are now adapting MXNet to Cambricon products. As a follow-up, we plan
> to open source this work, and hope to merge these new features into
> the master branch of MXNet so they become part of MXNet's long-term support.
> We firmly believe that these MLU features will promote the development of
> the MXNet community.
> To this end, we are ready to accept the rigorous inspection of the MXNet
> community. In addition, we need advice from the community to achieve a
> high-quality implementation. On this basis, we very much hope to reach
> full-scale, long-term cooperation with the community.
>
> In order to achieve the above goals, we hope to keep in touch with the
> community on some issues. Looking forward to your valuable feedback.
>
> 1. MLU100 mainly focuses on inference, and we plan to first support the
> inference part of MXNet. The training part of MXNet on MLU will be released
> in the future. Is that acceptable for the MXNet community?
>
> 2. Though MLU can support various operators/networks, to guarantee high
> quality, all supported operators submitted to the community should undergo
> rigorous stress test. Thus, at the beginning, we plan to release a small
> number of supported operators and networks, and more of them will be
> continuously added. Is that acceptable or do we have to support all
> networks in the ModelZoo in the first release?
>
> 3. Currently we plan to support both Python and C++ APIs. More details on
> supported APIs will be provided in a follow-up proposal.
>
> 4. We need to modify the mShadow in order to support tensor memory
> operations.
>
> 5. In order to enable the community to run and fully test our code, we
> want to provide the community with a complete test environment. At present,
> we are considering the following three ways.
> A) Provides several remote servers for community and integrates with the
> community's Jenkins.
> B) Provide a cloud platform to the community.
> C) Donate MLU100 to the community's testing platform. However, we don’t
> know the specific process for donating hardware, and we hope to get help. We
> are wondering how MXNet's test servers are managed.
>
> About more technical details, a proposal will be submitted to the
> community before releasing the code.
>
> In addition to the above points, the remaining questions and suggestions
> are also welcome. Thanks!
>
> More about Cambricon:
> Cambricon is the artificial intelligence computing pioneer that engineers
> and successfully commercializes world’s first dedicated machine learning
> processor. To bring its unique AI processors from edge to cloud, enriching
> and advancing human life, is the firm mission of the company. Dr. Tianshi
> Chen is the founder and CEO of Cambricon, where he brings over 10 years
> experience in the fields of micro-processor architecture and artificial
> intelligence.
> In 2016, Cambricon released Cambricon 1A processor, the first commercial
> machine learning specific processor in the world. Later, during the 3rd
> World Internet Conference, Cambricon 1A processor was elected as one of
> “World Leading Internet Scientific and Technological Achievements“. In May
> 2018, Cambricon released MLU100, a machine learning chip which is in mass
> production now. By offering revolutionary technology and products,
> Cambricon has established and remains active relationships with various
> companies in the AI industry.
>
>
> Regards,
> Haochong Zhang
> Cambricon MXNet Development Team
>
>
>


Re: Scala standard library is included in the mxnet jar

2018-12-04 Thread Chris Olivier
Pedro,

It would be polite to ask if there is a reason it is included before
categorically declaring it is wrong.

I am not involved in the scala library and what's included in it, but maybe
there's a good reason for it. Or maybe there isn't.  Either way, it's best
to ask first :)

Thanks,

-Chris


Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Chris Olivier
I don’t think that does anything at all, as stated in my other email.
Someone can look into the omp code to be sure but my suspicion is that the
environment variable is only read on startup, and at any rate, it is better
to set it through the API at runtime.

On Thu, Nov 29, 2018 at 8:11 AM Pedro Larroy 
wrote:

> To be precise, what would be the consequences of not having these env
> variables set in the engine threads related to OMP?
> Given your experience with OpenMP I hope you can help us answer these
> questions.
>
> Hopefully we can get the same effect (if any) of these setenvs using
> some openmp call or a pragma. Definitely we shouldn't be mutating the
> environment from a different thread from what I understand, which is
> the likely cause of the random crashes some users are experiencing.
>
> Pedro
> On Thu, Nov 29, 2018 at 5:00 PM Pedro Larroy
>  wrote:
> >
> > Chris.  The problem is with setenv, not with getenv. We don't want to
> > remove any getenv call, just these misplaced setenvs:
> >
> >
> >
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61
> >
> > Please check the code above carefully and give us your feedback. Based
> > on your email I think we don't yet have a common understanding of the
> > root cause of this issue.
> >
> > Pedro.
> > On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier 
> wrote:
> > >
> > > - getenv should be thread safe as long as nothing is calling
> putenv/setenv
> > > in another thread (the environment doesn’t change) as stated here:
> > >
> > > http://www.cplusplus.com/reference/cstdlib/getenv/
> > >
> > > it’s a simple library call, so to be sure either way, one can check the
> > > actual source and see (in case some particular implementation is
> acting in
> > > a particularly thread-unsafe manner). This should be vetted before
> making
> > > any high-impact decisions such as trying to go remove every getenv
> call in
> > > the whole system.
> > >
> > > - locking after fork is possibly due to libgomp not supporting forking
> such
> > > that after a fork, a call is made to release the blocked omp threads
> and
> > > the main thread waits for the omp threads to finish, but the omp
> threads
> > > belong to the pre-forked process and thus never execute, causing that
> > > forked process to freeze.  This behavior has been witnessed before.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > Hi all.
> > > >
> > > > There are two important issues / fixes that should go in the next
> > > > release in my radar:
> > > >
> > > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > > There is a bug in shape inference on CPU when not using MKL, also we
> > > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > > I'm finishing a fix for these issues in the above PR.
> > > >
> > > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > > Setenv / getenv from multiple threads is not safe and is causing
> > > > segfaults. This piece of code (the handlers in pthread_atfork)
> already
> > > > caused a very difficult to diagnose hang in a previous release, where
> > > > a fork inside cudnn would deadlock the engine.
> > > >
> > > > I would remove setenv from 2) as a mitigation, but we would need to
> > > > check for regressions as we could be creating additional threads
> > > > inside the engine.
> > > >
> > > > I would suggest that we address these two major issues before the
> next
> > > > release.
> > > >
> > > > Pedro
> > > >
> > > >
> > > >
> > > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > > release.
> > > > > Sergey Kolychev will be co-managing the release and providing help
> from
> > > > the
> > > > > committers side.
> > > > > A release candidate will be cut on November 29, 2018 and voting
> will
>

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Chris Olivier
By the way, have you traced a problem to these calls?

I am a bit skeptical that this is problematic here for the following reason:

At the time of atfork(), the new process doesn’t have any other threads to
speak of that are calling getenv(). Any globals from the last process are
owned by that process and copy-on-write in the new process. This would mean
that the getenv() in the old process wouldn’t be affected by putenv() in
the newly forked process and like I said, at this time, the newly forked
process tends to be single-threaded.
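The scenario worth separating from this fork-time question is an engine
thread calling getenv() while another thread calls setenv(). A minimal POSIX
sketch of that hazard; whether it actually crashes depends on the libc, so
this is an illustration of the race, not a guaranteed repro:

#include <cstdlib>
#include <thread>

int main() {
  setenv("OMP_NUM_THREADS", "4", 1);
  std::thread reader([] {
    for (int i = 0; i < 1000000; ++i) {
      const char *v = std::getenv("OMP_NUM_THREADS");
      if (v) (void)v[0];                   // may touch memory setenv() freed
    }
  });
  std::thread writer([] {
    for (int i = 0; i < 1000000; ++i)
      setenv("OMP_NUM_THREADS", (i & 1) ? "8" : "16", 1);
  });
  reader.join();
  writer.join();
  return 0;
}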



On Thu, Nov 29, 2018 at 8:11 AM Pedro Larroy 
wrote:

> To be precise, what would be the consequences of not having these env
> variables set in the engine threads related to OMP?
> Given your experience with OpenMP I hope you can help us answer these
> questions.
>
> Hopefully we can get the same effect (if any) of these setenvs using
> some openmp call or a pragma. Definitely we shouldn't be mutating the
> environment from a different thread from what I understand, which is
> the likely cause of the random crashes some users are experiencing.
>
> Pedro
> On Thu, Nov 29, 2018 at 5:00 PM Pedro Larroy
>  wrote:
> >
> > Chris.  The problem is with setenv, not with getenv. We don't want to
> > remove any getenv call, just these misplaced setenvs:
> >
> >
> >
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61
> >
> > Please check the code above carefully and give us your feedback. Based
> > on your email I think we don't yet have a common understanding of the
> > root cause of this issue.
> >
> > Pedro.
> > On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier 
> wrote:
> > >
> > > - getenv should be thread safe as long as nothing is calling
> putenv/setenv
> > > in another thread (the environment doesn’t change) as stated here:
> > >
> > > http://www.cplusplus.com/reference/cstdlib/getenv/
> > >
> > > it’s a simple library call, so to be sure either way, one can check the
> > > actual source and see (in case some particular implementation is
> acting in
> > > a particularly thread-unsafe manner). This should be vetted before
> making
> > > any high-impact decisions such as trying to go remove every getenv
> call in
> > > the whole system.
> > >
> > > - locking after fork is possibly due to libgomp not supporting forking
> such
> > > that after a fork, a call is made to release the blocked omp threads
> and
> > > the main thread waits for the omp threads to finish, but the omp
> threads
> > > belong to the pre-forked process and thus never execute, causing that
> > > forked process to freeze.  This behavior has been witnessed before.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > Hi all.
> > > >
> > > > There are two important issues / fixes that should go in the next
> > > > release in my radar:
> > > >
> > > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > > There is a bug in shape inference on CPU when not using MKL, also we
> > > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > > I'm finishing a fix for these issues in the above PR.
> > > >
> > > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > > Setenv / getenv from multiple threads is not safe and is causing
> > > > segfaults. This piece of code (the handlers in pthread_atfork)
> already
> > > > caused a very difficult to diagnose hang in a previous release, where
> > > > a fork inside cudnn would deadlock the engine.
> > > >
> > > > I would remove setenv from 2) as a mitigation, but we would need to
> > > > check for regressions as we could be creating additional threads
> > > > inside the engine.
> > > >
> > > > I would suggest that we address these two major issues before the
> next
> > > > release.
> > > >
> > > > Pedro
> > > >
> > > >
> > > >
> > > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Dear MXNet community,
> > > > >
> > > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> >

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Chris Olivier
I see. Yeah probably those can be removed. I haven’t checked the source,
but I would be surprised if omp even looked at the environment variable
after initial startup since looking up environment variables is a slow
linear search each time.
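
[For what it's worth, the usual way to avoid repeated lookups is to read the
variable once at startup and cache it — a sketch of the pattern; the variable
name here is a made-up placeholder, not a real MXNet or OpenMP variable.]

#include <cstdlib>

/* getenv() is a linear scan over environ on common libcs, so read once
   and cache the result.  "EXAMPLE_TUNING_VAR" is a placeholder name. */
static const int kTuningValue = [] {
    const char* v = std::getenv("EXAMPLE_TUNING_VAR");
    return v ? std::atoi(v) : 0;
}();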

On Thu, Nov 29, 2018 at 8:09 AM Pedro Larroy 
wrote:

> Chris.  The problem is with setenv, not with getenv. We don't want to
> remove any getenv call, just these misplaced setenvs:
>
>
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61
>
> Please check the code above carefully and give us your feedback. Based
> on your email I think we don't yet have a common understanding of the
> root cause of this issue.
>
> Pedro.
> On Thu, Nov 29, 2018 at 4:02 PM Chris Olivier 
> wrote:
> >
> > - getenv should be thread safe as long as nothing is calling
> putenv/setenv
> > in another thread (the environment doesn’t change) as stated here:
> >
> > http://www.cplusplus.com/reference/cstdlib/getenv/
> >
> > it’s a simple library call, so to be sure either way, one can check the
> > actual source and see (in case some particular implementation is acting
> in
> > a particularly thread-unsafe manner). This should be vetted before making
> > any high-impact decisions such as trying to go remove every getenv call
> in
> > the whole system.
> >
> > - locking after fork is possibly due to libgomp not supporting forking
> such
> > that after a fork, a call is made to release the blocked omp threads and
> > the main thread waits for the omp threads to finish, but the omp threads
> > belong to the pre-forked process and thus never execute, causing that
> > forked process to freeze.  This behavior has been witnessed before.
> >
> >
> >
> >
> > On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > Hi all.
> > >
> > > There are two important issues / fixes that should go in the next
> > > release in my radar:
> > >
> > > 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> > > There is a bug in shape inference on CPU when not using MKL, also we
> > > are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> > > I'm finishing a fix for these issues in the above PR.
> > >
> > > 2) https://github.com/apache/incubator-mxnet/issues/13438
> > > We are seeing crashes due to unsafe setenv in multithreaded code.
> > > Setenv / getenv from multiple threads is not safe and is causing
> > > segfaults. This piece of code (the handlers in pthread_atfork) already
> > > caused a very difficult to diagnose hang in a previous release, where
> > > a fork inside cudnn would deadlock the engine.
> > >
> > > I would remove setenv from 2) as a mitigation, but we would need to
> > > check for regressions as we could be creating additional threads
> > > inside the engine.
> > >
> > > I would suggest that we address these two major issues before the next
> > > release.
> > >
> > > Pedro
> > >
> > >
> > >
> > > On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > wrote:
> > > >
> > > > Dear MXNet community,
> > > >
> > > > I will be the release manager for the upcoming Apache MXNet 1.4.0
> > > release.
> > > > Sergey Kolychev will be co-managing the release and providing help
> from
> > > the
> > > > committers side.
> > > > A release candidate will be cut on November 29, 2018 and voting will
> > > start
> > > > December 7, 2018. Release notes have been drafted here [1]. If you
> have
> > > any
> > > > additional features in progress and would like to include it in this
> > > > release, please assure they have been merged by November 27, 2018.
> > > Release
> > > > schedule is available here [2].
> > > >
> > > > Feel free to add any other comments/suggestions. Please help to
> review
> > > and
> > > > merge outstanding PR's and resolve issues impacting the quality of
> the
> > > > 1.4.0 release.
> > > >
> > > > Regards,
> > > >
> > > > Steffen
> > > >
> > > > [1]
> > > >
> > >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> > > >
> > > > [2]
> > >
> https://cwiki.apache.org/confluence/display/MXN

Re: [Announce] Upcoming Apache MXNet (incubating) 1.4.0 release

2018-11-29 Thread Chris Olivier
- getenv should be thread safe as long as nothing is calling putenv/setenv
in another thread (the environment doesn’t change) as stated here:

http://www.cplusplus.com/reference/cstdlib/getenv/

it’s a simple library call, so to be sure either way, one can check the
actual source and see (in case some particular implementation is acting in
a particularly thread-unsafe manner). This should be vetted before making
any high-impact decisions such as trying to go remove every getenv call in
the whole system.

- locking after fork is possibly due to libgomp not supporting forking such
that after a fork, a call is made to release the blocked omp threads and
the main thread waits for the omp threads to finish, but the omp threads
belong to the pre-forked process and thus never execute, causing that
forked process to freeze.  This behavior has been witnessed before.
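
[The freeze scenario is easy to sketch — whether this actually hangs depends
on the OpenMP runtime in use; an illustration, not MXNet code:]

#include <omp.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Warm up the runtime: this creates the worker thread pool. */
    #pragma omp parallel
    { }

    pid_t pid = fork();
    if (pid == 0) {
        /* The child inherits the parent's OpenMP bookkeeping but not its
           worker threads.  A runtime that does not reinitialize after
           fork can block here forever, waiting on workers that do not
           exist in the child. */
        #pragma omp parallel
        printf("child worker %d\n", omp_get_thread_num());
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}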




On Thu, Nov 29, 2018 at 6:13 AM Pedro Larroy 
wrote:

> Hi all.
>
> There are two important issues / fixes that should go in the next
> release in my radar:
>
> 1) https://github.com/apache/incubator-mxnet/pull/13409/files
> There is a bug in shape inference on CPU when not using MKL, also we
> are running activation on CPU via MKL when we compile CUDNN+MKLDNN.
> I'm finishing a fix for these issues in the above PR.
>
> 2) https://github.com/apache/incubator-mxnet/issues/13438
> We are seeing crashes due to unsafe setenv in multithreaded code.
> Setenv / getenv from multiple threads is not safe and is causing
> segfaults. This piece of code (the handlers in pthread_atfork) already
> caused a very difficult to diagnose hang in a previous release, where
> a fork inside cudnn would deadlock the engine.
>
> I would remove setenv from 2) as a mitigation, but we would need to
> check for regressions as we could be creating additional threads
> inside the engine.
>
> I would suggest that we address these two major issues before the next
> release.
>
> Pedro
>
>
>
> On Sun, Nov 25, 2018 at 11:41 PM Steffen Rochel 
> wrote:
> >
> > Dear MXNet community,
> >
> > I will be the release manager for the upcoming Apache MXNet 1.4.0
> release.
> > Sergey Kolychev will be co-managing the release and providing help from
> the
> > committers side.
> > A release candidate will be cut on November 29, 2018 and voting will
> start
> > December 7, 2018. Release notes have been drafted here [1]. If you have
> any
> > additional features in progress and would like to include it in this
> > release, please assure they have been merged by November 27, 2018.
> Release
> > schedule is available here [2].
> >
> > Feel free to add any other comments/suggestions. Please help to review
> and
> > merge outstanding PR's and resolve issues impacting the quality of the
> > 1.4.0 release.
> >
> > Regards,
> >
> > Steffen
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Notes
> >
> > [2]
> https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> >
> >
> >
> >
> > On Tue, Nov 20, 2018 at 7:15 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Spoke too soon[1], looks like others have been adding Turing support as
> > > well (thanks to those helping with this).  I believe there's still a
> few
> > > changes we'd have to make to claim support though (mshadow CMake
> changes,
> > > PyPi package creation tweaks).
> > >
> > > 1:
> > >
> > >
> https://github.com/apache/incubator-mxnet/commit/2c3357443ec3d49a11e93c89f278264ce10c2f08
> > >
> > > On Tue, Nov 20, 2018 at 7:00 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Hey Steffen, I'd like to be able to merge this PR for version 1.4:
> > > > https://github.com/apache/incubator-mxnet/pull/13310 . It fixes a
> > > > regression in master which causes incorrect feature vectors to be
> output
> > > > when using the TensorRT feature.  (Thanks to Nathalie for helping me
> > > track
> > > > down the root cause of the issue).   I'm currently blocked on a CI
> issue
> > > I
> > > > haven't seen before, but hope to have it resolved by EOW.
> > > >
> > > > One call-out I would make is that we currently don't support Turing
> > > > architecture (sm_75).  I've been slowly trying to add support, but I
> > > don't
> > > > think I'd have capacity to do this done by EOW.  Does anyone feel
> > > strongly
> > > > we need this in the 1.4 release?  From my perspective this will
> already
> > > be
> > > > a strong release without it.
> > > >
> > > > On Tue, Nov 20, 2018 at 6:42 PM Steffen Rochel <
> steffenroc...@gmail.com>
> > > > wrote:
> > > >
> > > >> Thanks Patrick, lets target to get the PR's merged this week.
> > > >>
> > > >> Call for contributions from the community: Right now we have 10 PR
> > > >> awaiting
> > > >> merge
> > > >> <
> > > >>
> > >
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge+
> > > >> >
> > > >> and
> > > >> 

Re: [Discussion] MXNet CMake build - raise minimal required version

2018-11-22 Thread Chris Olivier
yes that flag can be removed. profiler should always be built in.
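
[That design — compiled in unconditionally, gated by a runtime flag — costs
roughly one branch when off. A generic sketch of the pattern, not MXNet's
actual profiler code:]

#include <atomic>
#include <cstdint>

std::atomic<bool> profiler_on{false};

inline void RecordEvent(const char* name, uint64_t timestamp) {
    /* One relaxed load when profiling is off -- effectively free. */
    if (!profiler_on.load(std::memory_order_relaxed)) return;
    /* ... otherwise append (name, timestamp) to a trace buffer ... */
    (void)name;
    (void)timestamp;
}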

On Thu, Nov 22, 2018 at 7:44 AM Anton Chernov  wrote:

> You can find relevant information regarding the profiling flag here:
>
> https://github.com/apache/incubator-mxnet/issues/11563
>
>
> Thu, 22 Nov 2018 at 16:06, Chris Olivier :
>
> > what is meant by:
> >
> >
> > *Profiling*
> > The profiler is always on even for production release builds, because
> MXNet
> > cannot be built without it [2].  ?
> >
> > you mean it is always built or it is turned on (ie recording and saving
> > profiling information)?  I am not aware of it being turned on by default.
> >
> >
> > profiler has no overhead when built in but not turned on.
> >
> >
> > On Thu, Nov 22, 2018 at 2:35 AM Anton Chernov 
> wrote:
> >
> > > Dear MXNet community,
> > >
> > > I propose to raise the minimal required cmake version that is needed to
> > > build MXNet to 3.10 which was tagged on March 16 2018 [1].
> > >
> > > The effort of repairing the cmake scripts aims to deprecate
> > > make and maintain only one build system.
> > >
> > > *Need*
> > >
> > > The build system is the foundation of every software project. Its
> > > quality directly impacts the quality of the project. The MXNet build
> > > system is fragile, partially broken and not maintained.
> > >
> > > Users of MXNet and developers are confused by the fact that 2 build
> > systems
> > > exist at the same time: make and CMake.
> > >
> > > The main functional areas which are impacted by the current state of
> the
> > > cmake files are:
> > >
> > > *OpenMP*
> > > The current CMake files mix OpenMP libraries from different compilers,
> > > which is undefined behaviour. It leads to nondeterministic crashes on some
> > > platforms. Build and deployment are very hard. No evidence exists that
> > > proves that there is any benefit of having llvm OpenMP library as a
> > > submodule in MXNet.
> > >
> > > *BLAS and LAPACK*
> > > Basic math library usage is mixed up. It is hard and confusing to
> > > configure, and there is no logic for choosing the most optimal library.
> > > MKL and OpenBLAS are intermixed in an unpredictable manner.
> > >
> > > *Profiling*
> > > The profiler is always on even for production release builds, because
> > MXNet
> > > cannot be built without it [2].
> > >
> > > *CUDA*
> > > CUDA is detected by 3 different files in the current cmake scripts and
> > the
> > > choice of those is based on an obscure logic which involves different
> > > versions of cmake and the platform it's building on
> > >
> > > * CMakeLists.txt
> > > * cmake/FirstClassLangCuda.cmake
> > > * 3rdparty/mshadow/cmake/Cuda.cmake
> > >
> > >
> > > *Confusing and misleading cmake user options*
> > > For example, USE_CUDA / USE_OLDCMAKECUDA. Some of them will do or not
> do
> > > what they are supposed to based on cmake generator version and version of
> > cmake
> > > [3].
> > > There are currently more than 30 build parameters for MXNet, none of
> > > them documented. Some of them are not even located in the main CMakeLists.txt
> > file,
> > > for example 'BLAS'.
> > >
> > >
> > > *Issues*
> > > There is a significant number of GitHub issues related to cmake or
> > > build in general. New tickets are filed frequently.
> > >
> > > * #8702 (https://github.com/apache/incubator-mxnet/issues/8702)
> > >  [DISCUSSION] Should we deprecate Makefile and only use CMake?
> > > * #5079 (https://github.com/apache/incubator-mxnet/issues/5079)
> >  troubles
> > > building python interface on raspberry pi 3
> > > * #1722 (https://github.com/apache/incubator-mxnet/issues/1722)
> >  problem:
> > > compile mxnet with hdfs
> > > * #11549 (https://github.com/apache/incubator-mxnet/issues/11549) Pip
> > > package can be much faster (OpenCV version?)
> > > * #11417 (https://github.com/apache/incubator-mxnet/issues/11417)
> > > libomp.so
> > > dependency (need REAL fix)
> > > * #8532 (https://github.com/apache/incubator-mxnet/issues/8532)
> > >  mxnet-mkl
> > > (v0.12.0) crash when using (conda-installed) numpy with MKL //
> > (indir

Re: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Chris Olivier
Do you not work on CI mostly? My apologies for thinking that was some sort
of team effort between you and a few others who were passionate about
keeping the CI system running smoothly.

You have source code, you have the line the assertion is on. If you can’t
describe what’s going wrong that causes the assertion, then I don’t really
have anything more to add to this conversation beyond what’s below:

The whole “mixing omp libraries” is something that occurs in production
every day and certainly in everything that uses mkl.  It may occasionally
cause problems for some edge cases when there are super-complex linking
strategies and dynamic loading.  But this is not one of those edge cases.
Mostly, blaming this is a red herring for other thread-related problems:
people switch the omp library, the timing of their code changes, and they
stop seeing the problem. I’ve spent my entire career doing heavily
multithreaded C++ development and I’ve seen that a million times.  Is the
suggestion that libiomp be removed from mkl? have you spoken with intel?
have you consulted Intel at all?

and what you are seeing isn’t some “hard to debug random crash”. you’re
seeing an assertion which is probably related to omp trying to create a
thread pool after a fork and something was done in the mxnet code to make
that sketchy to do. I’d suggest filing an issue with the llvm openmp project,
just like you’d file one for any other not-well-understood behavior in mxnet.

The lack of root-causing the problem and knee-jerk solution here makes me
uncomfortable.

if you want to see performance differences there’s an environment variable
you can set in the mxnet omp tuning code that will print overhead and
execution times for the current omp library.
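
[One quick way to see which OpenMP runtimes actually ended up loaded in a
process — my own Linux/glibc sketch; unlike ldd it reflects what was really
loaded at runtime, including anything pulled in via dlopen:]

#include <link.h>
#include <cstdio>
#include <cstring>

int main() {
    /* Walk every loaded shared object and print those whose path
       mentions "omp" (libgomp, libomp, libiomp5, ...). */
    dl_iterate_phdr([](dl_phdr_info* info, size_t, void*) {
        if (info->dlpi_name && std::strstr(info->dlpi_name, "omp"))
            std::printf("%s\n", info->dlpi_name);
        return 0;
    }, nullptr);
    return 0;
}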







On Thu, Nov 22, 2018 at 7:12 AM Anton Chernov  wrote:

> Hi Chris,
>
> Thank you for your answer. As you may have noticed, the initial email comes
> from me, Anton Chernov (@lebeg on GitHub), and thus the proposal is not from
> any 'CI' team that you've mentioned, but from me personally.
>
> You are writing:
>
> > someone is doing something unhealthy when they fork ...
>
> I'm missing any context to understand what you mean.
>
> > we get a lot of performance gain from OMP ...
>
> There is no data that would prove this statement and therefore it is a
> random guess.
>
> > in many months, no investigation has occurred as to WHY the assertion is
> failing.
>
> The investigation has concluded that this is happening due to undefined
> behaviour, which is, in my opinion, a sufficient answer that does not
> require going any deeper.
>
> > the pr is vetoed until such a time that the actual root cause of the
> problem is known.
>
> And considering the statements above there is no valid reason to veto the
> PR.
>
>
> Best
> Anton
>
> Thu, 22 Nov 2018 at 15:38, Chris Olivier :
>
> > 3x less overhead*
> >
> > On Thu, Nov 22, 2018 at 6:25 AM Chris Olivier 
> > wrote:
> >
> > > someone is doing something unhealthy when they fork, which is causing
> an
> > > assertion in the openmp library. the same assertion that would fire in
> > mkl,
> > > which is linked to libiomp5 (exact same omp library). this is new
> > behavior
> > > and most likely due to an error or suboptimal approach in the forking
> > logic
> > > in mxnet.
> > >
> > > in order to circumvent the assert, the CI team is proposing to remove
> the
> > > library completely which is equivalent to cutting off your leg to make
> > the
> > > pain from stubbing your toe go away.
> > >
> > > we get a lot of performance gain from OMP. It has about 1/3 less
> > > overhead for entering omp regions and also supports omp regions after a
> > > fork, which libgomp does not.
> > >
> > > in many months, no investigation has occurred as to WHY the assertion
> is
> > > failing.
> > >
> > > the pr is vetoed until such a time that the actual root cause of the
> > > problem is known.
> > >
> > >
> > > thanks,
> > >
> > > -Chris.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov 
> > wrote:
> > >
> > >> Dear MXNet community,
> > >>
> > >> I would like to draw attention to an important issue that is present
> in
> > >> the MXNet CMake build: usage of bundled llvm OpenMP library.
> > >>
> > >> I have opened a PR to remove it:
> > >> https://github.com/apache/incubator-mxnet/pull/12160
> > >>
> > >> The issue was closed, but I am firm in my opinion that it's the
> right
> > >>

Re: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Chris Olivier
3x less overhead*

On Thu, Nov 22, 2018 at 6:25 AM Chris Olivier  wrote:

> someone is doing something unhealthy when they fork, which is causing an
> assertion in the openmp library. the same assertion that would fire in mkl,
> which is linked to libiomp5 (exact same omp library). this is new behavior
> and most likely due to an error or suboptimal approach in the forking logic
> in mxnet.
>
> in order to circumvent the assert, the CI team is proposing to remove the
> library completely which is equivalent to cutting off your leg to make the
> pain from stubbing your toe go away.
>
> we get a lot of performance gain from OMP. It has about 1/3 less
> overhead for entering omp regions and also supports omp regions after a
> fork, which libgomp does not.
>
> in many months, no investigation has occurred as to WHY the assertion is
> failing.
>
> the pr is vetoed until such a time that the actual root cause of the
> problem is known.
>
>
> thanks,
>
> -Chris.
>
>
>
>
> On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov  wrote:
>
>> Dear MXNet community,
>>
>> I would like to draw attention to an important issue that is present in
>> the MXNet CMake build: usage of bundled llvm OpenMP library.
>>
>> I have opened a PR to remove it:
>> https://github.com/apache/incubator-mxnet/pull/12160
>>
>> The issue was closed, but I am firm in my opinion that it's the right
>> thing to do.
>>
>> *Background*
>> If you want to use OpenMP pragmas in your code for parallelization you
>> would supply a special flag to the compiler:
>>
>> - Clang / -fopenmp
>> https://openmp.llvm.org/
>>
>> - GCC / -fopenmp
>> https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html
>>
>> - Intel / [Q]openmp
>>
>> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>>
>> - Visual Studio: /openmp (Enable OpenMP 2.0 Support)
>> https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
>>
>> Each of the compilers would enable the '#pragma omp' directive during
>> C/C++
>> compilation and arrange for automatic linking of the OpenMP runtime
>> library
>> supplied by each compiler separately.
>>
>> Thus, to use the advantages of an OpenMP implementation one has to compile
>> the code with the corresponding compiler.
>>
>> Currently, in MXNet CMake build scripts a bundled version of llvm OpenMP
>> is
>> used ([1] and [2]) to replace the OpenMP library supplied by the compiler.
>>
>> I will quote here the README from the MKL-DNN (Intel(R) Math Kernel
>> Library
>> for Deep Neural Networks):
>>
>> "Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP runtime
>> library to work. As different OpenMP runtimes may not be binary compatible
>> it's important to ensure that only one OpenMP runtime is used throughout
>> the application. Having more than one OpenMP runtime initialized may lead
>> to undefined behavior resulting in incorrect results or crashes." [3]
>>
>> And:
>>
>> "Using GNU compiler with -fopenmp and -liomp5 options will link the
>> application with both Intel and GNU OpenMP runtime libraries. This will
>> lead to undefined behavior of the application." [4]
>>
>> As can be seen from ldd for MXNet:
>>
>> $ ldd build/tests/mxnet_unit_tests | grep omp
>> libomp.so => /.../mxnet/build/3rdparty/openmp/runtime/src/libomp.so
>> (0x7f697bc55000)
>> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
>> (0x7f69660cd000)
>>
>> *Performance*
>>
>> The only performance data related to OpenMP in MXNet I was able to find is
>> here:
>>
>> https://github.com/apache/incubator-mxnet/issues/9744#issuecomment-367711172
>>
>> Which in my understanding is testing the impact of different environment
>> variables for the same setup (using same bundled OpenMP library).
>>
>> The libraries may differ in implementation and the Thread Affinity
>> Interface [5] may have significant impact on performance.
>>
>> All compilers support it:
>>
>> - Clang / KMP_AFFINITY
>>
>> https://github.com/clang-ykt/openmp/blob/master/runtime/src/kmp_affinity.cpp
>>
>> - GCC / GOMP_CPU_AFFINITY
>>
>> https://gcc.gnu.org/onlinedocs/gcc-4.7.1/libgomp/GOMP_005fCPU_005fAFFINITY.html
>>
>> - Intel / KMP_AFFINITY
>>
>> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>>
>> - Visual Studio / SetThreadAffinityMask
>>
>> https

Re: [Discussion] MXNet CMake build - raise minimal required version

2018-11-22 Thread Chris Olivier
I have not seen any proof that any crashes are due to llvm openmp usage.

On Thu, Nov 22, 2018 at 2:35 AM Anton Chernov  wrote:

> Dear MXNet community,
>
> I propose to raise the minimal required cmake version that is needed to
> build MXNet to 3.10 which was tagged on March 16 2018 [1].
>
> The effort of repairing the cmake scripts aims to deprecate
> make and maintain only one build system.
>
> *Need*
>
> The build system is the foundation of every software project. Its quality
> directly impacts the quality of the project. The MXNet build system is
> fragile, partially broken and not maintained.
>
> Users of MXNet and developers are confused by the fact that 2 build systems
> exist at the same time: make and CMake.
>
> The main functional areas which are impacted by the current state of the
> cmake files are:
>
> *OpenMP*
> The current CMake files mix OpenMP libraries from different compilers, which
> is undefined behaviour. It leads to nondeterministic crashes on some
> platforms. Build and deployment are very hard. No evidence exists that
> proves that there is any benefit of having llvm OpenMP library as a
> submodule in MXNet.
>
> *BLAS and LAPACK*
> Basic math library usage is mixed up. It is hard and confusing to configure,
> and there is no logic for choosing the most optimal library. MKL and
> OpenBLAS are intermixed in an unpredictable manner.
>
> *Profiling*
> The profiler is always on even for production release builds, because MXNet
> cannot be built without it [2].
>
> *CUDA*
> CUDA is detected by 3 different files in the current cmake scripts and the
> choice of those is based on an obscure logic which involves different
> versions of cmake and the platform it's building on
>
> * CMakeLists.txt
> * cmake/FirstClassLangCuda.cmake
> * 3rdparty/mshadow/cmake/Cuda.cmake
>
>
> *Confusing and misleading cmake user options*
> For example, USE_CUDA / USE_OLDCMAKECUDA. Some of them will do or not do
> what they are supposed to based on cmake generator version and version of cmake
> [3].
> There are currently more than 30 build parameters for MXNet, none of them
> documented. Some of them are not even located in the main CMakeLists.txt file,
> for example 'BLAS'.
>
>
> *Issues*
> There is a significant number of GitHub issues related to cmake or build in
> general. New tickets are filed frequently.
>
> * #8702 (https://github.com/apache/incubator-mxnet/issues/8702)
>  [DISCUSSION] Should we deprecate Makefile and only use CMake?
> * #5079 (https://github.com/apache/incubator-mxnet/issues/5079)   troubles
> building python interface on raspberry pi 3
> * #1722 (https://github.com/apache/incubator-mxnet/issues/1722)   problem:
> compile mxnet with hdfs
> * #11549 (https://github.com/apache/incubator-mxnet/issues/11549) Pip
> package can be much faster (OpenCV version?)
> * #11417 (https://github.com/apache/incubator-mxnet/issues/11417)
> libomp.so
> dependency (need REAL fix)
> * #8532 (https://github.com/apache/incubator-mxnet/issues/8532)
>  mxnet-mkl
> (v0.12.0) crash when using (conda-installed) numpy with MKL // (indirectly)
> * #11131 (https://github.com/apache/incubator-mxnet/issues/11131)
> mxnet-cu92 low efficiency  // (indirectly)
> * #10743 (https://github.com/apache/incubator-mxnet/issues/10743) CUDA
> 9.1.xx failed if not set OLDCMAKECUDA on cmake 3.10.3 with unix makefile or
> Ninja generator
> * #10742 (https://github.com/apache/incubator-mxnet/issues/10742) typo in
> cpp-package/CMakeLists.txt
> * #10737 (https://github.com/apache/incubator-mxnet/issues/10737) Cmake is
> running again when execute make install
> * #10543 (https://github.com/apache/incubator-mxnet/issues/10543) Failed
> to
> build from source when set USE_CPP_PACKAGE = 1, fatal error C1083: unabel
> to open file: “mxnet-cpp/op.h”: No such file or directory
> * #10217 (https://github.com/apache/incubator-mxnet/issues/10217) Building
> with OpenCV causes link errors
> * #10175 (https://github.com/apache/incubator-mxnet/issues/10175) MXNet
> MKLDNN build dependency/flow discussion
> * #10009 (https://github.com/apache/incubator-mxnet/issues/10009)
> [CMAKE][IoT] Remove pthread from android_arm64 build
> * #9944 (https://github.com/apache/incubator-mxnet/issues/9944)   MXNet
> MinGW-w64 build error // (indirectly)
> * #9868 (https://github.com/apache/incubator-mxnet/issues/9868)   MKL and
> CMake
> * #9516 (https://github.com/apache/incubator-mxnet/issues/9516)   cmake
> cuda arch issues
> * #9105 (https://github.com/apache/incubator-mxnet/issues/9105)
>  libmxnet.so load path error
> * #9096 (https://github.com/apache/incubator-mxnet/issues/9096)   MXNet
> built with GPerftools crashes
> * #8786 (https://github.com/apache/incubator-mxnet/issues/8786)   Link
> failure on DEBUG=1 (static member symbol not defined) // (indirectly)
> * #8729 (https://github.com/apache/incubator-mxnet/issues/8729)   Build
> amalgamation using a docker // (indirectly)
> * #8667 

Re: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Chris Olivier
someone is doing something unhealthy when they fork, which is causing an
assertion in the openmp library. the same assertion that would fire in mkl,
which is linked to libiomp5 (exact same omp library). this is new behavior
and most likely due to an error or suboptimal approach in the forking logic
in mxnet.

in order to circumvent the assert, the CI team is proposing to remove the
library completely which is equivalent to cutting off your leg to make the
pain from stubbing your toe go away.

we get a lot of performance gain from OMP. It has about 1/3 less overhead
for entering omp regions and also supports omp regions after a fork, which
libgomp does not.

in many months, no investigation has occurred as to WHY the assertion is
failing.

the pr is vetoed until such a time that the actual root cause of the
problem is known.


thanks,

-Chris.
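
[The overhead claim is easy to probe by timing an empty parallel region
against each runtime — a rough sketch of mine, not a rigorous benchmark:]

#include <omp.h>
#include <cstdio>

int main() {
    const int iters = 100000;
    double t0 = omp_get_wtime();
    for (int i = 0; i < iters; ++i) {
        #pragma omp parallel
        { /* empty: only region entry/exit is measured */ }
    }
    double t1 = omp_get_wtime();
    std::printf("%.3f us per empty parallel region\n",
                (t1 - t0) * 1e6 / iters);
    return 0;
}

[Build it once against libgomp and once against the bundled llvm libomp and
compare the numbers on the same machine.]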




On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov  wrote:

> Dear MXNet community,
>
> I would like to draw attention to an important issue that is present in
> the MXNet CMake build: usage of bundled llvm OpenMP library.
>
> I have opened a PR to remove it:
> https://github.com/apache/incubator-mxnet/pull/12160
>
> The issue was closed, but I am firm in my opinion that it's the right
> thing to do.
>
> *Background*
> If you want to use OpenMP pragmas in your code for parallelization you
> would supply a special flag to the compiler:
>
> - Clang / -fopenmp
> https://openmp.llvm.org/
>
> - GCC / -fopenmp
> https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html
>
> - Intel / [Q]openmp
>
> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>
> - Visual Studio: /openmp (Enable OpenMP 2.0 Support)
> https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
>
> Each of the compilers would enable the '#pragma omp' directive during C/C++
> compilation and arrange for automatic linking of the OpenMP runtime library
> supplied by each compiler separately.
>
> Thus, to use the advantages of an OpenMP implementation one has to compile
> the code with the corresponding compiler.
>
> Currently, in MXNet CMake build scripts a bundled version of llvm OpenMP is
> used ([1] and [2]) to replace the OpenMP library supplied by the compiler.
>
> I will quote here the README from the MKL-DNN (Intel(R) Math Kernel Library
> for Deep Neural Networks):
>
> "Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP runtime
> library to work. As different OpenMP runtimes may not be binary compatible
> it's important to ensure that only one OpenMP runtime is used throughout
> the application. Having more than one OpenMP runtime initialized may lead
> to undefined behavior resulting in incorrect results or crashes." [3]
>
> And:
>
> "Using GNU compiler with -fopenmp and -liomp5 options will link the
> application with both Intel and GNU OpenMP runtime libraries. This will
> lead to undefined behavior of the application." [4]
>
> As can be seen from ldd for MXNet:
>
> $ ldd build/tests/mxnet_unit_tests | grep omp
> libomp.so => /.../mxnet/build/3rdparty/openmp/runtime/src/libomp.so
> (0x7f697bc55000)
> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> (0x7f69660cd000)
>
> *Performance*
>
> The only performance data related to OpenMP in MXNet I was able to find is
> here:
>
> https://github.com/apache/incubator-mxnet/issues/9744#issuecomment-367711172
>
> Which in my understanding is testing the impact of different environment
> variables for the same setup (using same bundled OpenMP library).
>
> The libraries may differ in implementation and the Thread Affinity
> Interface [5] may have significant impact on performance.
>
> All compilers support it:
>
> - Clang / KMP_AFFINITY
>
> https://github.com/clang-ykt/openmp/blob/master/runtime/src/kmp_affinity.cpp
>
> - GCC / GOMP_CPU_AFFINITY
>
> https://gcc.gnu.org/onlinedocs/gcc-4.7.1/libgomp/GOMP_005fCPU_005fAFFINITY.html
>
> - Intel / KMP_AFFINITY
>
> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>
> - Visual Studio / SetThreadAffinityMask
>
> https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-setthreadaffinitymask
>
> *Issues*
>
> Failed OpenMP assertion when loading MXNet compiled with DEBUG=1
> https://github.com/apache/incubator-mxnet/issues/10856
>
> libomp.so dependency (need REAL fix)
> https://github.com/apache/incubator-mxnet/issues/11417
>
> mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL
> https://github.com/apache/incubator-mxnet/issues/8532
>
> Performance regression when OMP_NUM_THREADS environment variable is not set
> https://github.com/apache/incubator-mxnet/issues/9744
>
> Poor concat CPU performance on CUDA builds
> https://github.com/apache/incubator-mxnet/issues/11905
>
> I would appreciate hearing your thoughts.
>
>
> Best
> Anton
>
> [1]
>
> https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L400-L405
> [2] https://github.com/apache/incubator-mxnet/tree/master/3rdparty
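
[To make the compile-with-the-matching-runtime point concrete, a minimal
program with the flags listed above; the compiler/runtime pairings follow
the links in the email, and this is a sketch rather than MXNet build advice:]

/* omp_hello.cc
 *   g++     -fopenmp omp_hello.cc    (links libgomp)
 *   clang++ -fopenmp omp_hello.cc    (links libomp)
 *   icpc    -qopenmp omp_hello.cc    (links libiomp5)
 */
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    std::printf("thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

[Affinity is then set per runtime, e.g. GOMP_CPU_AFFINITY="0-3" ./a.out with
libgomp, or KMP_AFFINITY=granularity=fine,compact ./a.out with
libomp/libiomp5.]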

Re: [VOTE] - Adopt "Become a Committer and PPMC Member" Document

2018-11-01 Thread Chris Olivier
+1

On Thu, Nov 1, 2018 at 6:08 AM Carin Meier  wrote:

> Reminder - vote ends tomorrow morning at 6:00 am EST
>
> On Mon, Oct 29, 2018 at 6:46 PM Carin Meier  wrote:
>
> > This vote is to adopt the document
> >
> https://cwiki.apache.org/confluence/display/MXNET/Become+an+Apache+MXNet+%28incubating%29+Committer+and+PPMC+Member+Proposal
> > to replace the current document
> > https://cwiki.apache.org/confluence/display/MXNET/Becoming+a+Committer
> >
> > The dev discussion thread is here
> >
> https://lists.apache.org/thread.html/e61ffa26af374de7a99c475d406e462a00b26cfc1155e232198dd53e@%3Cdev.mxnet.apache.org%3E
> >
> > The vote will be a procedural issue vote as defined
> > https://www.apache.org/foundation/voting.html
> >
> > Votes on procedural issues follow the common format of majority rule
> > unless otherwise stated. That is, if there are more favourable votes than
> > unfavourable ones, the issue is considered to have passed -- regardless
> of
> > the number of votes in each category. (If the number of votes seems too
> > small to be representative of a community consensus, the issue is
> typically
> > not pursued. However, see the description of lazy consensus
> >  for a
> > modifying factor.)
> >
> > The vote will run until Friday Nov 2nd at 6:00 am EST
> >
> > Thanks,
> > Carin
> >
> >
>


Re: [VOTE] - Adopt "Become a Committer and PPMC Member" Document

2018-10-29 Thread Chris Olivier
well, if something needs consensus to pass, then saying “you need to keep
discussing until consensus is reached” seems like it could be abused by
someone who was just unwilling to accept a verdict and kept pushing,
right? And if someone were to walk away saying “I don’t want to discuss
this any further”, which is fair in that situation, then they’re the “bad
guy”? While it sounds like a noble pursuit, I just feel like this could be
abused.

On Mon, Oct 29, 2018 at 5:53 PM Carin Meier  wrote:

> Chris,
>
> Is there are rewording that you would find more acceptable? Again, we can
> have more time to edit and revise the document. There is not a time limit
> on this. I might have been too hasty to start the vote thinking the
> discussion was wrapped up.
>
> - Carin
>
> On Mon, Oct 29, 2018 at 8:50 PM Chris Olivier 
> wrote:
>
> > Or another example: if something is downvoted, this also implies that after
> > a vote is over, it’s appropriate to continue pushing the subject, trying
> to
> > just wear everyone down even though the outcome is clear. We’ve seen this
> > before, actually.
> >
> > On Mon, Oct 29, 2018 at 5:41 PM Chris Olivier 
> > wrote:
> >
> > > -1 “strive to meet consensus”? This seems to imply the consensus is the
> > > natural expected state. So in the case where someone submits that we
> > should
> > > start a nuclear war, then our bylaws would state that we should all try
> > to
> > > agree to start a nuclear war.
> > >
> > > On Mon, Oct 29, 2018 at 4:41 PM Tianqi Chen  wrote:
> > >
> > >> Hi Carin:
> > >> Sorry for the last minute request, but given the way we write down
> > the
> > >> PMC, committer privileges, I feel we need to add an additional line:
> > >>
> > >>- "PMC/committer should strive to be diplomatic and reach consensus
> > >> with
> > >> discussion when possible."
> > >>
> > >>Since I don't really want us to give an impression of abusing veto
> > >> rights.
> > >>
> > >> Thanks!
> > >> Tianqi
> > >>
> > >> On Mon, Oct 29, 2018 at 3:47 PM Carin Meier 
> > wrote:
> > >>
> > >> > This vote is to adopt the document
> > >> >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/MXNET/Become+an+Apache+MXNet+%28incubating%29+Committer+and+PPMC+Member+Proposal
> > >> > to replace the current document
> > >> >
> > https://cwiki.apache.org/confluence/display/MXNET/Becoming+a+Committer
> > >> >
> > >> > The dev discussion thread is here
> > >> >
> > >> >
> > >>
> >
> https://lists.apache.org/thread.html/e61ffa26af374de7a99c475d406e462a00b26cfc1155e232198dd53e@%3Cdev.mxnet.apache.org%3E
> > >> >
> > >> > The vote will be a procedural issue vote as defined
> > >> > https://www.apache.org/foundation/voting.html
> > >> >
> > >> > Votes on procedural issues follow the common format of majority rule
> > >> unless
> > >> > otherwise stated. That is, if there are more favourable votes than
> > >> > unfavourable ones, the issue is considered to have passed --
> > regardless
> > >> of
> > >> > the number of votes in each category. (If the number of votes seems
> > too
> > >> > small to be representative of a community consensus, the issue is
> > >> typically
> > >> > not pursued. However, see the description of lazy consensus
> > >> > <https://www.apache.org/foundation/voting.html#LazyConsensus> for a
> > >> > modifying factor.)
> > >> >
> > >> > The vote will run until Friday Nov 2nd at 6:00 am EST
> > >> >
> > >> > Thanks,
> > >> > Carin
> > >> >
> > >>
> > >
> >
>


Re: [VOTE] - Adopt "Become a Committer and PPMC Member" Document

2018-10-29 Thread Chris Olivier
-0 but keep it in if you want

On Mon, Oct 29, 2018 at 5:50 PM Chris Olivier  wrote:

> Or another example: if something is downvoted, this also implies that after
> a vote is over, it’s appropriate to continue pushing the subject, trying to
> just wear everyone down even though the outcome is clear. We’ve seen this
> before, actually.
>
> On Mon, Oct 29, 2018 at 5:41 PM Chris Olivier 
> wrote:
>
>> -1 “strive to meet consensus”? This seems to imply the consensus is the
>> natural expected state. So in the case where someone submits that we should
>> start a nuclear war, then our bylaws would state that we should all try to
>> agree to start a nuclear war.
>>
>> On Mon, Oct 29, 2018 at 4:41 PM Tianqi Chen  wrote:
>>
>>> Hi Carin:
>>> Sorry for the last minute request, but given the way we write down
>>> the
>>> PMC, committer privileges, I feel we need to add an additional line:
>>>
>>>- "PMC/committer should strive to be diplomatic and reach consensus
>>> with
>>> discussion when possible."
>>>
>>>Since I don't really want us to give an impression of abusing veto
>>> rights.
>>>
>>> Thanks!
>>> Tianqi
>>>
>>> On Mon, Oct 29, 2018 at 3:47 PM Carin Meier 
>>> wrote:
>>>
>>> > This vote is to adopt the document
>>> >
>>> >
>>> https://cwiki.apache.org/confluence/display/MXNET/Become+an+Apache+MXNet+%28incubating%29+Committer+and+PPMC+Member+Proposal
>>> > to replace the current document
>>> > https://cwiki.apache.org/confluence/display/MXNET/Becoming+a+Committer
>>> >
>>> > The dev discussion thread is here
>>> >
>>> >
>>> https://lists.apache.org/thread.html/e61ffa26af374de7a99c475d406e462a00b26cfc1155e232198dd53e@%3Cdev.mxnet.apache.org%3E
>>> >
>>> > The vote will be a procedural issue vote as defined
>>> > https://www.apache.org/foundation/voting.html
>>> >
>>> > Votes on procedural issues follow the common format of majority rule
>>> unless
>>> > otherwise stated. That is, if there are more favourable votes than
>>> > unfavourable ones, the issue is considered to have passed --
>>> regardless of
>>> > the number of votes in each category. (If the number of votes seems too
>>> > small to be representative of a community consensus, the issue is
>>> typically
>>> > not pursued. However, see the description of lazy consensus
>>> > <https://www.apache.org/foundation/voting.html#LazyConsensus> for a
>>> > modifying factor.)
>>> >
>>> > The vote will run until Friday Nov 2nd at 6:00 am EST
>>> >
>>> > Thanks,
>>> > Carin
>>> >
>>>
>>


Re: [VOTE] - Adopt "Become a Committer and PPMC Member" Document

2018-10-29 Thread Chris Olivier
Or another example: if something is downvoted, this also implies that after
a vote is over, it’s appropriate to continue pushing the subject, trying to
just wear everyone down even though the outcome is clear. We’ve seen this
before, actually.

On Mon, Oct 29, 2018 at 5:41 PM Chris Olivier  wrote:

> -1 “strive to meet consensus”? This seems to imply the consensus is the
> natural expected state. So in the case where someone submits that we should
> start a nuclear war, then our bylaws would state that we should all try to
> agree to start a nuclear war.
>
> On Mon, Oct 29, 2018 at 4:41 PM Tianqi Chen  wrote:
>
>> Hi Carin:
>> Sorry for the last minute request, but given the way we write down the
>> PMC, committer privileges, I feel we need to add an additional line:
>>
>>- "PMC/committer should strive to be diplomatic and reach consensus
>> with
>> discussion when possible."
>>
>>Since I don't really want us to give an impression of abusing veto
>> rights.
>>
>> Thanks!
>> Tianqi
>>
>> On Mon, Oct 29, 2018 at 3:47 PM Carin Meier  wrote:
>>
>> > This vote is to adopt the document
>> >
>> >
>> https://cwiki.apache.org/confluence/display/MXNET/Become+an+Apache+MXNet+%28incubating%29+Committer+and+PPMC+Member+Proposal
>> > to replace the current document
>> > https://cwiki.apache.org/confluence/display/MXNET/Becoming+a+Committer
>> >
>> > The dev discussion thread is here
>> >
>> >
>> https://lists.apache.org/thread.html/e61ffa26af374de7a99c475d406e462a00b26cfc1155e232198dd53e@%3Cdev.mxnet.apache.org%3E
>> >
>> > The vote will be a procedural issue vote as defined
>> > https://www.apache.org/foundation/voting.html
>> >
>> > Votes on procedural issues follow the common format of majority rule
>> unless
>> > otherwise stated. That is, if there are more favourable votes than
>> > unfavourable ones, the issue is considered to have passed -- regardless
>> of
>> > the number of votes in each category. (If the number of votes seems too
>> > small to be representative of a community consensus, the issue is
>> typically
>> > not pursued. However, see the description of lazy consensus
>> > <https://www.apache.org/foundation/voting.html#LazyConsensus> for a
>> > modifying factor.)
>> >
>> > The vote will run until Friday Nov 2nd at 6:00 am EST
>> >
>> > Thanks,
>> > Carin
>> >
>>
>


Re: [VOTE] - Adopt "Become a Committer and PPMC Member" Document

2018-10-29 Thread Chris Olivier
-1 “strive to meet consensus”? This seems to imply the consensus is the
natural expected state. So in the case where someone submits that we should
start a nuclear war, then our bylaws would state that we should all try to
agree to start a nuclear war.

On Mon, Oct 29, 2018 at 4:41 PM Tianqi Chen  wrote:

> Hi Carin:
> Sorry for the last minute request, but given the way we write down the
> PMC, committer privileges, I feel we need to add an additional line:
>
>- "PMC/committer should strive to be diplomatic and reach consensus with
> discussion when possible."
>
>Since I don't really want us to give an impression of abusing veto
> rights.
>
> Thanks!
> Tianqi
>
> On Mon, Oct 29, 2018 at 3:47 PM Carin Meier  wrote:
>
> > This vote is to adopt the document
> >
> >
> https://cwiki.apache.org/confluence/display/MXNET/Become+an+Apache+MXNet+%28incubating%29+Committer+and+PPMC+Member+Proposal
> > to replace the current document
> > https://cwiki.apache.org/confluence/display/MXNET/Becoming+a+Committer
> >
> > The dev discussion thread is here
> >
> >
> https://lists.apache.org/thread.html/e61ffa26af374de7a99c475d406e462a00b26cfc1155e232198dd53e@%3Cdev.mxnet.apache.org%3E
> >
> > The vote will be a procedural issue vote as defined
> > https://www.apache.org/foundation/voting.html
> >
> > Votes on procedural issues follow the common format of majority rule
> unless
> > otherwise stated. That is, if there are more favourable votes than
> > unfavourable ones, the issue is considered to have passed -- regardless
> of
> > the number of votes in each category. (If the number of votes seems too
> > small to be representative of a community consensus, the issue is
> typically
> > not pursued. However, see the description of lazy consensus
> >  for a
> > modifying factor.)
> >
> > The vote will run until Friday Nov 2nd at 6:00 am EST
> >
> > Thanks,
> > Carin
> >
>


Re: [Discussion] Recognise Reviewers, Besides Committers and PMC

2018-10-23 Thread Chris Olivier
It's not my intent to kill this (I just asked a question).  What is the
mentors' input?

On Mon, Oct 22, 2018 at 3:50 PM Tianqi Chen  wrote:

> To be clear, we are not splitting the committers into reviewers, we are
> recognizing an additional set of contributors who could become potential
> committers and recognizing them as committers
>
> Tianqi
>
> On Mon, Oct 22, 2018 at 3:23 PM Chris Olivier 
> wrote:
>
> > Are there any other major Apache projects which have this designation?  I
> > am always continually suspicious of efforts to reinvent Apache rules from
> > other non-Apache projects, when Apache projects have historically been
> > quite successful within the Apache platform.  In fact, operating outside
> of
> > Apache norms is already a major problem as everyone knows.  We are only
> > just now splitting Committer/PMC into two separate groups. Splitting into
> > three seems a bit much at this juncture unless there are some good
> > precedents.
> >
> >
> >
> >
> > On Mon, Oct 22, 2018 at 2:17 PM Tianqi Chen  wrote:
> >
> > > The situation most projects are facing (including us) is a lack of code
> > > reviews. Code reviews are the most important part of the project, and
> > > high-quality reviews are extremely time-consuming, maybe as much so
> > > as the code itself. Usually, it is only committers who do the code reviews;
> > > the code reviews from committers are important, as they serve as
> > > the gate-keepers of the quality of the code.  In my experience, I
> > > usually find the reviews from non-committers super helpful, and they
> > > help the committer to catch problems that are otherwise overlooked.
> > >
> > > However, it is very hard to get contributors to do code reviews unless
> we
> > > solicit them. It is definitely harder than getting code contributions.
> > The
> > > Reviewer mechanism could provide a way to do so. We can recognize
> > > contributors, bring them as reviewers and encourage them to do the code
> > > reviews by explicitly soliciting. The reviewers can learn from the
> > > committer reviews,
> > > which serves as a role model for what is being expected. Naturally,
> this
> > > likely helps us find more good reviewers and bring them to committership.
> > >
> > > Cheers
> > > Tianqi
> > >
> > > On Mon, Oct 22, 2018 at 1:09 PM Anirudh  wrote:
> > >
> > > > -1. I don't see the need for an additional level of hierarchy. I totally
> am
> > > for
> > > > recognizing good code reviewers. We can recognize this by making them
> > > > committers. Being a good reviewer should be sufficient to become a
> > > > committer in my opinion. (Assuming that there is a separation between
> > > PPMC
> > > > and committers).
> > > >
> > > > Anirudh
> > > >
> > > > On Mon, Oct 22, 2018 at 8:28 AM Qing Lan 
> wrote:
> > > >
> > > > > +1
> > > > > Let's have a reviewer list somewhere with a certain format: such as
> > > C++,
> > > > > Gluon, Scala/Java based on language or some other category, etc. In
> > the
> > > > > future, label bot would automatically assign reviewers based on
> this
> > > kind
> > > > > of documentation.
> > > > >
> > > > > Thanks,
> > > > > Qing
> > > > >
> > > > > On 10/21/18, 11:44 PM, "YiZhi Liu"  wrote:
> > > > >
> > > > > +1
> > > > > I also suggest add reviewer list link to the PR template, so
> that
> > > > > developers can easily request review from those reviewers.
> > > > > On Sun, Oct 21, 2018 at 8:30 PM Tianqi Chen  >
> > > > wrote:
> > > > > >
> > > > > > I was suggesting something more concrete:
> > > > > >
> > > > > > - Add a Reviewers section to
> > > > > >
> > > > >
> > https://github.com/apache/incubator-mxnet/blob/master/CONTRIBUTORS.md
> > > to
> > > > > > list a list of Reviewers.
> > > > > > - This is a "pseudo role", but holds weight as committers
> > > > should
> > > > > highly
> > > > > > value their reviews during the PR process.
> > > > > > - The committers/PMC could actively look for good
> contributors
>

Re: [Discussion] Recognise Reviewers, Besides Committers and PMC

2018-10-22 Thread Chris Olivier
I think by your last word you meant "reviewers", right?
yeah, this was also my understanding. A new "below-committer" level called
"reviewer".  So 3 levels now...

On Mon, Oct 22, 2018 at 3:50 PM Tianqi Chen  wrote:

> To be clear, we are not splitting the committers into reviewers, we are
> recognizing an additional set of contributors who could become potential
> committers and recognizing them as committers
>
> Tianqi
>
> On Mon, Oct 22, 2018 at 3:23 PM Chris Olivier 
> wrote:
>
> > Are there any other major Apache projects which have this designation?  I
> > am always continually suspicious of efforts to reinvent Apache rules from
> > other non-Apache projects, when Apache projects have historically been
> > quite successful within the Apache platform.  In fact, operating outside
> of
> > Apache norms is already a major problem as everyone knows.  We are only
> > just now splitting Committer/PMC into two separate groups. Splitting into
> > three seems a bit much at this juncture unless there are some good
> > precedents.
> >
> >
> >
> >
> > On Mon, Oct 22, 2018 at 2:17 PM Tianqi Chen  wrote:
> >
> > > The situation most projects are facing (including us) is a lack of code
> > > reviews. Code reviews are the most important part of the project, and
> > > high-quality reviews are extremely time-consuming, maybe as much so
> > > as the code itself. Usually, it is only committers who do the code reviews;
> > > the code reviews from committers are important, as they serve as
> > > the gate-keepers of the quality of the code.  In my experience, I
> > > usually find the reviews from non-committers super helpful, and they
> > > help the committer to catch problems that are otherwise overlooked.
> > >
> > > However, it is very hard to get contributors to do code reviews unless
> we
> > > solicit them. It is definitely harder than getting code contributions.
> > The
> > > Reviewer mechanism could provide a way to do so. We can recognize
> > > contributors, bring them as reviewers and encourage them to do the code
> > > reviews by explicitly soliciting. The reviewers can learn from the
> > > committer reviews,
> > > which serves as a role model for what is being expected. Naturally,
> this
> > > likely helps us find more good reviewers and bring them to committership.
> > >
> > > Cheers
> > > Tianqi
> > >
> > > On Mon, Oct 22, 2018 at 1:09 PM Anirudh  wrote:
> > >
> > > > -1. I don't see the need for an additional level of hierarchy. I totally
> am
> > > for
> > > > recognizing good code reviewers. We can recognize this by making them
> > > > committers. Being a good reviewer should be sufficient to become a
> > > > committer in my opinion. (Assuming that there is a separation between
> > > PPMC
> > > > and committers).
> > > >
> > > > Anirudh
> > > >
> > > > On Mon, Oct 22, 2018 at 8:28 AM Qing Lan 
> wrote:
> > > >
> > > > > +1
> > > > > Let's have a reviewer list somewhere with a certain format: such as
> > > C++,
> > > > > Gluon, Scala/Java based on language or some other category, etc. In
> > the
> > > > > future, label bot would automatically assign reviewers based on
> this
> > > kind
> > > > > of documentation.
> > > > >
> > > > > Thanks,
> > > > > Qing
> > > > >
> > > > > On 10/21/18, 11:44 PM, "YiZhi Liu"  wrote:
> > > > >
> > > > > +1
> > > > > I also suggest add reviewer list link to the PR template, so
> that
> > > > > developers can easily request review from those reviewers.
> > > > > On Sun, Oct 21, 2018 at 8:30 PM Tianqi Chen  >
> > > > wrote:
> > > > > >
> > > > > > I was suggesting something more concrete:
> > > > > >
> > > > > > - Add a Reviewers section to
> > > > > >
> > > > >
> > https://github.com/apache/incubator-mxnet/blob/master/CONTRIBUTORS.md
> > > to
> > > > > > list a list of Reviewers.
> > > > > > - This is a "pseudo role", but holds weight as committers
> > > > should
> > > > > highly
> > > > > > value their reviews during the PR process.
> >

Re: [Discussion] Recognise Reviewers, Besides Committers and PMC

2018-10-22 Thread Chris Olivier
Are there any other major Apache projects which have this designation?  I
am always continually suspicious of efforts to reinvent Apache rules from
other non-Apache projects, when Apache projects have historically been
quite successful within the Apache platform.  In fact, operating outside of
Apache norms is already a major problem as everyone knows.  We are only
just now splitting Committer/PMC into two separate groups. Splitting into
three seems a bit much at this juncture unless there are some good precedents.




On Mon, Oct 22, 2018 at 2:17 PM Tianqi Chen  wrote:

> The situation most projects are facing (including us) is a lack of code
> reviews. Code reviews are the most important part of the project, and
> high-quality reviews are extremely time-consuming, maybe as much so
> as the code itself. Usually, it is only committers who do the code reviews;
> the code reviews from committers are important, as they serve as
> the gate-keepers of the quality of the code.  In my experience, I
> usually find the reviews from non-committers super helpful, and they
> help the committer to catch problems that are otherwise overlooked.
>
> However, it is very hard to get contributors to do code reviews unless we
> solicit them. It is definitely harder than getting code contributions.  The
> Reviewer mechanism could provide a way to do so. We can recognize
> contributors, bring them as reviewers and encourage them to do the code
> reviews by explicitly soliciting. The reviewers can learn from the
> committer reviews,
> which serves as a role model for what is being expected. Naturally, this
> likely helps us find more good reviewers and bring them to committership.
>
> Cheers
> Tianqi
>
> On Mon, Oct 22, 2018 at 1:09 PM Anirudh  wrote:
>
> > -1. I don't see the need for an additional level of hierarchy. I totally am
> for
> > recognizing good code reviewers. We can recognize this by making them
> > committers. Being a good reviewer should be sufficient to become a
> > committer in my opinion. (Assuming that there is a separation between
> PPMC
> > and committers).
> >
> > Anirudh
> >
> > On Mon, Oct 22, 2018 at 8:28 AM Qing Lan  wrote:
> >
> > > +1
> > > Let's have a reviewer list somewhere with a certain format: such as
> C++,
> > > Gluon, Scala/Java based on language or some other category, etc. In the
> > > future, label bot would automatically assign reviewers based on this
> kind
> > > of documentation.
> > >
> > > Thanks,
> > > Qing
> > >
> > > On 10/21/18, 11:44 PM, "YiZhi Liu"  wrote:
> > >
> > > +1
> > > I also suggest add reviewer list link to the PR template, so that
> > > developers can easily request review from those reviewers.
> > > On Sun, Oct 21, 2018 at 8:30 PM Tianqi Chen 
> > wrote:
> > > >
> > > > I was suggesting something more concrete:
> > > >
> > > > - Add a Reviewers section to
> > > >
> > > https://github.com/apache/incubator-mxnet/blob/master/CONTRIBUTORS.md
> to
> > > > list the Reviewers.
> > > > - This is a "pseudo role", but it holds weight, as committers
> > > > should highly value their reviews during the PR process.
> > > > - The committers/PMC could actively look for good contributors
> and
> > > nominate
> > > > them as Reviewer.
> > > > - Contributors are encouraged to seek reviews from the list of
> > > reviewers.
> > > > - The committers should actively solicit code reviews from the
> > > reviewers
> > > > when reviewing PRs and take their reviews into serious
> > consideration.
> > > >
> > > > - PMCs should actively look for new committers in the current
> > > Reviewers
> > > >- Notably, the review history plus contributions will likely
> > > > provide a good indication of whether the person can uphold the
> > > > quality standard of the codebase and provide helpful feedback (which
> > > > is the trait needed from a committer to merge code)
> > > >
> > > > Tianqi
> > > >
> > > >
> > > > On Sun, Oct 21, 2018 at 5:13 PM Steffen Rochel <
> > > steffenroc...@gmail.com>
> > > > wrote:
> > > >
> > > > > +1
> > > > > With the release announcement for MXNet 1.3 all contributors
> > incl.
> > > code
> > > > > reviewers have been recognized. I suggest all future release
> > > announcements
> > > > > should include such recognition. Are you suggesting to
> highlight
> > > most
> > > > > active reviewers in release announcement or regularly (e.g.
> > > monthly),
> > > > > specifically from non-committers?
> > > > >
> > > > > On Sun, Oct 21, 2018 at 10:11 AM Tianqi Chen <
> tqc...@apache.org>
> > > wrote:
> > > > >
> > > > > > Also re another email-thread (I sent out one with my
> > > > > > institutional email which got blocked initially, so this one is a
> > > > > > bit of a duplication of that). I think it should really be the job
> > > > > > of committers to recognize
> > 

Re: [Discussion] Separating PMC and Committership

2018-10-18 Thread Chris Olivier
IMHO it’s not a great idea to develop hard criteria for committer and PMC
as if it were some sort of checklist. If that were the case, then people
would tend to be laser-focused on checking items off the list rather than
driven by a bona fide desire to improve the product and the community. It
would also make it difficult to consider other intangibles in the decision.


On Thu, Oct 18, 2018 at 5:43 AM Carin Meier  wrote:

> Thanks Michael for making the process clearer to me. It helps quite a bit.
>
> Also thanks to Chris and Steffen for your clarification and input.
>
> I think there are two issues that are intermingled in considering this. One
> relates to separating levels of committer and PMC member. The other, as
> Steffen pointed out, relates to the criteria which we use to consider
> people for these levels of membership. I would propose that to make it
> easier to achieve consensus, we consider them each as their own proposal.
>
> The proposal of separating levels of committer and PMC member can be
> considered on the Apache definitions of rights and responsibilities here
> https://www.apache.org/foundation/how-it-works.html#roles: Since the PMC
> member has more rights and responsibilities than a committer, I think it
> implies a stricter criteria, (although it would be unspecified in the
> proposal).
>
> The proposal of redefining our project's criteria in respect to how we
> consider nomination to those roles could be a separate discussion and vote
> since there are other issues that we might want to tackle such as inclusion
> of non-code contributions and general alignment to the Apache definitions.
>
> We can of course choose to tackle the proposal of redefining the criteria
> first or do the separation of levels first since the discussion is already
> in progress.
>
> Thoughts?
>
> - Carin
>
>
>
>
>
>
> On Thu, Oct 18, 2018 at 2:04 AM Steffen Rochel 
> wrote:
>
> > Haibin's proposed "For active contributors we first invite them to become
> > our committers. Later on as they make significant contribution, we can
> > invite them to PMC."
> > Several people raised the question what defines "active contributors" and
> > "make significant contribution". In my view the discussion has not
> answered
> > the questions and it is not clear to me what changes are proposed to
> > https://cwiki.apache.org/confluence/display/MXNET/Becoming+a+Committer .
> > I'm making the assumption that the proposal is to simplify the path for
> > becoming a committer to grow the committer community. So far I have not
> > heard what changes or simplifications are proposed. Without a change I
> fail
> > to see the benefit of this proposal to increase the number of committers.
> > I agree that the path from committer to PMC member should be clarified as
> > well and suggest to align with expectations and responsibilities of PMC
> > members.
> > I'm also under the assumption that the proposal only applies for future
> > committers and PMC members, not for existing PMC members and this
> > assumption should be clarified.
> >
> > Steffen
> >
> > On Wed, Oct 17, 2018 at 4:29 PM Chris Olivier 
> > wrote:
> >
> > > I believe the assumption has always been that current PMC members will
> > > remain PMC members.
> > >
> > > On Wed, Oct 17, 2018 at 3:51 PM Michael Wall  wrote:
> > >
> > > > I too think separating committers from PMC is a good idea for your
> > > project
> > > > given the desire to grow committers and the concerns I have seen
> trying
> > > to
> > > > add new committers.  I saw at least one other mentor, Jim on this
> > thread
> > > > too.
> > > >
> > > > Is the plan to leave all current PMC members in the PMC?  If that is
> > not
> > > > the plan, perhaps more discussion is required before moving on.
> > > >
> > > > Assuming you feel the discussion is done, someone needs to start a
> > vote.
> > > > This would be a procedural change as outlined on
> > > > https://www.apache.org/foundation/voting.html
> > > >
> > > > If I were doing it, I would announce on this thread I am starting a
> > vote
> > > on
> > > > this matter tomorrow or some specified time.  I might even outline
> what
> > > the
> > > > vote will be.  This gives people a chance to speak up if they think
> more
> > > > discussion is needed.  Assuming no more discussion, start a [VOTE]
> > thread
> > > > on the dev list.

Re: [Discussion] Separating PMC and Committership

2018-10-17 Thread Chris Olivier
I believe the assumption has always been that current PMC members will
remain PMC members.

On Wed, Oct 17, 2018 at 3:51 PM Michael Wall  wrote:

> I too think separating committers from PMC is a good idea for your project
> given the desire to grow committers and the concerns I have seen trying to
> add new committers.  I saw at least one other mentor, Jim on this thread
> too.
>
> Is the plan to leave all current PMC members in the PMC?  If that is not
> the plan, perhaps more discussion is required before moving on.
>
> Assuming you feel the discussion is done, someone needs to start a vote.
> This would be a procedural change as outlined on
> https://www.apache.org/foundation/voting.html
>
> If I were doing it, I would announce on this thread I am starting a vote on
> this matter tomorrow or some specified time.  I might even outline what the
> vote will be.  This gives people a chance to speak up if they think more
> discussion is needed.  Assuming no more discussion, start a [VOTE] thread
> on the dev list.
>
> I am used to seeing [VOTE] and [DISCUSS] in the subject line of such emails
> but I didn't find any official guidance on that.  Maybe it is a project by
> project decision, I did find
> https://cwiki.apache.org/confluence/display/EDGENT/Sample+process+emails.
> I totally parsed right over the [Discussion] in the subject of this thread but
> I'll be on the look out for it in the future.
>
> Thanks
>
> Mike
>
> On Wed, Oct 17, 2018 at 6:05 PM Carin Meier  wrote:
>
> > Let me rephrase the question.
> >
> > Since I'm new to the committer/PMC process, I wondering what the next
> step
> > is in a proposed change of process like this.
> >
> > If we gauge that there is a significant enough interest do we propose a
> > vote? Is there enough interest and information to have a vote in this
> case?
> >
> > (Anyone feel free to answer the question - mentor or not :) )
> >
> > - Carin
> >
> > On Tue, Oct 16, 2018 at 7:48 PM Carin Meier 
> wrote:
> >
> > > This has been a very interesting discussion and I think it underlined a
> > > desire to increase the committer pool and community for the project.
> I'm
> > > wondering now what the next steps would look like?
> > >
> > > Do any mentors have any advice on how to proceed?
> > >
> > > - Carin
> > >
> > > On Thu, Oct 11, 2018 at 1:23 PM Jim Jagielski  wrote:
> > >
> > >> In my experience, and in my opinion, it makes sense to distinguish and
> > >> differentiate between a committer and a PMC member. The latter shows
> > just a
> > >> bit more "investment" in the project and has obtained a bit more merit
> > due
> > >> to their continued efforts.
> > >>
> > >> Of course, what we also need is some public governance model that
> shows
> > >> what these levels are, what they mean and how to obtain them. The
> > following
> > >> is the normal setup for Apache projects:
> > >>
> > >> https://www.apache.org/foundation/how-it-works.html#roles
> > >>
> > >> The nice thing is that this also allows for a very low bar to entry
> > >> for committership while still maintaining a somewhat higher bar for
> > >> the PPMC, which is great for community building.
> > >>
> > >> > On Oct 9, 2018, at 2:11 PM, Haibin Lin 
> > >> wrote:
> > >> >
> > >> > Dear MXNet community,
> > >> >
> > >> > In the past when we invite a person to become a committer, he/she is
> > >> > automatically made a PMC member. However, a lot of communities keep
> > >> > a small PMC and a bigger, more diverse set of committers to enrich
> > >> > the community. This
> > >> > has the benefit of having two opportunities to encourage
> contribution.
> > >> This
> > >> > can also help lower the bar for inviting committers, which helps
> build
> > >> > consensus in our already large PMC. I'd like to propose the
> following:
> > >> >
> > >> > For active contributors we first invite them to become our
> committers.
> > >> > Later on as they make significant contribution, we can invite them
> to
> > >> PMC.
> > >> >
> > >> >
> > >> > ===
> > >> > Comments from Marco:
> > >> >
> > >> > That's a great idea!
> > >> >
> > >> > The hard question is how to differentiate between a committer and a
> > PMC
> > >> > member and where we set the bar for each. If I understand you right,
> > you
> > >> > are proposing to honor active contributions by volume (or another
> > >> similar
> > >> > metric). While I think that's a good idea in general, I have a few
> > >> thoughts:
> > >> >
> > >> > We definitely have a lot of active people in the project, but let's
> > say
> > >> > that they contribute a substantial amount, but their contributions
> > >> can't go
> > >> > in as-is because they lack quality, consistency, testing or they
> don't
> > >> > match with the overall style and best practices. For a
> code-committer,
> > >> this
> > >> > would still be a no-go in my opinion. That person would still
> require
> > >> some
> > >> > guidance and mentoring until they are aligned with the project style
> > >> > and guidelines, as otherwise they might accept low-quality PRs.

Re: [Discussion] Separating PMC and Committership

2018-10-09 Thread Chris Olivier
Is it convenient to define the difference and the rights and privileges of
each? Write access, private list, voting and veto power, etc.?

On Tue, Oct 9, 2018 at 11:11 AM Haibin Lin  wrote:

> Dear MXNet community,
>
> In the past when we invite a person to become a committer, he/she is
> automatically made a PMC member. However, a lot of communities keep a small
> PMC and a bigger, more diverse set of committers to enrich the community. This
> has the benefit of having two opportunities to encourage contribution. This
> can also help lower the bar for inviting committers, which helps build
> consensus in our already large PMC. I'd like to propose the following:
>
> For active contributors, we first invite them to become our committers.
> Later on, as they make significant contributions, we can invite them to the PMC.
>
>
> ===
> Comments from Marco:
>
> That's a great idea!
>
> The hard question is how to differentiate between a committer and a PMC
> member and where we set the bar for each. If I understand you right, you
> are proposing to honor active contributions by volume (or another similar
> metric). While I think that's a good idea in general, I have a few
> thoughts:
>
> We definitely have a lot of active people in the project, but let's say
> that they contribute a substantial amount, but their contributions can't go
> in as-is because they lack quality, consistency, testing or they don't
> match with the overall style and best practices. For a code-committer, this
> would still be a no-go in my opinion. That person would still require some
> guidance and mentoring until they are aligned with the project style and
> guidelines as otherwise they might accept low-quality PRs. I know we can
> revert that, but let's avoid confrontation as much as possible.
>
> The minimum bar for a code committer would then be:
> - (almost) unaltered acceptance of their PRs (of course, some PRs are
> intentionally made for discussions and those would even be a plus!)
> - following mxnets community guidelines, rules and styles
> - giving useful reviews (in order to see how they would be as reviewers if
> they were a committer)
> These would be weighted differently on a case-by-case basis, but this could be
> a starting point to describe what we are looking for.
>
> From committer to PMC on the other hand, the difference is quite small.
> Something I personally would be looking for are three things:
> - judgement
> - community engagement
> - Apache way
> While a committer might be chosen due to their contributions, they wouldn't
> be evaluated that strictly for the above points. A PMC member is a
> representative of the project who steers the long term development of it.
> Thus, they should be active on our channels like dev@, make good reviews
> on
> GitHub (if applicable), express good judgement and reasoning during votes
> and generally show that they are helpful to the project on a
> non-code level.
>
> These are just some thoughts of mine to help start off this discussion. It
> would be good to hear what other people are looking for while evaluating
> candidates and if there's anything they would like to highlight.
>
> ==
>
> Comments from Carin:
>
> I think it is a good idea. Here is a bit of reasoning behind my thoughts.
>
> *Pros of separating Committer and PMC *
>  - It would allow us to bring on more committers than the previous criteria
> allowed, which would help in giving people the tools they need to be productive.
>  - The increased productivity should allow PRs to be reviewed and merged
> more quickly.
>  - Provide a more welcoming experience for people posting new PRs to have
> them processed faster.
>  - Also provide an additional layer of membership (PMC) after a committer
> to help motivate involvement.
>
> *Cons of separating*
>  - There is a possibility of having someone as a committer who isn't as
> closely aligned to the standards, and quality suffers.
> *Possible Mitigation*
> - We do have a robust CI that should ensure that basic functionality
> doesn't break.
> - Communicate clearly, when a new committer is announced, what the
> expected standards of committership are.
> - Two votes now need to happen for a person since there are two levels.
>*Possible Mitigation*
> - If we are convinced the person would be a good PMC member as well, we
> could vote them as both at the same time.
>
> I think it would be a good change to try and see how it works out over a
> period of a few months. The nice thing is that if we feel like it isn't
> working well, we can always change the process.
>
> ==
>
>
> Best,
> Haibin
>


Re: [DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-10-04 Thread Chris Olivier
-0.5 if keeping it as a warning.


On Thu, Oct 4, 2018 at 1:23 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> I agree that active C++ developers should be the ones making these
> choices.  The downside to that is that many of these people are already
> pretty busy.  To make the best use possible of their time it would probably
> make sense to create a concise doc with proposed style changes and ETAs
> rather than focusing on a single change (modernized range loops).  We've
> got the infrastructure in place now to support any decisions made, and I
> think it's in the best interest of the project as it will (1) unify coding
> style to help readability and (2) make life easier for reviewers, who
> are a critical resource on this project.
>
> I've updated my PR such that it still modernizes all existing loops, but it
> will now only issue a warning in the case that someone commits an older or
> slower version of a loop.  Of course, as we've mentioned a few times, this
> only applies to cases where loops can be drop-in replaced with range
> loops, i.e. no loop indexes are actively used, etc.
>
> Would you be alright with merging this change with warnings instead of
> errors for the time being Chris?
>
> On Wed, Oct 3, 2018 at 7:20 PM Pedro Larroy 
> wrote:
>
> > +1
> >
> > @Chris: do you have data on the performance difference? As far as I know
> > there's a "rewrite rule" like the one between lambdas and C++ functors,
> so
> > performance should be very well defined, but maybe there's something that
> > you are pointing out that we are missing.
> >
> > Having a consistent and concise code base is beneficial. I think what
> > Kellen is advocating is to use range loops whenever possible, not
> > prescribing their usage in every case; if you have to iterate backwards
> > there are other ways, such as reverse iterators.
> >
> > On Fri, Sep 28, 2018 at 6:54 AM Chris Olivier 
> > wrote:
> >
> > > -1
> > >
> > > Range loops aren’t always the most performant way. In addition,
> > > sometimes you want the index. Or maybe you want to iterate backwards,
> > > or not start from the first, etc. Maybe you want the iterator because
> > > you remove it from the list at the bottom of the loop. Seems like a
> > > rule for the sake of having a rule.
> > >
> > > On Fri, Sep 28, 2018 at 2:12 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Hello MXNet devs,
> > > >
> > > > I'd like to discuss uniformly adopting C++11 range loops in the MXNet
> > > > project.  The benefits I see are:
> > > >
> > > > *  Improved C++ readability (examples below).
> > > > *  Consistency with other languages.  The range-loops are quite
> > > > similar to loops in almost all other programming languages.  Given
> > > > we're a project that supports many languages, this language
> > > > consistency could be positive for our community.
> > > > * Consistency within the same project.  Currently different authors
> > > > have different loop styles, which hurts codebase readability.
> > > > *  Best available performance.  There are often multiple ways to
> write
> > > > loops in C++ with subtle differences in performance and memory usage
> > > > between loop methods.  Using range-loops ensures we get the best
> > possible
> > > > perf using an intuitive loop pattern.
> > > > *  Slightly lower chance for bugs / OOB accesses when dealing with
> > > indexing
> > > > in an array for example.
> > > >
> > > > If we decide to enable this uniformly throughout the project we can
> > > enable
> > > > this policy with a simple clang-tidy configuration change.  There
> would
> > > be
> > > > no need for reviewers to have to manually provide feedback when
> someone
> > > > uses an older C++ loop style.
> > > >
> > > > -Kellen
> > > >
> > > > Reference PR:  https://github.com/apache/incubator-mxnet/pull/12356/
> > > > Previous clang-tidy discussion on the list:
> > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/b0ae5a9df5dfe0d9074cb2ebe432264db4fa2175b89fa43a5f6e36be@%3Cdev.mxnet.apache.org%3E
> > > >
> > > > -
> > > > Examples:
> > > > for (auto axis_iter = param.axis.begin() ; axis_iter!=
> > param.axis.end();
> > > > ++axis_iter) {
> > > > CHECK_LT(*axis_iter, static_cast<int>(ishape.ndim()));
> > > > stride_[reverse_index] = ishape[*axis_iter];
> > > > ...
> > > > -->
> > > > for (int axis : param.axis) {
> > > > CHECK_LT(axis, static_cast<int>(ishape.ndim()));
> > > > stride_[reverse_index] = ishape[axis];
> > > > ...
> > > > --
> > > > for (size_t i = 0; i < in_array.size(); i++) {
> > > > auto &nd = in_array[i];
> > > > pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true,
> nd.dtype());
> > > > }
> > > > -->
> > > > for (auto & nd : in_array) {
> > > > pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true,
> nd.dtype());
> > > > }
> > > >
> > >
> >
>

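For reference, the warnings-only rollout described in the message above maps
to a small clang-tidy configuration. The following is just a sketch assuming
clang-tidy's stock modernize-loop-convert check and its MinConfidence option;
it is not the project's actual .clang-tidy file:

Checks: '-*,modernize-loop-convert'
WarningsAsErrors: ''        # keep findings as warnings, not build errors
CheckOptions:
  - key:   modernize-loop-convert.MinConfidence
    value: reasonable       # 'safe' would convert even fewer loops

With WarningsAsErrors left empty, a loop that has a drop-in range-for
equivalent is flagged during CI but does not fail the build.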

Jira and Github

2018-10-04 Thread Chris Olivier
Jira has a new tight integration with GitHub. I think we should look into
enabling it.

https://blog.github.com/2018-10-04-announcing-the-new-github-and-jira-software-cloud-integration/


Re: [DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-09-30 Thread Chris Olivier
unless you don’t think that’s reasonable...

On Sun, Sep 30, 2018 at 7:59 AM Chris Olivier  wrote:

> If you get three +1’s from the top 6 contributors of C++ code (by volume),
> I’ll switch to -0, since the ones committing the most C++ code will be the
> most impacted and probably it should be their decision, imho.
>
> On Sun, Sep 30, 2018 at 12:28 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
>> About 60, but they're all addressed in the ref PR.
>>
>> On Sun, Sep 30, 2018, 6:12 AM Chris Olivier 
>> wrote:
>>
>> > How many errors exist in the code base right now if it were to be
>> enabled?
>> >
>> > On Sat, Sep 29, 2018 at 7:03 PM Naveen Swamy 
>> wrote:
>> >
>> > > Thanks Kellen & Anton, for your detailed explanation and links to
>> > > advantages, appreciate it.
>> > > changing my vote to *-0*, I suggest to show as warnings.
>> > >
>> > > On Sat, Sep 29, 2018 at 8:06 PM Anton Chernov 
>> > wrote:
>> > >
>> > > > And if you want a more authoritative opinion on that check out what
>> the
>> > > C++
>> > > > core guidelines are saying [1]:
>> > > >
>> > > > > ES.71: Prefer a range-for-statement to a for-statement when there
>> is
>> > a
>> > > > choice
>> > > > > Reason
>> > > > > Readability. Error prevention. Efficiency.
>> > > >
>> > > > Best regards
>> > > > Anton
>> > > >
>> > > > [1]
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Res-for-range
>> > > >
>> > > >
>> > > > On Sat, 29 Sep 2018 at 16:13, Anton Chernov :
>> > > >
>> > > > > +1
>> > > > >
>> > > > > Maybe it's not necessary to enforce usage of range-based for, but
>> > > > > I would highly encourage to do it due to the already named
>> > > > > advantages. If code were introduced using the old style, there
>> > > > > could be a comment suggesting the new way. But why do the manual
>> > > > > work and not leave that to the automated tool?
>> > > > >
>> > > > > And since it's already automated - wouldn't it be better to keep a
>> > > > unified
>> > > > > modern style?
>> > > > >
>> > > > > Just to make this a trend - C++ evolves quickly and this will not
>> > > > > be the only upgrade that will need to be made. And the easier such
>> > > > > upgrades get accepted, the easier it is in general to upgrade the
>> > > > > codebase.
>> > > > >
>> > > > > Soon the standard will get ranges and concepts and this will
>> change
>> > the
>> > > > > way C++ applications get written significantly. It is a good
>> habit to
>> > > be
>> > > > > open for changes and keep up with the trends. By using the new
>> > > > > possibilities the language can offer you prepare yourself for
>> further
>> > > > > changes and are more likely to accept them, evolving your
>> programming
>> > > > style.
>> > > > >
>> > > > > Take a look at new examples of modern usage (taken from [1]):
>> > > > >
>> > > > > // since C++17
>> > > > > for (auto&& [first,second] : mymap) {
>> > > > > // use first and second
>> > > > > }
>> > > > >
>> > > > > // since C++20
>> > > > > for (auto& x : foo().items()) { /* .. */ }
>> > > > > // undefined behavior if foo() returns by value
>> > > > > for (T thing = foo(); auto& x : thing.items()) { /* ... */ } // OK
>> > > > >
>> > > > > // since C++11
>> > > > > struct cow_string { /* ... */ };
>> > > > > // a copy-on-write string
>> > > > > cow_string str = /* ... */;
>> > > > > // for(auto x : str) { /* ... */ } // may cause deep copy
>> > > > > for(auto x : std::as_const(str)) { /* ... */ }
>> > > > >
>> >

Re: [DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-09-30 Thread Chris Olivier
If you get three +1’s from the top 6 contributors of C++ code (by volume),
I’ll switch to -0, since the ones committing the most C++ code will be the
most impacted and probably it should be their decision, imho.

On Sun, Sep 30, 2018 at 12:28 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> About 60, but they're all addressed in the ref PR.
>
> On Sun, Sep 30, 2018, 6:12 AM Chris Olivier  wrote:
>
> > How many errors exist in the code base right now if it were to be
> enabled?
> >
> > On Sat, Sep 29, 2018 at 7:03 PM Naveen Swamy  wrote:
> >
> > > Thanks Kellen & Anton, for your detailed explanation and links to
> > > advantages, appreciate it.
> > > changing my vote to *-0*, I suggest to show as warnings.
> > >
> > > On Sat, Sep 29, 2018 at 8:06 PM Anton Chernov 
> > wrote:
> > >
> > > > And if you want a more authoritative opinion on that check out what
> the
> > > C++
> > > > core guidelines are saying [1]:
> > > >
> > > > > ES.71: Prefer a range-for-statement to a for-statement when there
> is
> > a
> > > > choice
> > > > > Reason
> > > > > Readability. Error prevention. Efficiency.
> > > >
> > > > Best regards
> > > > Anton
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Res-for-range
> > > >
> > > >
> > > > On Sat, 29 Sep 2018 at 16:13, Anton Chernov :
> > > >
> > > > > +1
> > > > >
> > > > > Maybe it's not necessary to enforce usage of range-based for, but I
> > > > > would highly encourage to do it due to the already named advantages.
> > > > > If code were introduced using the old style, there could be a comment
> > > > > suggesting the new way. But why do the manual work and not leave that
> > > > > to the automated tool?
> > > > >
> > > > > And since it's already automated - wouldn't it be better to keep a
> > > > unified
> > > > > modern style?
> > > > >
> > > > > Just to make this a trend - C++ evolves quickly and this will not
> > > > > be the only upgrade that will need to be made. And the easier such
> > > > > upgrades get accepted, the easier it is in general to upgrade the
> > > > > codebase.
> > > > >
> > > > > Soon the standard will get ranges and concepts and this will change
> > the
> > > > > way C++ applications get written significantly. It is a good habit
> to
> > > be
> > > > > open for changes and keep up with the trends. By using the new
> > > > > possibilities the language can offer you prepare yourself for
> further
> > > > > changes and are more likely to accept them, evolving your
> programming
> > > > style.
> > > > >
> > > > > Take a look at new examples of modern usage (taken from [1]):
> > > > >
> > > > > // since C++17
> > > > > for (auto&& [first,second] : mymap) {
> > > > > // use first and second
> > > > > }
> > > > >
> > > > > // since C++20
> > > > > for (auto& x : foo().items()) { /* .. */ }
> > > > > // undefined behavior if foo() returns by value
> > > > > for (T thing = foo(); auto& x : thing.items()) { /* ... */ } // OK
> > > > >
> > > > > // since C++11
> > > > > struct cow_string { /* ... */ };
> > > > > // a copy-on-write string
> > > > > cow_string str = /* ... */;
> > > > > // for(auto x : str) { /* ... */ } // may cause deep copy
> > > > > for(auto x : std::as_const(str)) { /* ... */ }
> > > > >
> > > > > Regarding performance: it's really easy to prove that generated
> > > assembly
> > > > > is not changing at all. There is a really handy tool for that [2].
> > You
> > > > can
> > > > > check online the assembly for different language constructs and
> > > different
> > > > > compilers.
> > > > >
> > > > > Best regards,
> > > > > Anton
> > > > >
> > > > > [1] https://en.cppreference.com/w/cpp/language/

Re: [DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-09-29 Thread Chris Olivier
How many errors exist in the code base right now if it were to be enabled?

On Sat, Sep 29, 2018 at 7:03 PM Naveen Swamy  wrote:

> Thanks Kellen & Anton, for your detailed explanation and links to
> advantages, appreciate it.
> changing my vote to *-0*, I suggest to show as warnings.
>
> On Sat, Sep 29, 2018 at 8:06 PM Anton Chernov  wrote:
>
> > And if you want a more authoritative opinion on that check out what the
> C++
> > core guidelines are saying [1]:
> >
> > > ES.71: Prefer a range-for-statement to a for-statement when there is a
> > choice
> > > Reason
> > > Readability. Error prevention. Efficiency.
> >
> > Best regards
> > Anton
> >
> > [1]
> >
> >
> https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Res-for-range
> >
> >
> > On Sat, 29 Sep 2018 at 16:13, Anton Chernov :
> >
> > > +1
> > >
> > > Maybe it's not necessary to enforce usage of range-based for, but I
> > > would highly encourage to do it due to the already named advantages. If
> > > code were introduced using the old style, there could be a comment
> > > suggesting the new way. But why do the manual work and not leave that to
> > > the automated tool?
> > >
> > > And since it's already automated - wouldn't it be better to keep a
> > unified
> > > modern style?
> > >
> > > Just to make this a trend - C++ evolves quickly and this will not be
> > > the only upgrade that will need to be made. And the easier such upgrades
> > > get accepted, the easier it is in general to upgrade the codebase.
> > >
> > > Soon the standard will get ranges and concepts and this will change the
> > > way C++ applications get written significantly. It is a good habit to
> be
> > > open for changes and keep up with the trends. By using the new
> > > possibilities the language can offer you prepare yourself for further
> > > changes and are more likely to accept them, evolving your programming
> > style.
> > >
> > > Take a look at new examples of modern usage (taken from [1]):
> > >
> > > // since C++17
> > > for (auto&& [first,second] : mymap) {
> > > // use first and second
> > > }
> > >
> > > // since C++20
> > > for (auto& x : foo().items()) { /* .. */ }
> > > // undefined behavior if foo() returns by value
> > > for (T thing = foo(); auto& x : thing.items()) { /* ... */ } // OK
> > >
> > > // since C++11
> > > struct cow_string { /* ... */ };
> > > // a copy-on-write string
> > > cow_string str = /* ... */;
> > > // for(auto x : str) { /* ... */ } // may cause deep copy
> > > for(auto x : std::as_const(str)) { /* ... */ }
> > >
> > > Regarding performance: it's really easy to prove that generated
> assembly
> > > is not changing at all. There is a really handy tool for that [2]. You
> > can
> > > check online the assembly for different language constructs and
> different
> > > compilers.
> > >
> > > Best regards,
> > > Anton
> > >
> > > [1] https://en.cppreference.com/w/cpp/language/range-for
> > > [2] https://gcc.godbolt.org
> > >
> > > On Sat, 29 Sep 2018 at 13:15, kellen sunderland <
> > > kellen.sunderl...@gmail.com>:
> > >
> > >> It's more readable because it's concise and it's consistent for many
> > types
> > >> you're looping over (i.e. primitive arrays, stl iterators, etc all
> work
> > >> the
> > >> same way).  It's also useful because it's consistent with other
> > >> programming
> > >> languages, making C++ codebases much easier to read for novice and
> > >> intermediate developers.  IMO it also leads to better naming in loop
> > >> bodies
> > >> as the concise style means you're less likely to have important
> > >> one-letter variable names describing loop elements (e.g. no int i = 0
> > >> or it ...).
> > >> More
> > >> motivation can be found in the cpp standards proposals for C++11
> > >> http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1868.html
> and
> > >> http://open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3853.htm.
> > >>
> > >>
> > >>
> > >> On Sat, Sep 29, 2018 at 6:38 PM Naveen Swamy 
> > wrote:
> > >>
> > >> > Kellen,
> > >> >
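To make the generated-assembly claim above easy to check, here is the kind of
minimal pair one can paste into https://gcc.godbolt.org. This is a
hypothetical example rather than MXNet code; with -O2 on recent GCC or Clang
the two functions typically compile to identical assembly:

#include <vector>

int sum_iterator(const std::vector<int>& v) {
  int total = 0;
  // classic explicit-iterator loop
  for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
    total += *it;
  return total;
}

int sum_range(const std::vector<int>& v) {
  int total = 0;
  // drop-in range-for replacement
  for (int x : v)
    total += x;
  return total;
}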

Re: [DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-09-28 Thread Chris Olivier
ok then, my vote is still -1, however, because it’s just adding needless
friction for developers imho.

On Fri, Sep 28, 2018 at 7:42 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> "Range loops aren’t always the most performant way" Do you have an example
> where there's a perf difference?
>
> "In addition, sometimes you want the index. Or maybe you want to iterate
> backwards, or not start from the first, etc. Maybe you want the iterator
> because you remove it from the list at the bottom of the loop. Seems
> like a rule for the sake of having a rule."
>
> I should have been more clear about this point.  If you're using the index
> in the loop, doing reverse iteration, or not iterating from start-to-end
> this inspection is smart enough to realize it and will not suggest
> optimizing that type of loop.  The loops that would be changed are _only_
> the loops which are detected as equivalent to range-loops.  Examples can be
> found here:
> https://clang.llvm.org/extra/clang-tidy/checks/modernize-loop-convert.html
> or you can look at what's been changed in the ref PR.  I've initially set
> our confidence level at 'reasonable' but we could also set to 'safe' which
> would further reduce the number of loops the check would apply to.
>
> -Kellen
>
> On Fri, Sep 28, 2018 at 3:54 PM Chris Olivier 
> wrote:
>
> > -1
> >
> > Range loops aren’t always the most performant way. In addition, sometimes
> > you want the index. Or maybe you want to iterate backwards, or not start
> > from the first, etc. Maybe you want the iterator because you remove it
> > from the list at the bottom of the loop. Seems like a rule for the sake of
> > having a rule.
> >
> > On Fri, Sep 28, 2018 at 2:12 AM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Hello MXNet devs,
> > >
> > > I'd like to discuss uniformly adopting C++11 range loops in the MXNet
> > > project.  The benefits I see are:
> > >
> > > *  Improved C++ readability (examples below).
> > > *  Consistency with other languages.  The range-loops are quite similar
> > > to loops in almost all other programming languages.  Given we're a
> > > project that supports many languages, this language consistency could be
> > > positive for our community.
> > > * Consistency within the same project.  Currently different authors
> > > have different loop styles, which hurts codebase readability.
> > > *  Best available performance.  There are often multiple ways to write
> > > loops in C++ with subtle differences in performance and memory usage
> > > between loop methods.  Using range-loops ensures we get the best
> possible
> > > perf using an intuitive loop pattern.
> > > *  Slightly lower chance for bugs / OOB accesses when dealing with
> > indexing
> > > in an array for example.
> > >
> > > If we decide to enable this uniformly throughout the project we can
> > enable
> > > this policy with a simple clang-tidy configuration change.  There would
> > be
> > > no need for reviewers to have to manually provide feedback when someone
> > > uses an older C++ loop style.
> > >
> > > -Kellen
> > >
> > > Reference PR:  https://github.com/apache/incubator-mxnet/pull/12356/
> > > Previous clang-tidy discussion on the list:
> > >
> > >
> >
> https://lists.apache.org/thread.html/b0ae5a9df5dfe0d9074cb2ebe432264db4fa2175b89fa43a5f6e36be@%3Cdev.mxnet.apache.org%3E
> > >
> > > -
> > > Examples:
> > > for (auto axis_iter = param.axis.begin() ; axis_iter!=
> param.axis.end();
> > > ++axis_iter) {
> > > CHECK_LT(*axis_iter, static_cast<int>(ishape.ndim()));
> > > stride_[reverse_index] = ishape[*axis_iter];
> > > ...
> > > -->
> > > for (int axis : param.axis) {
> > > CHECK_LT(axis, static_cast<int>(ishape.ndim()));
> > > stride_[reverse_index] = ishape[axis];
> > > ...
> > > --
> > > for (size_t i = 0; i < in_array.size(); i++) {
> > > auto &nd = in_array[i];
> > > pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true, nd.dtype());
> > > }
> > > -->
> > > for (auto & nd : in_array) {
> > > pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true, nd.dtype());
> > > }
> > >
> >
>

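To illustrate the "drop-in replaceable only" scope described in the thread
above: modernize-loop-convert would rewrite the first loop below but leave
the second alone, because its index takes part in the computation. This is a
hypothetical example, not MXNet code:

#include <cstdio>
#include <vector>

void print_all(const std::vector<int>& v) {
  // The index only subscripts the container, so the check suggests the
  // equivalent range loop: for (int x : v) std::printf("%d\n", x);
  for (size_t i = 0; i < v.size(); ++i)
    std::printf("%d\n", v[i]);
}

void print_adjacent_sums(const std::vector<int>& v) {
  // The index participates in the logic (both i and i + 1 are used), so
  // there is no equivalent range-for and the check skips this loop.
  for (size_t i = 0; i + 1 < v.size(); ++i)
    std::printf("%d\n", v[i] + v[i + 1]);
}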

Re: [DISCUSS] Use modernized C++11 range loops uniformly throughout the project

2018-09-28 Thread Chris Olivier
-1

Range loops aren’t always the most performant way. In addition, sometimes
you want the index. Or maybe you want to iterate backwards, or not start
from the first, etc. Maybe you want the iterator because you remove it from
the list at the bottom of the loop. Seems like a rule for the sake of
having a rule.

On Fri, Sep 28, 2018 at 2:12 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Hello MXNet devs,
>
> I'd like to discuss uniformly adopting C++11 range loops in the MXNet
> project.  The benefits I see are:
>
> *  Improved C++ readability (examples below).
> *  Consistency with other languages.  The range-loops are quite similar to
> loops in almost all other programming languages.  Given we're a project that
> supports many languages, this language consistency could be positive for our
> community.
> * Consistency within the same project.  Currently different authors have
> different loop styles, which hurts codebase readability.
> *  Best available performance.  There are often multiple ways to write
> loops in C++ with subtle differences in performance and memory usage
> between loop methods.  Using range-loops ensures we get the best possible
> perf using an intuitive loop pattern.
> *  Slightly lower chance for bugs / OOB accesses when dealing with indexing
> in an array for example.
>
> If we decide to enable this uniformly throughout the project we can enable
> this policy with a simple clang-tidy configuration change.  There would be
> no need for reviewers to have to manually provide feedback when someone
> uses an older C++ loop style.
>
> -Kellen
>
> Reference PR:  https://github.com/apache/incubator-mxnet/pull/12356/
> Previous clang-tidy discussion on the list:
>
> https://lists.apache.org/thread.html/b0ae5a9df5dfe0d9074cb2ebe432264db4fa2175b89fa43a5f6e36be@%3Cdev.mxnet.apache.org%3E
>
> -
> Examples:
> for (auto axis_iter = param.axis.begin() ; axis_iter!= param.axis.end();
> ++axis_iter) {
> CHECK_LT(*axis_iter, static_cast<int>(ishape.ndim()));
> stride_[reverse_index] = ishape[*axis_iter];
> ...
> -->
> for (int axis : param.axis) {
> CHECK_LT(axis, static_cast<int>(ishape.ndim()));
> stride_[reverse_index] = ishape[axis];
> ...
> --
> for (size_t i = 0; i < in_array.size(); i++) {
> auto &nd = in_array[i];
> pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true, nd.dtype());
> }
> -->
> for (auto & nd : in_array) {
> pre_temp_buf_.emplace_back(nd.shape(), nd.ctx(), true, nd.dtype());
> }
>


Re: Feedback request for new Java API

2018-09-27 Thread Chris Olivier
My $0.02, since I’m working with a lot of Java and Scala lately, including
the interaction between the two:

Please keep in mind the more complex dependency issues that will be
introduced by requiring Java users to pull in a large Scala dependency
base. In addition, a lot of Scala is compiler-dependent, with different
jars for 2.10, 2.12, etc., so they live in different places than the
regular jars, and having a Java person suddenly have to deal with this is
a good way to make him say “no thanks”. My day-job build and runtime
environment systems, for instance, don’t handle the mixing very well
without having to hack a bunch of build files, and even then it causes
problems from time to time. Scala experts won’t have problems, but it’s a
steep learning curve for Java (or C++) folks.

On Thu, Sep 27, 2018 at 6:14 PM YiZhi Liu  wrote:

> I vote for "2.) Leave the existing macro in place and add another
> which generates a Java friendly version"
>
> @Qing @Andrew, could you give some examples, so that people can better
> understand how it provides "best possible experience" to Java users.
>
> I have no strong preference between having JavaShape & JavaContext or not.
> On Thu, Sep 27, 2018 at 5:56 PM Andrew Ayres 
> wrote:
> >
> > That's not really the conversation I'm wanting to have. I want a
> discussion
> > about the macros with respect to NDArray so that we can get agreement on
> > our path forward with respect to implementing the NDArray wrapper.
> >
> > The design that was put forth and agreed to was for a Java wrapper
> > around
> > the Scala API. Adding a bunch of Java friendly methods inside the Scala
> > code would create a mess for users. Maintenance would be essentially the
> > same for both because either way you're going to be updating Java methods
> > when you make Scala changes.
> >
> > Let's please stick with the issue in the original email.
> >
> > Thanks,
> > Andrew
> >
> > On Thu, Sep 27, 2018 at 5:22 PM Qing Lan  wrote:
> >
> > > I would like to loop this back up a layer. Currently, there is a
> > > discussion in the MXNet Scala community on the ways to implement the
> > > Java APIs. There are two thoughts:
> > >
> > > 1. Make Scala Java-friendly (create Java-compatible methods in the
> > > Scala class, such as NDArray with a Java-compatible constructor)
> > > 2. Make Java friendly wrappers in Scala (Andrew's explanation below)
> > >
> > > The first approach requires minimum input from our side to implement;
> > > however, it brings users a bunch of useless APIs they may not want to
> > > use. It also makes the Scala package heavier. The good thing is these
> > > two packages require minimum maintenance cost. As a tradeoff, if at any
> > > time in the future we want to make Java big (make Java the primary
> > > language supported by MXNet), then the migration from Scala to Java
> > > will be harmful. Spark considered this carefully and decided not to
> > > change much of their Scala code base to make it more Java-friendly.
> > >
> > > The second approach will create a unique NDArray, Shape, Context and
> > > more. The good thing about this is we always hold version control on
> > > the Java side: some breaking changes in Scala may not influence Java
> > > much. It is the best way to decouple the modules and is good for us to
> > > build a unique pipeline for Java. The bad thing with this design is the
> > > maintenance cost, as we need to keep two code bases, but it also makes
> > > the Java side easy to change to keep it better compatible with users.
> > >
> > > Thanks,
> > > Qing
> > >
> > > On 9/27/18, 3:25 PM, "Andrew Ayres"  wrote:
> > >
> > > Hi,
> > >
> > > Currently, we're working to implement a new Java API and would like
> > > some
> > > feedback from the community on an implementation detail. In short,
> the
> > > new
> > > Java API will use the existing Scala API (in a manner similar to
> how
> > > the
> > > current Clojure API works). This basically means that we're making
> Java
> > > friendly wrappers to call the existing Scala API.
> > >
> > > The feedback we're looking for is on the implementation of NDArray.
> > > Scala's
> > > NDArray has a significant amount of code which is generated via
> macros
> > > and
> > > we've got two viable paths to move forward:
> > >
> > > 1.) Change the macro to generate Java friendly methods  - To do
> this
> > > we'll
> > > modify the macro so that the generated methods won't have
> > > default/optional
> > > arguments. There may also have to be some changes to parameter
> types to
> > > make them Java friendly. The big advantage here is that ongoing
> > > maintenance
> > > will easier. The disadvantages are that we'll be changing the
> existing
> > > Scala NDArray Infer API (it's marked experimental) and Scala users
> will
> > > lose the ability to use the default and optional arguments.
> > >
> > > 2.) Leave the existing macro in place and add another which
> 
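For readers skimming the archive, a minimal sketch of what option 2 ("Java
friendly wrappers") could look like. Every name below (Context, ScalaTensor,
JavaTensor, create) is invented for illustration and is not the actual MXNet
API; the point is only that Scala default arguments are invisible to Java
callers, so a wrapper restores the ergonomics with plain overloads:

// Stand-ins for the Scala side; in Scala this would be something like
//   def create(shape: Array[Int], ctx: Context = Context.cpu()): ScalaTensor
final class Context {
  static Context cpu() { return new Context(); }
}

final class ScalaTensor {
  static ScalaTensor create(int[] shape, Context ctx) { return new ScalaTensor(); }
}

// The Java-friendly wrapper: overloads stand in for the Scala default argument.
public final class JavaTensor {
  private final ScalaTensor delegate;

  private JavaTensor(ScalaTensor delegate) { this.delegate = delegate; }

  public static JavaTensor create(int[] shape) {
    return create(shape, Context.cpu());  // supply the default explicitly
  }

  public static JavaTensor create(int[] shape, Context ctx) {
    return new JavaTensor(ScalaTensor.create(shape, ctx));
  }

  public static void main(String[] args) {
    JavaTensor t = JavaTensor.create(new int[]{2, 3});  // no Scala types needed
    System.out.println("created " + t);
  }
}

Option 1 would instead teach the existing macro to emit overloads like these
directly on the Scala classes; the trade-off discussed above is where that
generated surface lives and who maintains it.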

Re: Remove MKLML as dependency

2018-09-20 Thread Chris Olivier
I fixed this issue by adding linkage to openblas in the cmake. It was
already being done in the makefile, so there's no github issue.
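For context, the kind of CMake change being described looks roughly like the
sketch below. The target name is illustrative, and this assumes CMake's stock
FindBLAS module rather than MXNet's actual CMakeLists.txt:

set(BLA_VENDOR OpenBLAS)      # ask FindBLAS specifically for OpenBLAS
find_package(BLAS REQUIRED)   # fail configuration early if it is missing
# Link in the BLAS symbols MKLML does not provide (e.g. some linalg routines).
target_link_libraries(mxnet PRIVATE ${BLAS_LIBRARIES})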

On Thu, Sep 20, 2018 at 10:01 AM Lv, Tao A  wrote:

> " MKLML does not have a complete blas library and if you don’t link in
> another blas library like open blas, some functions will blow up (ie some
> of the linalg functions)."
> - Is there any GitHub issue for this problem? Maybe we can take a look.
>
> "I was not aware of MKLML still being required with MKLDNN."
> - Just to clarify, MKL-DNN doesn't require MKLML. For performance, MKL-DNN
> requires the GEMM functions which can be provided by both MKL and MKLML.
>
> -Original Message-
> From: Chris Olivier [mailto:cjolivie...@gmail.com]
> Sent: Friday, September 21, 2018 12:07 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: Remove MKLML as dependency
>
MKLML does not have a complete BLAS library, and if you don’t link in
another BLAS library like OpenBLAS, some functions will blow up (i.e. some
of the linalg functions).
>
> I was not aware of MKLML still being required with MKLDNN. I’ve never
> gotten a definitive answer about this from Da, although I’ve asked a couple
> of times.
>
> What does Da say about all of this?
>
> Unless there’s good reason to the contrary, removing MKLML and requiring
> the larger, strangely licensed standalone MKL for everyone seems a bit
> heavy-handed.
>
> On Thu, Sep 20, 2018 at 7:41 AM Lv, Tao A  wrote:
>
> > Hah, seems it's a little confusing here. I think the "Intel MKL" in
> > the first statement includes both the full MKL and MKLML library. And
> > the "dynamic library" there obviously means the MKLML which is
> > delivered in MKL-DNN repo.
> >
> > MKLML is a subset of full MKL and includes all BLAS functions for both
> > single precision and double precision. From this point of view, I
> > think it can be used as a BLAS library, but cannot be used as full MKL.
> >
> > -tao
> >
> > -Original Message-
> > From: Chris Olivier [mailto:cjolivie...@gmail.com]
> > Sent: Thursday, September 20, 2018 9:36 PM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: Remove MKLML as dependency
> >
> > thanks for the info. I am still a little confused — your statement
> > said “MKL” and not “MKLML”, so my question is still the same.  Are
> > GEMMS in MKLML or just MKL? I know MKLML doesn’t have a blas library
> > like the main MKL.
> >
> > On Wed, Sep 19, 2018 at 11:49 PM Lv, Tao A  wrote:
> >
> > > Hi Chris, please kindly check the statements here:
> > > https://github.com/intel/mkl-dnn#installation
> > >
> > > " Intel MKL-DNN can take advantage of optimized matrix-matrix
> > > multiplication (GEMM) function from Intel MKL. The dynamic library
> > > with this functionality is included in the repository. "
> > >
> > > " You can choose to build Intel MKL-DNN without binary dependency.
> > > The resulting version will be fully functional, however performance
> > > of certain convolution shapes and sizes and inner product relying on
> > > SGEMM function may be suboptimal."
> > >
> > > -tao
> > >
> > > -Original Message-
> > > From: Chris Olivier [mailto:cjolivie...@gmail.com]
> > > Sent: Thursday, September 20, 2018 11:20 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: Remove MKLML as dependency
> > >
> > > maybe I missed it, but what does MKLML have that mkldnn doesn’t have
> > > that makes it necessary?
> > >
> > > what’s the motivation for removing it?
> > >
> > > On Tue, Sep 18, 2018 at 11:31 PM Lv, Tao A  wrote:
> > >
> > > > If you just want to test the performance, I think you need to link
> > > > MKL for BLAS and MKL-DNN for NN. Also MKL-DNN should link MKL for
> > > > better performance.
> > > >
> > > > Here are some ways for you to install full MKL library if you
> > > > don't have
> > > > one:
> > > > 1. Register and download from intel website:
> > > > https://software.intel.com/en-us/mkl
> > > > 2. Apt-get/yum: currently it need configure Intel’s repositories.
> > > > a.
> > > >
> > > https://software.intel.com/en-us/articles/installing-intel-free-libs
> > > -a
> > > nd-python-yum-repo
> > > > b. https://software.intel.com/en-us/articles/
> > > > thatinstalling-intel-free-libs-and-python-apt-repo

Re: Remove MKLML as dependency

2018-09-20 Thread Chris Olivier
MKLML does not have a complete BLAS library, and if you don’t link in
another BLAS library like OpenBLAS, some functions will blow up (i.e. some
of the linalg functions).

I was not aware of MKLML still being required with MKLDNN. I’ve never
gotten a definitive answer about this from Da, although I’ve asked a couple
of times.

What does Da say about all of this?

Unless there’s good reason to the contrary, removing MKLML and requiring
the larger, strangely licensed standalone MKL for everyone seems a bit
heavy-handed.

On Thu, Sep 20, 2018 at 7:41 AM Lv, Tao A  wrote:

> Hah, seems it's a little confusing here. I think the "Intel MKL" in the
> first statement includes both the full MKL and MKLML library. And the
> "dynamic library" there obviously means the MKLML which is delivered in
> MKL-DNN repo.
>
> MKLML is a subset of full MKL and includes all BLAS functions for both
> single precision and double precision. From this point of view, I think it
> can be used as a BLAS library, but cannot be used as full MKL.
>
> -tao
>
> -Original Message-
> From: Chris Olivier [mailto:cjolivie...@gmail.com]
> Sent: Thursday, September 20, 2018 9:36 PM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: Remove MKLML as dependency
>
> thanks for the info. I am still a little confused — your statement said
> “MKL” and not “MKLML”, so my question is still the same.  Are GEMMS in
> MKLML or just MKL? I know MKLML doesn’t have a blas library like the main
> MKL.
>
> On Wed, Sep 19, 2018 at 11:49 PM Lv, Tao A  wrote:
>
> > Hi Chris, please kindly check the statements here:
> > https://github.com/intel/mkl-dnn#installation
> >
> > " Intel MKL-DNN can take advantage of optimized matrix-matrix
> > multiplication (GEMM) function from Intel MKL. The dynamic library
> > with this functionality is included in the repository. "
> >
> > " You can choose to build Intel MKL-DNN without binary dependency. The
> > resulting version will be fully functional, however performance of
> > certain convolution shapes and sizes and inner product relying on
> > SGEMM function may be suboptimal."
> >
> > -tao
> >
> > -Original Message-
> > From: Chris Olivier [mailto:cjolivie...@gmail.com]
> > Sent: Thursday, September 20, 2018 11:20 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: Remove MKLML as dependency
> >
> > maybe I missed it, but what does MKLML have that mkldnn doesn’t have
> > that makes it necessary?
> >
> > what’s the motivation for removing it?
> >
> > On Tue, Sep 18, 2018 at 11:31 PM Lv, Tao A  wrote:
> >
> > > If you just want to test the performance, I think you need to link MKL
> > > for BLAS and MKL-DNN for NN. Also MKL-DNN should link MKL for better
> > > performance.
> > >
> > > Here are some ways for you to install full MKL library if you don't
> > > have
> > > one:
> > > 1. Register and download from intel website:
> > > https://software.intel.com/en-us/mkl
> > > 2. Apt-get/yum: currently it need configure Intel’s repositories.
> > > a.
> > >
> > https://software.intel.com/en-us/articles/installing-intel-free-libs-a
> > nd-python-yum-repo
> > > b. https://software.intel.com/en-us/articles/
> > > thatinstalling-intel-free-libs-and-python-apt-repo
> > > <https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo>
> > > 3. pip install mkl / mkl-devel: ‘mkl’ package has
> > > the runtime and ‘mkl-devel’ includes everything with the headers
> > > a.
> > > https://software.intel.com/en-us/articles/installing-the-intel-distr
> > > ib ution-for-python-and-intel-performance-libraries-with-pip-and
> > > 4. conda install: also has mkl and mkl-devel
> > > a. https://anaconda.org/intel/mkl
> > > b. https://anaconda.org/intel/mkl-devel
> > >
> > > If you want to redistribute MKL with MXNet, you may need to take care
> > > of the license issue. Currently, MKL is using ISSL (
> > > https://software.intel.com/en-us/license/intel-simplified-software-l
> > > ic
> > > ense
> > > ).
> > >
> > > -Original Message-
> > > From: Zai, Alexander [mailto:alex...@amazon.com.INVALID]
> > > Sent: Wednesday, September 19, 2018 12:49 PM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: Remove MKLML as dependency
> > >
> > > Will test it out tomorrow.
> > >
>

Re: Remove MKLML as dependency

2018-09-20 Thread Chris Olivier
Thanks for the info. I am still a little confused — your statement said
“MKL” and not “MKLML”, so my question is still the same.  Are GEMMs in
MKLML or just MKL? I know MKLML doesn’t have a blas library like the main
MKL.

On Wed, Sep 19, 2018 at 11:49 PM Lv, Tao A  wrote:

> Hi Chris, please kindly check the statements here:
> https://github.com/intel/mkl-dnn#installation
>
> " Intel MKL-DNN can take advantage of optimized matrix-matrix
> multiplication (GEMM) function from Intel MKL. The dynamic library with
> this functionality is included in the repository. "
>
> " You can choose to build Intel MKL-DNN without binary dependency. The
> resulting version will be fully functional, however performance of certain
> convolution shapes and sizes and inner product relying on SGEMM function
> may be suboptimal."
>
> -tao
>
> -Original Message-
> From: Chris Olivier [mailto:cjolivie...@gmail.com]
> Sent: Thursday, September 20, 2018 11:20 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: Remove MKLML as dependency
>
> maybe I missed it, but what does MKLML have that mkldnn doesn’t have that
> makes it necessary?
>
> what’s the motivation for removing it?
>
> On Tue, Sep 18, 2018 at 11:31 PM Lv, Tao A  wrote:
>
> > If you just want to test the performance, I think you need to link MKL
> > for BLAS and MKL-DNN for NN. Also MKL-DNN should link MKL for better
> > performance.
> >
> > Here are some ways for you to install full MKL library if you don't
> > have
> > one:
> > 1. Register and download from intel website:
> > https://software.intel.com/en-us/mkl
> > 2. Apt-get/yum: currently it need configure Intel’s repositories.
> > a.
> >
> https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-yum-repo
> > b. https://software.intel.com/en-us/articles/
> > thatinstalling-intel-free-libs-and-python-apt-repo
> > <https://software.intel.com/en-us/articles/installing-intel-free-libs-
> > and-python-apt-repo> 3. pip install mkl / mkl-devel: ‘mkl’ package has
> > the runtime and ‘mkl-devel’ includes everything with the headers
> > a.
> > https://software.intel.com/en-us/articles/installing-the-intel-distrib
> > ution-for-python-and-intel-performance-libraries-with-pip-and
> > 4. conda install: also has mkl and mkl-devel
> > a. https://anaconda.org/intel/mkl
> > b. https://anaconda.org/intel/mkl-devel
> >
> > If you want to redistribute MKL with MXNet, you may need to take care of
> > the license issue. Currently, MKL is using ISSL (
> > https://software.intel.com/en-us/license/intel-simplified-software-lic
> > ense
> > ).
> >
> > -Original Message-
> > From: Zai, Alexander [mailto:alex...@amazon.com.INVALID]
> > Sent: Wednesday, September 19, 2018 12:49 PM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: Remove MKLML as dependency
> >
> > Will test it out tomorrow.
> >
> > On the side, what is the best way to test MKL build for MXnet. MKL is
> > licensed?
> >
> > Best,
> > Alex
> >
> > On 9/18/18, 7:50 PM, "Lv, Tao A"  wrote:
> >
> > Hi Alex,
> >
> > Thanks for bringing this up.
> >
> > The original intention of MKLML is to provide a light and
> > easy-to-access library for ML/DL community. It's released with MKL-DNN
> > under Apache-2.0 license.
> >
> > AFAIK, MKL-DNN still relies on it for better performance. So I'm
> > afraid there will be a performance regression in MKL pip packages if
> > MKLML is simply removed.
> >
> > Have you ever tried the build without MKLML and how does the
> > performance look like?
> >
> > -tao
> >
> > -Original Message-
> > From: Alex Zai [mailto:aza...@gmail.com]
> > Sent: Wednesday, September 19, 2018 4:49 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Remove MKLML as dependency
> >
> On our build from source page we have a list of BLAS libraries
> > that are recommended:
> > https://mxnet.incubator.apache.org/install/build_from_source.html
> >
> > MKL-DNN
> > MKL
> > MKLML
> > Apple Accelerate
> > OpenBlas
> >
> > MKLML is a subset of MKL (
> https://github.com/intel/mkl-dnn/issues/102)
> > and therefore MKLML users can just use MKL instead. Does anyone
> > see an issue with me removing this? It would simplify our doc page and
> build file.
> >
> > Alex
> >
> >
> >
>

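A rough way to answer the "where do the GEMMs come from" question for any
given binary is to inspect what libmxnet.so actually links at runtime. A
minimal sketch, assuming a Linux build and an illustrative library path
(this is not official MXNet tooling):

    # check_blas.py -- list MKL/BLAS shared libraries linked by libmxnet.so
    import subprocess

    LIB = "lib/libmxnet.so"  # adjust to your build output path

    out = subprocess.run(["ldd", LIB], capture_output=True, text=True)
    for line in out.stdout.splitlines():
        if any(key in line for key in ("mkl", "blas")):
            print(line.strip())

If libmklml_intel.so appears, the SGEMM calls come from the bundled MKLML;
a full MKL install would typically surface as libmkl_rt.so or other
libmkl_* libraries instead.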

Re: Remove MKLML as dependency

2018-09-19 Thread Chris Olivier
maybe I missed it, but what does MKLML have that mkldnn doesn’t have that
makes it necessary?

what’s the motivation for removing it?

On Tue, Sep 18, 2018 at 11:31 PM Lv, Tao A  wrote:

> If you just want to test the performance, I think you need to link MKL for
> BLAS and MKL-DNN for NN. Also MKL-DNN should link MKL for better
> performance.
>
> Here are some ways for you to install the full MKL library if you don't have
> one:
> 1. Register and download from intel website:
> https://software.intel.com/en-us/mkl
> 2. Apt-get/yum: currently you need to configure Intel’s repositories.
> a.
> https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-yum-repo
> b.
> https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo
> 3. pip install mkl / mkl-devel: ‘mkl’ package has the runtime and
> ‘mkl-devel’ includes everything with the headers
> a.
> https://software.intel.com/en-us/articles/installing-the-intel-distribution-for-python-and-intel-performance-libraries-with-pip-and
> 4. conda install: also has mkl and mkl-devel
> a. https://anaconda.org/intel/mkl
> b. https://anaconda.org/intel/mkl-devel
>
> If you want to redistribute MKL with MXNet, you may need to take care of the
> license issue. Currently, MKL is using ISSL (
> https://software.intel.com/en-us/license/intel-simplified-software-license
> ).
>
> -Original Message-
> From: Zai, Alexander [mailto:alex...@amazon.com.INVALID]
> Sent: Wednesday, September 19, 2018 12:49 PM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: Remove MKLML as dependency
>
> Will test it out tomorrow.
>
> As an aside, what is the best way to test the MKL build for MXNet, given
> that MKL is licensed?
>
> Best,
> Alex
>
> On 9/18/18, 7:50 PM, "Lv, Tao A"  wrote:
>
> Hi Alex,
>
> Thanks for bringing this up.
>
> The original intention of MKLML is to provide a light and
> easy-to-access library for the ML/DL community. It's released with MKL-DNN
> under the Apache-2.0 license.
>
> AFAIK, MKL-DNN still relies on it for better performance. So I'm
> afraid there will be a performance regression in MKL pip packages if MKLML
> is simply removed.
>
> Have you ever tried the build without MKLML, and what does the
> performance look like?
>
> -tao
>
> -Original Message-
> From: Alex Zai [mailto:aza...@gmail.com]
> Sent: Wednesday, September 19, 2018 4:49 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Remove MKLML as dependency
>
> On our build from source page we have a list of BLAS libraries that
> are recommended:
> https://mxnet.incubator.apache.org/install/build_from_source.html
>
> MKL-DNN
> MKL
> MKLML
> Apple Accelerate
> OpenBlas
>
> MKLML is a subset of MKL (https://github.com/intel/mkl-dnn/issues/102)
> and therefore MKLML users can just use MKL instead. Does anyone see an
> issue with me removing this? It would simplify our doc page and build file.
>
> Alex
>
>
>


Re: Off-Heap Memory Management in MXNet Scala

2018-09-11 Thread Chris Olivier
do you log on finalize() if the object wasn’t properly freed (i.e.
NDArray.finalize())? Is that available in Scala?

On Tue, Sep 11, 2018 at 6:12 PM Qing Lan  wrote:

> Nice document! Way better than current .dispose() in Scala!
>
> Thanks,
> Qing
>
> On 9/11/18, 6:04 PM, "Chris Olivier"  wrote:
>
> wow, incredible document!
>
> On Tue, Sep 11, 2018 at 2:37 PM Naveen Swamy 
> wrote:
>
> > Hi All,
> >
> > I am working on managing Off-Heap Memory Management and have written
> a
> > proposal here based on my prototype and research I did.
> >
> > Please review the doc and provide your feedback ?
> >
> >
> https://cwiki.apache.org/confluence/display/MXNET/JVM+Memory+Management
> >
> > I had offline discussion with a few people I work with and added
> their
> > feedback to the doc as well.
> >
> > Thanks, Naveen
> >
>
>
>

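The finalize-time leak logging Chris asks about is a common pattern for
wrappers around native memory. A rough, language-neutral sketch in Python
terms (the actual proposal concerns the JVM's finalize()/reference-queue
machinery, and all names here are illustrative):

    class NativeResource:
        # Wraps a native handle that must be explicitly freed.
        def __init__(self, handle):
            self.handle = handle
            self.disposed = False

        def dispose(self):
            if not self.disposed:
                # free the underlying native memory here (e.g. via a C API)
                self.disposed = True

        def __del__(self):
            # Finalizer: warn if the user forgot to call dispose().
            if not self.disposed:
                print("leak warning: handle %r was never disposed" % self.handle)
                self.dispose()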

Re: Off-Heap Memory Management in MXNet Scala

2018-09-11 Thread Chris Olivier
wow, incredible document!

On Tue, Sep 11, 2018 at 2:37 PM Naveen Swamy  wrote:

> Hi All,
>
> I am working on managing Off-Heap Memory Management and have written a
> proposal here based on my prototype and research I did.
>
> Please review the doc and provide your feedback ?
>
> https://cwiki.apache.org/confluence/display/MXNET/JVM+Memory+Management
>
> I had offline discussion with a few people I work with and added their
> feedback to the doc as well.
>
> Thanks, Naveen
>


Re: [RESULT][VOTE] Release MXNet version 1.3.0

2018-09-07 Thread Chris Olivier
nit: using the "-" before people's names makes this kind of hard to read
for me, since "-" is part of "-1"

On Fri, Sep 7, 2018 at 11:18 AM Roshani Nagmote 
wrote:

> Hi All,
>
> So, this vote passes with *seven* +1, *two* 0  and *three* -1 votes.
>
> *+1 votes*
> *Committers:*
> - Joshua Zhang
> - Carin
> - Naveen
> - Indu
> - Haibin
>
> *Community:*
> - Pigeon Lucky
> - Steffen
> *0 votes:*
> *Community:*
> - Thomas
> - Aaron
> *-1 votes:*
> *Committers:*
> - Sandeep
> - Anirudh
>
> *Community:*
> - Hagay
>
> *Vote Thread:*
>
>
> https://lists.apache.org/thread.html/8ad6f14811be465cdf663d6962980fd95e12193626292631a21ec6f1@%3Cdev.mxnet.apache.org%3E
>
>
> I will continue with the release process on general@ and the release
> announcement will follow in the next few days.
>
> Thanks,
> Roshani
>


Re: [VOTE] Release MXNet version 1.3.0.RC0

2018-09-04 Thread Chris Olivier
btw, there are no vetoes on package releases:

VOTES ON PACKAGE RELEASES


Votes on whether a package is ready to be released use majority approval -- i.e.
at least three PMC members must vote affirmatively for release, and there
must be more positive than negative votes. Releases may not be vetoed. Generally
the community will cancel the release vote if anyone identifies serious
problems, but in most cases the ultimate decision lies with the individual
serving as release manager. The specifics of the process may vary from
project to project, but the 'minimum quorum of three +1 votes' rule is
universal.

On Tue, Sep 4, 2018 at 7:12 PM Sheng Zha  wrote:

> Thanks for sharing your opinions, Thomas. Your recognition and respect of
> people's efforts on preparing the release candidate are certainly
> appreciated.
>
> Now that the vote is set to fail thanks to the veto, there will be plenty
> of opportunities to include those bug fixes, including the one Zhi
> mentioned [1], which was already merged in master and which Zhi chose not to
> block this release with [2]. I will be happy to work with Roshani to
> prepare another release candidate once ready.
>
> -sz
>
> [1]
>
> https://lists.apache.org/thread.html/f02e952bec22c82cb00a6741390a78f55373311c97464997bb455a6c@%3Cdev.mxnet.apache.org%3E
> [2]
>
> https://lists.apache.org/thread.html/85d3fcabb3437ba7f1af455cf69aa13eb3afd1ea1d1f6f891e9c339c@%3Cdev.mxnet.apache.org%3E
>
> On Tue, Sep 4, 2018 at 6:02 PM Thomas DELTEIL 
> wrote:
>
> > -0
> > (non-binding)
> >
> > If I may add some nuance plus a personal data point as one of the users
> > commenting in the bug report in question:
> >
> > - Performance vs. Basic functionality => I don't think high performance
> > use-cases and basic functionality are two obviously opposed concepts and
> > see no contradiction in Hagay's and Sandeep's statements.
> > Float16 support is a feature of MXNet that provides more than twice the
> > performance of Float32 on supported platforms, hence the high performance
> > use-case. The bug is that the basic functionality of reloading a saved
> > float16 model is currently broken.
> >
> > - This bug vs other bugs => Contrary to the vast majority of the 140 open
> bugs
> > that are mentioned above, I would put to Sandeep's credit that this one
> bug
> > has a PR open that provides a fix for it. This would make it a better
> > candidate to get included in this release than a bug that has no fix
> ready
> > for it.
> >
> > - Personal datapoint: I recently did some experimentation with float16
> [1]
> > and actually coincidentally just published a video on optimizing
> > performance for Gluon. Float16 conversion is one of the most, if not the
> > most, effective ways to get performance out of MXNet [2]. I believe there
> is
> > a lot of value in publicizing its use more and hence making sure at least
> > the basic support for normal use-cases is present.
> >
> > Of course this needs to be balanced with the overhead of preparing a new
> > release candidate once the fix is reviewed and merged, which seems to
> be
> > a lengthy and complex process in its own right, and the delay with
> > providing the other features present in 1.3 for users that are not
> running
> > off the nightly builds.
> >
> > All the best,
> >
> > Thomas
> >
> > [1] https://github.com/ThomasDelteil/PerformanceTricksMXNetGluon
> > [2]
> >
> >
> https://www.youtube.com/watch?v=Cqo7FPftNyo&t=0s&list=PLkEvNnRk8uVk6U515Pj-jHQUxFC4eDi3m
> >
> > Le mar. 4 sept. 2018 à 17:11, Sheng Zha  a écrit :
> >
> > > Sandeep,
> > >
> > > Thanks for explaining your veto. We have open bugs that impacted a lot
> > more
> > > than just 3 customers, just by referring to the number of commenters on
> > the
> > > issue [1].
> > >
> > > You said that this is for "high performance use cases", which
> contradicts
> > > with Hagay's assessment that this is "basic functionality broken". Given
> > that
> > > this is for advanced use cases of using half-precision training, why is
> > it
> > > so much more important than any other open bug reports, that for this
> > > specific bug fix, we have to delay the access of regular users to the
> new
> > > MXNet 1.3 release by at least another week?
> > >
> > > Honestly, I'm concerned that your vote is biased by Amazon involvement,
> > > given that you quoted Amazon Rekognition.
> > >
> > > -sz
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/incubator-mxnet/issues?q=is%3Aissue+is%3Aopen+label%3ABug+sort%3Acomments-desc
> > >
> > > On Tue, Sep 4, 2018 at 4:51 PM sandeep krishnamurthy <
> > > sandeep.krishn...@gmail.com> wrote:
> > >
> > > > My initial vote of “-0” was due to lack of info from a user who had
> > said
> > > > he overcame this issue for the FP16 model.
> > > >
> > > >
> > > > However, the suggested workaround [1] for the issue is not
> straightforward
> > > and
> > > > generally 
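For context, the float16 save/reload round trip at issue looks roughly like
this (a minimal sketch against the Gluon API of this era; the layer and file
names are illustrative, not the exact reproduction from the bug report):

    from mxnet.gluon import nn

    net = nn.Dense(10, in_units=4)   # fixed in_units: shapes known up front
    net.initialize()
    net.cast('float16')              # convert parameters to half precision
    net.save_parameters('fp16.params')

    net2 = nn.Dense(10, in_units=4)
    net2.cast('float16')
    net2.load_parameters('fp16.params')  # the reload step under discussion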

Re: Updating MXNet's Cub

2018-08-24 Thread Chris Olivier
+1 for pointing to NVidia's repo for the newer Cub and subsequent versions.

On Fri, Aug 24, 2018 at 10:01 AM Hagay Lupesko  wrote:

> Hi all,
>
>
> One of MXNet’s submodule dependencies is a snapshot of Nvidia Cub (
> https://github.com/dmlc/cub) – the snapshot is of an older version of Cub
> (1.7), while the latest Nvidia Cub release is 1.8.  Note that dmlc/cub has
> no customizations of the source Cub repo.
>
>
> I’d like to suggest updating the existing Cub submodule to Nvidia’s Cub
> repo. Instead of the snapshot, MXNet will be using Nvidia’s repo and the
> latest release (both repos have the same BSD-3 license, so licensing should
> not be an issue).
>
>
> Wanted to get feedback from the community to make sure I'm not missing
> anything.
>
> If there are no objections I'll submit a PR for the change.
>
>
> Cheers,
>
> Hagay
>


Re: Proposal: Apache MXNet user group meetings

2018-08-22 Thread Chris Olivier
The wiki lists 2111 Univ Ave, East Palo Alto, which is a body shop and/or
pharmacy across the street from the Amazon building, which is at 2100 Univ
Ave.

On Wed, Aug 22, 2018 at 3:10 PM Marco de Abreu
 wrote:

> Thanks for your proposal, Denis. This sounds great!
>
> I especially like that we are making it easy for other groups to host these
> sessions. That way, everybody can offer support in the schedule and for the
> fields they feel most comfortable with. This makes it possible to directly
> hear about the pain points of our developers and then immediately transfer
> them into actionable items.
>
> Best regards
> Marco
>
> Davydenko, Denis  schrieb am Mi., 22. Aug.
> 2018, 20:04:
>
> > Hello, Apache MXNet community,
> >
> >
> >
> > I would like to submit for your consideration updated and improved
> > suggestion for Apache MXNet User Groups meetings: [1]. Main goals of this
> > initiative are:
> >
> > - make it easy: users and developers should be able to understand what a
> > Group Meeting is and how to participate in it in under 60 seconds
> >
> > - make it real: users and developers will have access to Apache MXNet
> > committers and contributors live and can discuss their topics in person
> >
> > - make it scalable: it should be easy for Apache MXNet community members
> > to join the initiative and host more User Groups meetings on their own
> schedule
> >
> > - make it fun: there should be minimal process to run User Groups
> meetings
> >
> >
> >
> > Please note this wiki page’s purpose is to introduce and describe
> > the initiative for the Apache MXNet community and seek suggestions for
> > improvements/changes. The user-facing document will be separate and will
> > only contain a short paragraph on what User Group meetings are and how to
> > participate (see above: a user should be able to understand it in under 60 seconds).
> >
> >
> >
> > There were a couple of quite successful pilot sessions hosted by a group
> > of Apache MXNet developers in Berlin. Users clearly like this channel of
> > communication and this initiative is an effort to streamline it for
> bigger
> > user participation.
> >
> >
> >
> > Also, we are looking to add a small reference to User Groups meetings into
> > the Apache MXNet documentation to raise user awareness.
> >
> >
> >
> > [1]:
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=87299820
> >
> >
> >
> > --
> >
> > Thanks,
> >
> > Denis
> >
> >
>


Re: [DISCUSS] remove reference to https://gitter.im/dmlc/mxnet

2018-08-12 Thread Chris Olivier
+1

On Sun, Aug 12, 2018 at 2:31 PM Steffen Rochel 
wrote:

> https://github.com/apache/incubator-mxnet/blob/master/README.md still has a
> reference to the gitter channel on dmlc/mxnet. There is not a lot of traffic on
> the gitter channel. I'm suggesting changing the reference to the ASF slack
> channel and discuss.mxnet.io.
> I already sent a message on the gitter channel about how people can move to
> the popular Apache MXNet communication channels.
>
> Comments or concerns?
>
> Steffen
>


Re: How should MXNet treat nan values?

2018-07-21 Thread Chris Olivier
If you behave like numpy for sparse, then things like dividing any sparse
matrix by another sparse matrix will produce a dense matrix with a lot of
NaNs in it wherever it encountered a “missing” value in both the source and
destination positions of the sparse matrices (i.e. 0 divided by 0). If I
remember correctly, this was the process at first (creating a dense matrix
with lots of NaNs, which was equivalent to the same calculation with dense
matrices in numpy) but then was replaced by just assuming that you didn’t
want those NaNs and what you really wanted was a sparse result.

On Sat, Jul 21, 2018 at 12:31 PM Junru Shao  wrote:

> However, I am not 100% sure how much performance will be sacrificed if we
> stick to NumPy's approach, which seems to check numerical exceptions on each
> step.
>
> I believe it will be great if we could make the default setting to be "no
> checking", and leave users an option to turn on these numerical exception
> checks.
>
> On 2018/07/20 22:19:46, Leonard Lausen  wrote:
> > Hello MXNet community,
> >
> > It seems that there is currently no agreed upon principle to handle
> > `nan` values in operators. This has led to inconsistencies between
> > operators and also to inconsistency over releases. Some operators ignore
> > nan values (eg. argmax), others treated it as maximum (e.g. topk up to
> > mxnet v1.2) or just return “undefined” output (e.g. topk starting with
> > mxnet v1.3).
> >
> > Initially the change in topk was reported as a bug
> > (https://github.com/apache/incubator-mxnet/issues/8510) as some users
> > relied on the behavior. However (and rightfully) @asmushetzel, who
> > contributed the improved topk operator for mxnet v1.3 pointed out that
> > the change did not break any documented behavior.
> >
> > To go forward, please share your opinion how MXNet should handle `nan`
> > values. Should we continue to treat the behavior as undefined and
> > possibly silently change it between releases? Should we define a
> > reasonable standard (e.g. follow numpy) and treat operators that deviate
> > as buggy? Should we just document how operators behave currently and
> > warn if the behavior changes? Something else?
> >
> > Please make your opinion known so above issue can be resolved/closed and
> > general guidelines can be defined for future contributions, following
> > whatever consensus emerges.
> >
> > Thanks!
> > Leonard
> >
>
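As a concrete illustration of the 0/0 behavior described above, this is what
NumPy does for dense arrays (a minimal sketch; how MXNet's operators should
behave is exactly what this thread is deciding):

    import numpy as np

    a = np.array([[0.0, 2.0],
                  [0.0, 0.0]])
    b = np.array([[0.0, 1.0],
                  [4.0, 0.0]])

    with np.errstate(divide="ignore", invalid="ignore"):
        c = a / b

    print(c)
    # [[nan  2.]
    #  [ 0. nan]]  -- nan appears wherever both inputs held a "missing" zero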


Re: Remove Caffe functionality

2018-07-19 Thread Chris Olivier
There are more production systems still using Caffe 1 than you may realize.
While it may be easy for individual developers to move off of Caffe, it
will be quite some time before many production systems are off of it,
because, as you can imagine, it’s much more difficult to change a live
system without a good transition plan (which the Caffe integration features
provide). (I can think of several major services at my day job that use Caffe.)

On Thu, Jul 19, 2018 at 1:42 PM Afrooze, Sina  wrote:

> I think it'd be more sustainable long term to expect users to convert
> caffe -> caffe2 -> onnx -> mxnet rather than caffe -> mxnet. Perhaps an MXNet
> tutorial can show how to do this with minimum pain. - Sina
>
>
> On 7/19/18, 1:27 PM, "Steffen Rochel"  wrote:
>
> Hi Mu - do we have any indication of how many users we have of this
> functionality? I see only two github issues from 2017 open.
>
> On Thu, Jul 19, 2018 at 12:54 PM Mu Li  wrote:
>
> > Hi Anton,
> >
> > It's understandable that Caffe is old and its community is
> shrinking. For
> > this very reason, existing caffe users are looking for
> alternatives. The
> > current converter is important for such users to transition to MXNet. It
> > addresses an important question: how can I reuse my previous caffe
> models
> > in MXNet.
> >
> > On Thu, Jul 19, 2018 at 11:53 AM, Anton Chernov  >
> > wrote:
> >
> > > Dear community,
> > >
> > > Currently MXNet has a Caffe framework integration (translator and
> > > converter) [1].
> > >
> > > There were some issues discovered with it, for example some tests
> were
> > > failing [2]. Since we decided to remove the flaky tests and
> proceed to
> > > making them stable, I propose completely removing this functionality
> > > instead.
> > >
> > > There are multiple reasons to this:
> > >
> > > * Mind that this is Caffe 1 (not 2)
> > > * Some people mentioned: "Caffe is soo 2015."
> > > * Keeping functionality that is both unstable and old is a burden
> for
> > > maintenance.
> > > * Keeping functionality that nobody needs is not necessary overall
> > >
> > > Please let me know your thoughts.
> > >
> > > Best
> > > Anton
> > >
> > > [1]
> > > https://github.com/apache/incubator-mxnet/commits/
> > > master/tools/caffe_converter/convert_caffe_modelzoo.py
> > > [2]
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > incubator-mxnet/detail/master/1207/pipeline/
> > >
> >
>
>
>
>
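For users weighing the caffe -> caffe2 -> onnx -> mxnet route suggested
above, the final ONNX-to-MXNet step looks roughly like this (a sketch
assuming MXNet's contrib ONNX importer and an already-converted model.onnx;
the input name varies by model):

    import mxnet as mx

    # Import an ONNX model as an MXNet symbol plus parameter dicts.
    sym, arg_params, aux_params = mx.contrib.onnx.import_model('model.onnx')

    # Bind it like any other symbolic model (data name depends on the model).
    mod = mx.mod.Module(symbol=sym, data_names=['input_0'], label_names=None)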


Re: [VOTE] Subscribe dev@ to Github Activities

2018-07-18 Thread Chris Olivier
to know about github discussions, you’d need to scan all issues and prs
constantly, which isn’t a reasonable expectation. dev is where discussions
are supposed to happen in Apache, PERIOD.

Apache isn’t dmlc. I wish some people would stop trying to turn Apache
conventions into dmlc conventions.  seems this is a constant push from day
one.


On Wed, Jul 18, 2018 at 9:39 AM Sheng Zha  wrote:

> Thanks, I hear the concerns and it's not my intention to push people off
> the list. On the other hand, I think github discussions are no more
> "artificial" than discussions on dev list, and the good and important
> discussions warrant the same amount of attention. With this vote, I intend
> to make contributors' life easier by decoupling the recognized forum from
> the technology they use, so that github contributors can easily
> communicate with the community on the list.
>
> -sz
>
> On Wed, Jul 18, 2018 at 9:05 AM, Barber, Christopher <
> christopher.bar...@analog.com> wrote:
>
> > Can't you simply tell contributors to discuss changes on dev before
> > submitting a PR? Since the contribution guidelines don't tell developers
> to
> > post to dev, why would you expect them to do that?
> >
> > Is there an easy way to just subscribe to PR notifications or will
> someone
> > have to write a filter to avoid spamming dev with all GitHub
> notifications?
> > I think that if dev gets too much traffic, then people with casual
> interest
> > may find it easier to unsubscribe than to set up filters. Once someone
> > unsubscribes, they probably won't be coming back soon, so you should be
> > very careful with this.
> >
> > I don't see why artificially increasing the traffic on dev will do
> > anything to grow the community in any case.
> >
> > - C
> >
> > On 7/18/18, 11:17 AM, "Indhu"  wrote:
> >
> > Some mentors/contributors/committers feel that the amount of
> > discussion on the
> > dev list is too low given the number of commits that happen, and more
> > discussions need to happen on the dev list to grow the community.
> >
> > In response some committers feel discussions actually happen in
> GitHub
> > PRs.
> > If the policy says "if it didn't happen in dev, it didn't happen",
> > let's
> > forward all GitHub discussions to dev so those discussions would
> count.
> > That's the motivation for this vote.
> >
> > I think when people say there needs to be more discussions in the dev
> > list,
> > I assume they mean the kind of discussions that happen *before* a PR
> is
> > created or even before someone starts working on anything. I don't
> > think
> > people are asking for an email for every activity on GitHub. The correct
> > way to
> > address the problem would be for committers/contributors to stop
> > communicating in private channels (like Amazon or DMLC communication
> > channels) and do those discussions in the dev list instead.
> >
> > Indu
> >
> >
> > On Wed, Jul 18, 2018, 5:51 AM Barber, Christopher <
> > christopher.bar...@analog.com> wrote:
> >
> > > Can't people already subscribe to github notifications? I think it
> > is safe
> > > to assume that developers are already smart enough to figure out
> how
> > to do
> > > that if they want. What problem are you really trying to solve
> here?
> > >
> > > On 7/18/18, 4:49 AM, "Chris Olivier" 
> wrote:
> > >
> > > -1.  (changed from -0.9)
> > >
> > > seems more like a strategy (whether intentional or on accident)
> > to
> > > *not*
> > > have design discussions on dev by flooding it with noise and
> > then later
> > > claim it was discussed, even though you would have to sift
> > through
> > > thousands of emails to find it.
> > >
> > >
> > >
> > > On Wed, Jul 18, 2018 at 12:42 AM Rahul Huilgol <
> > rahulhuil...@gmail.com
> > > >
> > > wrote:
> > >
> > > > I pulled up some more stats so we can make an informed
> > decision.
> > > >
> > > > Here are some popular Apache projects and the number of
> emails
> > to
> > > their
> > > > dev@
> > > > list in the last 30 days
> > > > Apache Flink: 540 mails
> > 

Re: [VOTE] Subscribe dev@ to Github Activities

2018-07-18 Thread Chris Olivier
-1.  (changed from -0.9)

seems more like a strategy (whether intentional or accidental) to *not*
have design discussions on dev by flooding it with noise and then later
claim it was discussed, even though you would have to sift through
thousands of emails to find it.



On Wed, Jul 18, 2018 at 12:42 AM Rahul Huilgol 
wrote:

> I pulled up some more stats so we can make an informed decision.
>
> Here are some popular Apache projects and the number of emails to their
> dev@
> list in the last 30 days
> Apache Flink: 540 mails
> Apache Spark: 249 mails
> Apache Hive: 481 mails
> Apache HBase: 300 mails
>
> Current dev list for MXNet: 348 mails
> Current commits list for MXNet: 5329 mails
> Making the proposed dev list for MXNet to be ~5677 mails.
>
> Sheng, even going by your comments that 1 out of those 4 mails is relevant
> for dev@, that's still a really high number of emails. (130 email lists
> doesn't say anything if we ignore the actual number of emails in those
> lists, especially when the 131st sends this many mails :) ). People are
> already talking about setting up filters here. Doesn't that defeat the
> purpose by making people filter out the discussion on Github? People can
> subscribe to commits@ if they find it more convenient to follow Github
> activity over email rather than Github.com.
>
> We should strive to maintain dev@ as a place for high quality discussion.
> It's upto the contributors to bring up something to dev@ if they believe
> it
> deserves a focused discussion in the community. That discussion may be
> started by the person who proposes code changes, or a reviewer who believes
> that a particular code change warrants further discussion.
>
> Regards,
> Rahul
>


Re: [VOTE] Subscribe dev@ to Github Activities

2018-07-18 Thread Chris Olivier
-0.9

Do any other Apache projects do this? Seems really odd. Jira was posting to
dev for maybe 3 days and people were complaining like crazy about the
noise, and that was just a few tickets. Now we’re talking about possibly
hundreds of emails per day. ALL PR comments, commit notifications, issue
movement, tagging, etc.

It’s hard to imagine how this would be useful.

Also, does this mean one can claim that anything said or done in
github “was discussed on dev”?

-C

On Tue, Jul 17, 2018 at 2:24 PM Sheng Zha  wrote:

> Thanks, Rahul. Out of the 4 conversations you listed that you think are not
> necessary, I actually think the PR on coreml tool may be worth discussing.
> For example, should it (and other tools) have a separate repo, and should
> its version management be tied to mxnet.
>
> And on:
>
> > If people are forced to setup filters to parse these mails, then we are
> *ensuring*
> people don't get their eyes on valuable discussions on dev@.
>
> I think this argument is based more on emotion than on reason. I subscribe
> to over 130 email lists for work, lots of which have PR/commit updates that
> are not my immediate concern, and it hasn't prevented me from reading
> valuable discussions.
>
> -sz
>
> On Tue, Jul 17, 2018 at 1:05 PM, Rahul Huilgol 
> wrote:
>
> > -1
> >
> > We had such a thing before and people asked for the mails to be
> redirected
> > to a different list, commits@, because of the flood of mails.
> >
> > https://lists.apache.org/thread.html/8b834e39110381fadb8a0ab59185a8
> > f52b8406247a1f281f7d691392@%3Cdev.mxnet.apache.org%3E
> >
> > I don't know if people have a sense of the volume of mails this can add
> > here. Here are the stats from the commits@ email list we have. I'd be
> > curious
> > to see how many subscribers we have to that. Hopefully the people voting
> +1
> > here subscribed to that :)
> >
> > 2018 June: 4617
> > 2018 July: (half a month) 3106
> > (Source of the numbers are here
> > https://lists.apache.org/list.html?comm...@mxnet.apache.org:2018-7)
> >
> > @Joshua: yes we need to bring 'valuable' (emphasis mine) discussion to a
> > centralized place @dev. Does everything need to be sent to dev@? For
> > example, consider these recent PRs, why is it necessary for them to be
> > forwarded to dev@?
> >
> > fix flaky test test_operator_gpu.test_countsketch:
> > https://github.com/apache/incubator-mxnet/pull/11780
> > Update PyPI version number:
> > https://github.com/apache/incubator-mxnet/pull/11773
> > Fix file name creation for Windows:
> > https://github.com/apache/incubator-mxnet/pull/11765
> > [MXNET-8230] test_operator_gpu.test_rms fails:
> > https://github.com/apache/incubator-mxnet/pull/11749
> >
> > If people are forced to setup filters to parse these mails, then we are
> > *ensuring* people don't get their eyes on valuable discussions on dev@.
> >
> > Regards,
> > Rahul
> >
> > On Tue, Jul 17, 2018 at 12:49 PM, Sheng Zha  wrote:
> >
> > > FWIW: "from:notificati...@github.com AND
> to:dev@mxnet.incubator.apache.
> > org
> > > AND NOT to:me" but I'm sure you get the gist :)
> > >
> > >
> > > Opt-in model applies to individuals rather than the dev list, because
> the
> > > dev list is intended as an asynchronous way for newcomers to easily
> > follow
> > > past technical discussions, and is the only place recognized by apache
> > for
> > > these discussions. Currently, lots of high quality technical
> discussions
> > > that are happening on github are lost and not archived here. The
> > procedural
> > > change in this vote is intended to bridge this gap. Besides, it's
> more
> > > likely for new contributors to know how to filter emails than to know
> how
> > > to "opt-in".
> > >
> > >
> > > More discussion is welcome in the linked discussion thread.
> > >
> > >
> > > -sz
> > >
> > > On Tue, Jul 17, 2018 at 12:37 PM, pracheer gupta <
> > > pracheer_gu...@hotmail.com
> > > > wrote:
> > >
> > > > FWIW: The filter needs to be more complicated than just "
> > > > from:notificati...@github.com". After all, if someone mentions me
> > > > directly in a PR thread and/or I subscribe to only a particular PR,
> those
> > > > emails will also come from "notificati...@github.com". There are
> ways
> > > > around that though.
> > > >
> > > >
> > > > It might be good to mention this filter in some wiki/webpage
> somewhere;
> > > > it may save some effort for people trying to find the right set of
> > filters.
> > > It
> > > > could even be in the welcome email when one subscribes to this
> > > email-list.
> > > >
> > > >
> > > > Another alternate option: How about choosing an opt-in model rather
> > than
> > > > an opt-out model? Having another email list and anyone can subscribe
> to
> > > it
> > > > if they wish.
> > > >
> > > >
> > > > Not sure if there is a perfect answer out there for this but in
> > principle
> > > > I agree that it will be good to have "push notifications" for all
> > > PRs/issues.
> > > >
> > > >
> > > > -Pracheer
> > > >
> > > > 

Re: [VOTE] Release MXNet version 1.2.1.RC0 (Patch Release)

2018-06-26 Thread Chris Olivier
+1

On Tue, Jun 26, 2018 at 5:50 PM Anirudh  wrote:

> Hi all,
>
> The current warning message for save_params: "save_params is deprecated,
> use save_parameters instead" is misleading
> for users who may use the API to load into SymbolBlock. To make it clearer
> for all users, a better warning message would include the alternative to
> save_params and a reference to detailed documentation.
> The message improvement is important, but it does not block moving forward
> to complete the voting process for the MXNet v1.2.1 patch release.
> We plan to follow up with a 1.2.2 patch release to improve the message and
> potentially include other bug fixes.
> Please let me know if you have any thoughts, questions or suggestions.
>
>
> Anirudh
>
> On Mon, Jun 25, 2018 at 10:44 PM, Anirudh  wrote:
>
> >
> > Hi Mu,
> >
> > Thanks for bringing this up and hopefully this should answer Sheng's
> > question.
> > Thomas pointed out something similar in the PR here for the warning
> > message which I didn't notice back then:
> > https://github.com/apache/incubator-mxnet/pull/11127
> >
> > Not sure about the reasoning to not add it and if there was an offline
> > discussion about this between Thomas and Erik.
> > It would be nice if you guys could pitch in if there were any strong
> > reasons.
> >
> > I understand that a more informed warning when using save_params would
> > really avoid some customer frustration.
> > Having said that, I am a little worried about the timeline though since
> > some customers are eagerly waiting for the release of 1.2.1.
> > Another RC would delay it by at least one and a half weeks.
> >
> > Anirudh
> >
> > On Mon, Jun 25, 2018 at 9:54 PM, Mu Li  wrote:
> >
> >> Detailed documents should help, but the current warning message that
> >> "save_params is deprecated, use save_parameters instead" is not
> sufficient.
> >>
> >> Some details about the API changes:
> >>
> >> 1. v1.2 changed the implementation of "save_params", which is explained
> in
> >> the release note. The main benefit for this change is that we don't need
> >> to
> >> create layers within a name scope. [1]
> >> 2. we found this change breaks a gluon-to-symbol usage, even though we
> >> recommended users to use "export" for this usage. [2]
> >> 3. for some good reasons we made a decision to revert save_params in
> >> v1.2.1, and introduced a new API called save_parameters for this new
> >> behavior. [3]
> >>
> >> Since calling save_params each time will generate a warning message,
> it's
> >> a
> >> major API change. The recommended ways for users to update their code are:
> >>
> >> 1. If you save parameters to load back into a SymbolBlock, you can use
> >> export instead, though keeping it will not break your code except for a
> >> warning message. (But it will break in v1.2)
> >> 2. If you create gluon layers without a name scope, you must replace
> >> save_params with save_parameters. Otherwise, your model cannot be loaded
> >> back in v1.2.1 (though it works in v1.2)
> >> 3. For the rest case, such as models are created within a name scope,
> and
> >> the models are loaded into gluon (not symbolblock) later, recommend
> >> replacing save_params with save_parameters. If you don't do it, nothing
> >> will break in v1.2 and v1.2.1, but v1.2.1 will give you a warning
> message.
> >>
> >> These API changes in v1.2 and v1.2.1 are pretty tricky. Anirudh did a
> great
> >> job in capturing them in release notes. But I feel it's hard for users
> to
> >> understand the impacts. I suggest improving the warning message to "use
> >> export if you want to load into SymbolBlock, otherwise use
> >> save_parameters.
> >> For more details, refer to this URL".
> >>
> >> [1] https://github.com/apache/incubator-mxnet/releases/tag/1.2.0
> >> [2] https://github.com/apache/incubator-mxnet/issues/11091
> >> [3] https://github.com/apache/incubator-mxnet/pull/11127
> >>
> >> On Mon, Jun 25, 2018 at 9:23 PM, Sheng Zha  wrote:
> >>
> >> > Wouldn’t this break users who are on 1.2.0 and used our API correctly?
> >> Why
> >> > do we have to revert load_params, given that it’s backward compatible?
> >> >
> >> > -sz
> >> >
> >> > > On Jun 25, 2018, at 6:30 PM, Anirudh  wrote:
> >> > >
> >> > > Hi,
> >> > >
> >> > > 1.2.1 (load_params) is backward compatible with 1.1.0 not with
> 1.2.0.
> >> > > It does not adhere exactly with semver but it had to be made, to
> >> quickly
> >> > > help our customers who were using the APIs incorrectly.
> >> > >
> >> > > Anirudh
> >> > >
> >> > >> On Mon, Jun 25, 2018 at 5:42 PM, Sheng Zha 
> >> wrote:
> >> > >>
> >> > >> save_parameters didn't exist in 1.2.0 so its addition usually isn't
> >> > >> supposed to happen in a patch release if we stick to semantic
> >> > versioning. I
> >> > >> couldn't find a discussion on this exception. Did it happen?
> >> > >>
> >> > >> Would people who used 1.2.0 to save models be able to load
> >> parameters in
> >> > >> 1.2.1 using the reverted load_params? (i.e. is it backward
> >> compatible)
> 
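A short sketch of the two usage patterns Mu distinguishes above (Gluon API
as of 1.2.1; the tiny model and file names are illustrative):

    import mxnet as mx
    from mxnet import gluon
    from mxnet.gluon import nn

    net = nn.HybridSequential()
    with net.name_scope():
        net.add(nn.Dense(10))
    net.initialize()
    net.hybridize()
    net(mx.nd.ones((1, 4)))  # run once so the symbolic graph exists

    # Pattern 1: parameters only, loaded back into the same Gluon class.
    net.save_parameters('net.params')
    net2 = nn.HybridSequential()
    with net2.name_scope():
        net2.add(nn.Dense(10))
    net2.load_parameters('net.params')

    # Pattern 2: export symbol + params, then load into a SymbolBlock.
    net.export('net')  # writes net-symbol.json and net-0000.params
    net3 = gluon.SymbolBlock.imports('net-symbol.json', ['data'],
                                     'net-0000.params')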

Re: [RELEASE][VOTE] Release MXNet version 1.2.1.RC0

2018-06-22 Thread Chris Olivier
what does “binding” mean in this context?

On Fri, Jun 22, 2018 at 9:15 PM Sergio Fernández  wrote:

> Just wanted to refresh what
> https://incubator.apache.org/guides/ppmc.html#ppmc_and_binding_votes says:
> "The only time when a PPMC member’s vote is binding is for the addition of
> new PPMC members and committers. Release votes are only binding to IPMC
> members.".
>
> So it's incorrect to mark those votes as binding in the RESULT email.
>
>
> On Fri, Jun 22, 2018, 17:38 Chris Olivier  wrote:
>
> > what do you mean? just curious.
> >
> > On Fri, Jun 22, 2018 at 4:44 PM Sergio Fernández 
> > wrote:
> >
> > > Please, notice PPMC votes are not binding.
> > >
> > > On Fri, Jun 22, 2018, 09:35 Anirudh  wrote:
> > >
> > > > Hi all,
> > > >
> > > > Apologies for replying instead of sending out a new email.
> > > >
> > > > This vote has passed with 6 +1s:
> > > >
> > > > Binding:
> > > > Sandeep
> > > > Haibin
> > > > Indhu
> > > >
> > > > Non Binding:
> > > > Carin
> > > > Pedro
> > > > Lai
> > > >
> > > > I will proceed with the vote on general@.
> > > >
> > > > Thanks,
> > > > Anirudh
> > > >
> > >
> >
>

