Re: [LAZY VOTE][RESULT] Upgrade CI to CUDA 9.1 with CuDNN 7.0

2018-05-16 Thread Haibin Lin
Is there a plan for adding those CUDA 8 tests back to CI? What about CUDA 7?

There were a few build problems in the past few weeks due to lack of CI
coverage:
- https://github.com/apache/incubator-mxnet/pull/10710 was found during
1.2 rc voting
- https://github.com/apache/incubator-mxnet/issues/10981 was reported by
a user with CUDA 7

Having these covered in CI will help catch such issues early. I don't recall
whether we decided to drop CUDA 7 support for MXNet.

Best,
Haibin

On Wed, Mar 21, 2018 at 6:32 AM, Marco de Abreu <
marco.g.ab...@googlemail.com> wrote:

> Hello,
>
> the migration has just been completed, and we're now running our UNIX-based
> slaves on CUDA 9.1 with CuDNN 7. The commit is available at
> https://github.com/apache/incubator-mxnet/commit/b0a6760efa141aeca87b03ecf34dae924bd1af46.
>
> No jobs have been interrupted by this migration. If you encounter any
> errors, please reach out to me.
>
> Best regards,
> Marco
>
> On Tue, Mar 20, 2018 at 11:20 PM, Marco de Abreu <
> marco.g.ab...@googlemail.com> wrote:
>
> > Hello,
> >
> > the results of this vote are as follows:
> >
> > +1:
> > Jun
> > Anirudh
> > Hao
> > Marco
> >
> > 0:
> > Chris
> >
> > -1:
> > Naveen (veto recalled as of
> > https://lists.apache.org/thread.html/242db72a0c96349ef6e0ff1d3b1fe0dc7f7a9082532724c3293666c5@%3Cdev.mxnet.apache.org%3E)
> >
> > Under the constraint that we will use CUDA 8 on Windows and CUDA 9.1 on
> > UNIX slaves and work on integration tests for CUDA 8 in the long term,
> > this vote counts as PASSED.
> >
> > The PR for this change is available at
> > https://github.com/apache/incubator-mxnet/pull/10108. I have developed
> > and tested the new slaves in our test environment, and everything looks
> > promising so far. The plan is as follows:
> >
> >1. Get https://github.com/apache/incubator-mxnet/pull/10108 approved
> >to allow self-merge – CI can’t pass until slaves have been upgraded.
> >2. Replace all existing slaves with new upgraded slaves.
> >3. Retrigger https://github.com/apache/incubator-mxnet/pull/10108 to
> >merge necessary changes into master.
> >
> > IMPORTANT: The migration will happen tomorrow, so please expect some
> > delay in job execution; the CI website will be unaffected. Ideally, no
> > jobs should fail. In case they do, please feel free to retrigger them
> > using an empty commit, as shown below. In case of any errors appearing
> > after the upgrade, don't hesitate to contact me!
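> >
> > For reference, a retrigger with an empty commit can look like this (plain
> > git, no MXNet-specific tooling assumed):
> >
> >     # create a commit with no changes to kick off a new CI run
> >     git commit --allow-empty -m "Retrigger CI"
> >     git push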
> >
> > Best regards,
> > Marco
> >
> >
> > On Tue, Mar 20, 2018 at 1:39 AM, Naveen Swamy wrote:
> >
> >> Yes, for short-term.
> >>
> >> On Monday, March 19, 2018, Chris Olivier wrote:
> >>
> >> > In the short term, Naveen, are you OK with Linux running CUDA 9 and
> >> > Windows CUDA 8 in order to get CUDA version coverage?
> >> >
> >> > On 2018/03/16 21:09:09, Marco de Abreu wrote:
> >> > > Thanks for your input. How would you propose to proceed in terms of
> >> > > a timeline in case this vote succeeds? I don't really have time to
> >> > > work on a nightly setup right now. Would anybody in the community be
> >> > > able to help me out here, or shall we wait with the migration until a
> >> > > nightly setup for CUDA 8 is up?
> >> > >
> >> > > -Marco
> >> > >
> >> > > On Fri, Mar 16, 2018 at 9:55 PM, Bhavin Thaker
> >> > > <bhavintha...@gmail.com> wrote:
> >> > >
> >> > > > +1 to the suggestion of testing CUDA 8 in a few nightly instances
> >> > > > and using CUDA 9 for most instances in CI.
> >> > > >
> >> > > > Bhavin Thaker.
> >> > > >
> >> > > > On Fri, Mar 16, 2018 at 12:37 PM Naveen Swamy wrote:
> >> > > >
> >> > > > > I think it's best to add support for CUDA 9.0 while retaining
> >> > > > > existing support for CUDA 8; the code might regress if you remove
> >> > > > > it, and it would create more work to add CUDA 8 support back.
> >> > > > >
> >> > > > > On Fri, Mar 16, 2018 at 9:29 AM, Marco de Abreu <
> >> > > > > marco.g.ab...@googlemail.com> wrote:
> >> > > > >
> >> > > > > > Yeah, sorry Chris, mixed up the names.
> >> > > > > >
> >> > > > > > @Naveen: Would you be fine with doing the switch now and adding
> >> > > > > > integration tests later, or is this a hard constraint for you?
> >> > > > > >
> >> > > > > > On Wed, Mar 14, 2018 at 6:39 PM, Chris Olivier
> >> > > > > > <cjolivie...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > Isn't the Titan V the Volta and not the Tesla?
> >> > > > > > >
> >> > > > > > > On Wed, Mar 14, 2018 at 10:36 AM, Naveen Swamy
> >> > > > > > > <mnnav...@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Marco,
> >> > > > > > > > My -1 vote is for dropping support for CUDA 8, not for
> >> > > > > > > > adding CUDA 9. CUDA 9.0 support for MXNet was added on
> >> > > > > > > > Oct 30, 2017; I think that all

Re: MKL with 1.2.0rc3

2018-05-16 Thread Naveen Swamy
I understand that MKLML was experimental, but as you pointed out it has
been around for a while, so I am presuming some bugs would have been fixed.
Users have also seen about a 30% performance boost with MKL. Now, with the
latest release, we are dropping support and forcing users onto another
experimental option which we know (from all the discussion during the
release) is not stable. Also, OpenBlas=MKL does not bring all the features
of MKLML, and MKLML will have a smaller-sized library, per this response by
an Intel member: https://github.com/intel/mkl-dnn/issues/102.

IMHO this is not good practice. I also wish that a discussion had happened
on the dev list before we dropped support, so that more people could
provide feedback.



On Wed, May 16, 2018 at 11:54 AM, Anirudh  wrote:

> Hi Naveen,
>
> There was an earlier flag for using MKLML, which was a subset of MKL. I
> think this flag was removed when MKLDNN was added.
> I think the only replacement for the old MKLML is the MKLDNN feature that
> was added. (I am assuming this also installs MKLML.)
> So if users don't want to use the MKLDNN feature, they will have to install
> the full MKL library and set USE_BLAS=mkl, if I am not wrong.
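>
> As a rough sketch of that build (USE_BLAS=mkl comes from this thread; the
> USE_MKLDNN flag name is an assumption on my part, so please check
> make/config.mk for the exact spelling):
>
>     # build against a full MKL install, with the MKLDNN path disabled
>     make -j$(nproc) USE_BLAS=mkl USE_MKLDNN=0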
>
> I would also like to understand more from others about the best way to
> transition for users who were using the MKLML flag earlier but don't want
> to use the MKLDNN flag now. (Both flags were experimental, but the MKLML
> flag had been around for some time.)
>
> Anirudh
>
>
> On Wed, May 16, 2018 at 11:38 AM, Naveen Swamy  wrote:
>
> > How do I build the latest release (1.2.0rc3) with MKL enabled? It looks
> > like it's now defaulting to the experimental MKLDNN. I do not want to use
> > MKLDNN as it is experimental.
> >
> > -Naveen
> >
>


installation page UX

2018-05-16 Thread Aaron Markham
Hi,
I've written some notes on the wiki about issues with the installation page
along with some suggestions. I'd appreciate your feedback here or on the
wiki.

https://cwiki.apache.org/confluence/display/MXNET/Installation+page+UX

Cheers,
Aaron


Re: Auto scaling for MXNet CI

2018-05-16 Thread Marco de Abreu
Thanks a lot!

The following numbers are based on our experience in the test environment.
- Best case: ~1:50h (unchanged; 0:01 + 0:38 + 0:39 + 0:33 + 0:03).
  Conditions: no instances have to be provisioned and caches are primed.
- Average case: ~2:10h (1:50h + 0:10 for instance startup + 0:10 for cache
  loading). Conditions: Windows instances are available (they get turned off
  less frequently), while Ubuntu instances have to be provisioned and the
  cache is not primed.
- Worst case: ~3:06h (1:50h + 0:02 + 0:50 + 0:20 + 0:02 + 0:02).
  Conditions: no instances available.

The bottleneck in the worst case is the Windows instances: they take about
20 minutes to start, and the unprimed MSVC cache adds about 30 minutes to
build times. To balance this out, we're driving a less aggressive
downscaling policy for Windows and using increased buffers. At the same
time, we're currently working on persistent build caches. An additional
option could be reserved instances.

We will observe the situation and then assess the required next steps. For
now, we want to make sure everything is running stably and no builds are
getting interrupted.

Best regards,
Marco

On Wed, May 16, 2018 at 3:47 AM, Thomas DELTEIL 
wrote:

> Great news :) thanks Marco!
>
> On Tue, May 15, 2018, 18:29 Steffen Rochel 
> wrote:
>
> > Thanks Marco, good step forward.
> > What is the expected, typical, and worst-case turnaround time for PR checks?
> >
> > Steffen
> >
> > On Tue, May 15, 2018 at 10:49 AM Marco de Abreu <
> > marco.g.ab...@googlemail.com> wrote:
> >
> > > Hello,
> > >
> > > I'd like to announce the deployment of auto scaling for our CI system
> > > (see [1] for reference; setup documentation at [2]) for today at
> > > 11:00PM PST 05/15/18. I expect no downtime, since these changes are
> > > happening outside of Jenkins.
> > >
> > > This system will increase the flexibility of our CI to handle the
> > > increasing load resulting from the growing number of great
> > > contributions! In the future, our CI will automatically adapt to the
> > > current load and will thus support big tasks like the to-be-migrated
> > > nightly tests or increased load before a release. Additionally, we're
> > > now able to provide scalable p3.2xlarge instances and can add new
> > > instance types without much effort.
> > >
> > > In the future, you will see new slaves being started as the queue
> > > grows and stopped when they go idle. Your tasks will automatically be
> > > picked up, and our system makes sure every PR gets processed as fast
> > > as possible.
> > >
> > > If you encounter any issues in the next week, please don't hesitate to
> > > reach out to me. I'm looking forward to everyone's feedback!
> > >
> > > Best regards,
> > > Marco
> > >
> > >
> > > [1]: https://cwiki.apache.org/confluence/display/MXNET/Proposal%3A+Auto+Scaling
> > > [2]: https://cwiki.apache.org/confluence/display/MXNET/Setup
> > >
> >
>


Abandoned MXNet readthedocs site

2018-05-16 Thread Hen
It seems we (Apache MXNet) have an old documentation site posted at:

https://newdocs.readthedocs.io/en/latest/

and only the absent @Awyan has permissions on the site. See
https://github.com/apache/incubator-mxnet/issues/10409 for more details.

When we raised this with readthedocs.org, we were asked to submit a DMCA
request to take down the account:

https://github.com/rtfd/readthedocs.org/issues/4042

Given that our docs are Apache-licensed, I suspect it's really a trademark
takedown (confused users, etc.). Does anyone here have any guidance on this
situation? Has anyone had this issue with readthedocs before? Should we be
sending them a trademark takedown request?

Thanks,

Hen


Re: MXNet 1.2 release question ?

2018-05-16 Thread Anirudh
Hi Hen,

The project didn't publish any such guaranteed timeline for releases. What
I meant by "late with the release" was that this release has taken more
time than any previous release that I can recall.
Anirudh

On Tue, May 15, 2018 at 9:43 PM, Hen  wrote:

> I don’t understand what “late with the release” means. Did the project
> publish a timeline of releases that it guaranteed or something?
>
> On Mon, May 14, 2018 at 10:11 AM Anirudh  wrote:
>
> > Hi all,
> >
> > Our vote got blocked on the general@ list. The main reason was the
> > inclusion of a file with a Category B license in non-binary form:
> > "3rdparty/googletest/googlemock/docs/DevGuide.md".
> > This file needs to be removed from the source.
> >
> > There are two options we have right now:
> >
> > 1. Remove the DevGuide.md from source and call for a new vote.
> > 2. Remove the DevGuide.md, fix some other license issues, and also fix
> > some other issues: compilation issues on Mac OS X 10.10 or older, push
> > other fixes, etc.
> >
> > The first option would happen much quicker, since there is no change in
> > the source apart from the removal of the DevGuide.md file, and we can
> > close the vote sooner once quorum is reached.
> >
> > The second option will have to go through the full vote cycle again but
> > will fix some issues with the previous RC.
> >
> > I am currently inclined to do option 1, since I don't see the compilation
> > issue on older versions of Mac as a critical one, and we are already
> > considerably late with the release.
> >
> > I would like to know what you guys think.
> >
> > Anirudh
> >
>