Re: CI system seems to be using python3 for python2 builds

2017-09-27 Thread Gautam
Hi Ozawa,

  Thanks for the follow-up.
  Unfortunately I didn't get time to work on this today.

However, I have a couple of points to mention.
1. It looks like this backtrace has been present for a long time; since it was
not a test failure or a build failure, we never got notified about it. Here
is a recent build log where the backtrace is present but the build succeeded.

2. I don't think the default version of Python on Ubuntu is 3.x. I logged
into one of the Apache slaves and the default version of Python there is 2.7.6
(a quick way to double-check this is sketched below).

3. There has been a slight change in the Jenkinsfile where we tried to
parallelize the python2 and python3 test runs. I am not sure whether that has
any effect; I can probably scrub the build log and figure out if that's the
case.
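
For reference, here is a quick way to confirm which interpreter a test step is
actually picking up (a minimal, CI-agnostic sketch; run it at the start of the
suspect step):

    import sys

    # The python2 jobs should report a 2.7.x interpreter here; the failing
    # runs in the linked build report /usr/bin/python3 instead.
    print(sys.executable)
    print(".".join(str(v) for v in sys.version_info[:3]))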


Feel free to send the PR, if you have it ready.


-Gautam


On Wed, Sep 27, 2017 at 9:39 PM, Tsuyoshi Ozawa  wrote:

> Hi Kumar,
>
> Thanks for looking into the issue. How is the progress of this problem?
> Shouldn't we call /usr/bin/env python2 or python2.7 in the following
> source files instead of python, since MXNet only supports python2
> currently?
> I think the default version of python in Ubuntu is now python3, which could
> be causing the problem.
> If you have not yet done the work, I can create a PR for that this
> weekend.
>
> ./python/mxnet/__init__.py:#!/usr/bin/env python
> ./python/mxnet/log.py:#!/usr/bin/env python
> ./tests/nightly/dist_lenet.py:#!/usr/bin/env python
> ./tests/nightly/dist_sync_kvstore.py:#!/usr/bin/env python
> ./tests/nightly/multi_lenet.py:#!/usr/bin/env python
> ./tests/nightly/test_kvstore.py:#!/usr/bin/env python
> ./tools/coreml/mxnet_coreml_converter.py:#!/usr/bin/env python
> ./tools/ipynb2md.py:#!/usr/bin/env python
> ./tools/kill-mxnet.py:#!/usr/bin/env python
> ./tools/launch.py:#!/usr/bin/env python
> ./tools/parse_log.py:#!/usr/bin/env python
>
> On Wed, Sep 27, 2017 at 5:39 PM, Sunderland, Kellen 
> wrote:
> > Many thanks Gautam.
> >
> > On 9/26/17, 8:37 PM, "Kumar, Gautam"  wrote:
> >
> > Hi Kellen,
> >
> >    This issue has been happening for the last 3-4 days along with a few
> > other test failures.
> > I am looking into it.
> >
> > -Gautam
> >
> > On 9/26/17, 7:45 AM, "Sunderland, Kellen"  wrote:
> >
> > I’ve been noticing in a few failed builds that the stack trace
> indicates we’re actually running python 3.4 in the python 2 tests. I know
> the CI folks are working hard getting everything setup, is this a known
> issue for the CI team?
> >
> > For example: https://builds.apache.org/
> blue/organizations/jenkins/incubator-mxnet/detail/PR-8026/3/pipeline/281
> >
> > Steps Python2: MKLML-CPU
> >
> > StackTrace:
> > Stack trace returned 10 entries:
> > [bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(_
> ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fadb8999aac]
> > [bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(_
> ZN5mxnet7kvstore12KVStoreLocal12GroupKVPairsISt4pairIPNS_
> 7NDArrayES4_EZNS1_19GroupKVPairsPullRspERKSt6vectorIiSaIiEERKS7_IS6_SaIS6_
> EEPS9_PS7_ISD_SaISD_EEEUliRKS6_E_EEvSB_RKS7_IT_SaISN_EESG_PS7_ISP_SaISP_EERKT0_+0x56b)
> [0x7fadba32c01b]
> > [bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(_
> ZN5mxnet7kvstore12KVStoreLocal17PullRowSparseImplERKSt6vecto
> rIiSaIiEERKS2_ISt4pairIPNS_7NDArrayES8_ESaISA_EEi+0xa6) [0x7fadba32c856]
> > [bt] (3) 
> > /workspace/python/mxnet/../../lib/libmxnet.so(MXKVStorePullRowSparse+0x245)
> [0x7fadba18f165]
> > [bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c)
> [0x7fadde26cadc]
> > [bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc)
> [0x7fadde26c40c]
> > [bt] (6) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-
> x86_64-linux-gnu.so(_ctypes_callproc+0x21d) [0x7fadde47e12d]
> > [bt] (7) /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-
> x86_64-linux-gnu.so(+0xf6a3) [0x7fadde47e6a3]
> > [bt] (8) /usr/bin/python3(PyEval_EvalFrameEx+0x41d7) [0x48a487]
> > [bt] (9) /usr/bin/python3() [0x48f2df]
> >
> > -Kellen
> > Amazon Development Center Germany GmbH
> > Berlin - Dresden - Aachen
> > main office: Krausenstr. 38, 10117 Berlin
> > Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> > Ust-ID: DE289237879
> > Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
> >
> >
> >
> >
> > Amazon Development Center Germany GmbH
> > Berlin - Dresden - Aachen
> > main office: Krausenstr. 38, 10117 Berlin
> > Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> > Ust-ID: DE289237879
> > Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
>



-- 
Best Regards,
Gautam Kumar


Re: What's everyone working on?

2017-09-27 Thread Jun Wu
I had been working on the sparse tensor project with Haibin. After it was
wrapped up for the first stage, I started my work on the quantization
project (INT8 inference). The benefits of using quantized models for
inference include much higher inference throughput than FP32 models, with
acceptable accuracy loss, and more compact models to store on small devices.
The work currently aims at quantizing ConvNets, and we will consider expanding
it to RNNs after getting good results for images. Meanwhile, it is expected to
support quantization on CPU, GPU, and mobile devices.
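
As a rough illustration of the idea (a toy sketch, not the MXNet
implementation), symmetric linear quantization maps an FP32 tensor onto the
INT8 range by scaling its largest magnitude to 127:

    import numpy as np

    def quantize_int8(x):
        # Map the largest magnitude in x to 127 and round everything else.
        scale = 127.0 / max(float(np.abs(x).max()), 1e-8)
        q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) / scale

    x = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(x)
    print(np.abs(dequantize(q, s) - x).max())  # the quantization error is small

The INT8 values can then be fed to integer matrix-multiply kernels, which is
where the inference throughput gain comes from.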


Re: Apache MXNet build failures are mostly valid - verify before merge

2017-09-27 Thread Tsuyoshi OZAWA
+1,

I have been checking the recent build failures, and I think making the
mxnet branch protected is necessary for stable builds.
Thanks, Kumar, for resuming this important discussion.

Best regards
- Tsuyoshi

On Thu, Sep 28, 2017 at 12:56 PM, Kumar, Gautam  wrote:
> Reviving the discussion.
>
> At this point we have a couple of stable builds:
> https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/job/master/448/
> https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/job/master/449/
>
> Should we have a quick discussion or a poll on making the mxnet branch
> protected? If you still think we shouldn't make it protected, please provide
> a reason to support your position.
>
> A few of us have concerns about Jenkins' stability. Looking back over the
> last two weeks, after upgrading the Linux slaves to g2.8x and the new Windows
> AMI, we have not seen any case where an instance died due to high memory
> usage, a process got killed due to high CPU usage, or any other issue with
> the Windows slaves.
>
> Going forward, we are also planning that when we add any new slave, we will
> not put the main load on it immediately, but rather run a 'test build' to
> make sure the new slave does not cause any infrastructure issues and can
> perform as well as the existing slaves.
>
> -Gautam
>
> On 8/31/17, 5:27 PM, "Lupesko, Hagay"  wrote:
>
> @madan looking into some failures – you’re right… there’s multiple issues 
> going on, some of them intermittent, and we want to be able to merge fixes in.
> Agreed that we can wait with setting up protected mode until build 
> stabilizes.
>
> On 8/31/17, 11:41, "Madan Jampani"  wrote:
>
> @hagay: we agree on the end state. I'm not too particular about how 
> we get
> there. If you think enabling it now and fixes regression later is 
> doable,
> I'm fine with. I see a bit of a chicken and egg problem. We need to 
> get
> some fixes in even when the status checks are failing.
>
> On Thu, Aug 31, 2017 at 11:25 AM, Lupesko, Hagay  
> wrote:
>
> > @madan – re: getting to a stable CI first:
> > I’m concerned that by not enabling protected branch mode ASAP, 
> we’re just
> > taking in more regressions, which makes a stable build a moving 
> target for
> > us…
> >
> > On 8/31/17, 10:49, "Zha, Sheng"  wrote:
> >
> > Just one thing: please don’t disable more tests or just raise 
> the
> > tolerance thresholds.
> >
> > Best regards,
> > -sz
> >
> > On 8/31/17, 10:45 AM, "Madan Jampani"  
> wrote:
> >
> > +1
> > Before we can turn protected mode I feel we should first 
> get to a
> > stable CI
> > pipeline.
> > Sandeep is chasing down known breaking issues.
> >
> >
> > On Thu, Aug 31, 2017 at 10:27 AM, Hagay Lupesko 
> 
> > wrote:
> >
> > > Build stability is a major issue, builds have been 
> failing left
> > and right
> > > over the last week. Some of it is due to Jenkins slave 
> issues,
> > but some are
> > > real regressions.
> > > We need to be more strict in the code we're committing.
> > >
> > > I propose we configure our master to be a protected 
> branch (
> > > 
> https://help.github.com/articles/about-protected-branches/).
> > >
> > > Thoughts?
> > >
> > > On 2017-08-28 22:41, sandeep krishnamurthy 
> 
> > wrote:
> > > > Hello Committers and Contributors,>
> > > >
> > > > Due to unstable build pipelines, from past 1 week, PRs 
> are
> > being merged>
> > > > after CR ignoring PR build status. Build pipeline is 
> much more
> > stable
> > > than>
> > > > last week and most of the build failures you see from 
> now on,
> > are likely
> > > to>
> > > > be a valid failure and hence, it is recommended to wait 
> for PR
> > builds,
> > > see>
> > > > the root cause of any build failures before proceeding 
> with
> > merges.>
> > > >
> > > > At this point of time, there are 2 intermittent issue 
> yet to
> > be fixed ->
> > > > * Network error leading to GitHub requests throwing 404>
> > > > * A conflict in artifacts generated between branches/PR 
> -
> > Cause unknown
> > > yet.>
> > > > These 

Re: CI system seems to be using python3 for python2 builds

2017-09-27 Thread Tsuyoshi Ozawa
Hi Kumar,

Thanks for looking into the issue. How is the progress of this problem?
Shouldn't we call /usr/bin/env python2 or python2.7 in the following
source files instead of python, since MXNet only supports python2
currently?
I think the default version of python in Ubuntu is now python3, which could
be causing the problem.
If you have not yet done the work, I can create a PR for that this weekend.

./python/mxnet/__init__.py:#!/usr/bin/env python
./python/mxnet/log.py:#!/usr/bin/env python
./tests/nightly/dist_lenet.py:#!/usr/bin/env python
./tests/nightly/dist_sync_kvstore.py:#!/usr/bin/env python
./tests/nightly/multi_lenet.py:#!/usr/bin/env python
./tests/nightly/test_kvstore.py:#!/usr/bin/env python
./tools/coreml/mxnet_coreml_converter.py:#!/usr/bin/env python
./tools/ipynb2md.py:#!/usr/bin/env python
./tools/kill-mxnet.py:#!/usr/bin/env python
./tools/launch.py:#!/usr/bin/env python
./tools/parse_log.py:#!/usr/bin/env python
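
For illustration, the change proposed above would amount to pinning the
interpreter in each of those shebang lines, roughly like this (a sketch,
assuming python2 is available on the PATH of the build images):

    #!/usr/bin/env python2
    # Pinning python2 explicitly keeps the script from silently running under
    # python3 when plain "python" resolves to python3 on a CI slave.
    import sys
    assert sys.version_info[0] == 2, "expected to run under python2"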

On Wed, Sep 27, 2017 at 5:39 PM, Sunderland, Kellen  wrote:
> Many thanks Gautam.
>
> On 9/26/17, 8:37 PM, "Kumar, Gautam"  wrote:
>
> Hi Kellen,
>
>    This issue has been happening for the last 3-4 days along with a few
> other test failures.
> I am looking into it.
>
> -Gautam
>
> On 9/26/17, 7:45 AM, "Sunderland, Kellen"  wrote:
>
> I’ve been noticing in a few failed builds that the stack trace 
> indicates we’re actually running python 3.4 in the python 2 tests. I know the 
> CI folks are working hard getting everything setup, is this a known issue for 
> the CI team?
>
> For example: 
> https://builds.apache.org/blue/organizations/jenkins/incubator-mxnet/detail/PR-8026/3/pipeline/281
>
> Steps Python2: MKLML-CPU
>
> StackTrace:
> Stack trace returned 10 entries:
> [bt] (0) 
> /workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c)
>  [0x7fadb8999aac]
> [bt] (1) 
> /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal12GroupKVPairsISt4pairIPNS_7NDArrayES4_EZNS1_19GroupKVPairsPullRspERKSt6vectorIiSaIiEERKS7_IS6_SaIS6_EEPS9_PS7_ISD_SaISD_EEEUliRKS6_E_EEvSB_RKS7_IT_SaISN_EESG_PS7_ISP_SaISP_EERKT0_+0x56b)
>  [0x7fadba32c01b]
> [bt] (2) 
> /workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal17PullRowSparseImplERKSt6vectorIiSaIiEERKS2_ISt4pairIPNS_7NDArrayES8_ESaISA_EEi+0xa6)
>  [0x7fadba32c856]
> [bt] (3) 
> /workspace/python/mxnet/../../lib/libmxnet.so(MXKVStorePullRowSparse+0x245) 
> [0x7fadba18f165]
> [bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) 
> [0x7fadde26cadc]
> [bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) 
> [0x7fadde26c40c]
> [bt] (6) 
> /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(_ctypes_callproc+0x21d)
>  [0x7fadde47e12d]
> [bt] (7) 
> /usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(+0xf6a3)
>  [0x7fadde47e6a3]
> [bt] (8) /usr/bin/python3(PyEval_EvalFrameEx+0x41d7) [0x48a487]
> [bt] (9) /usr/bin/python3() [0x48f2df]
>
> -Kellen
> Amazon Development Center Germany GmbH
> Berlin - Dresden - Aachen
> main office: Krausenstr. 38, 10117 Berlin
> Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> Ust-ID: DE289237879
> Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
>
>
>
>
> Amazon Development Center Germany GmbH
> Berlin - Dresden - Aachen
> main office: Krausenstr. 38, 10117 Berlin
> Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> Ust-ID: DE289237879
> Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: Apache MXNet build failures are mostly valid - verify before merge

2017-09-27 Thread Kumar, Gautam
Reviving the discussion. 

At this point we have a couple of stable builds:
https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/job/master/448/
https://builds.apache.org/view/Incubator%20Projects/job/incubator-mxnet/job/master/449/

Should we have a quick discussion or a poll on making the mxnet branch
protected? If you still think we shouldn't make it protected, please provide a
reason to support your position.

A few of us have concerns about Jenkins' stability. Looking back over the last
two weeks, after upgrading the Linux slaves to g2.8x and the new Windows AMI,
we have not seen any case where an instance died due to high memory usage, a
process got killed due to high CPU usage, or any other issue with the Windows
slaves.

Going forward, we are also planning that when we add any new slave, we will
not put the main load on it immediately, but rather run a 'test build' to make
sure the new slave does not cause any infrastructure issues and can perform as
well as the existing slaves.

-Gautam 

On 8/31/17, 5:27 PM, "Lupesko, Hagay"  wrote:

@madan looking into some failures – you’re right… there’s multiple issues 
going on, some of them intermittent, and we want to be able to merge fixes in.
Agreed that we can wait with setting up protected mode until build 
stabilizes.

On 8/31/17, 11:41, "Madan Jampani"  wrote:

@hagay: we agree on the end state. I'm not too particular about how we 
get
there. If you think enabling it now and fixes regression later is 
doable,
I'm fine with. I see a bit of a chicken and egg problem. We need to get
some fixes in even when the status checks are failing.

On Thu, Aug 31, 2017 at 11:25 AM, Lupesko, Hagay  
wrote:

> @madan – re: getting to a stable CI first:
> I’m concerned that by not enabling protected branch mode ASAP, we’re 
just
> taking in more regressions, which makes a stable build a moving 
target for
> us…
>
> On 8/31/17, 10:49, "Zha, Sheng"  wrote:
>
> Just one thing: please don’t disable more tests or just raise the
> tolerance thresholds.
>
> Best regards,
> -sz
>
> On 8/31/17, 10:45 AM, "Madan Jampani"  
wrote:
>
> +1
> Before we can turn protected mode I feel we should first get 
to a
> stable CI
> pipeline.
> Sandeep is chasing down known breaking issues.
>
>
> On Thu, Aug 31, 2017 at 10:27 AM, Hagay Lupesko 

> wrote:
>
> > Build stability is a major issue, builds have been failing 
left
> and right
> > over the last week. Some of it is due to Jenkins slave 
issues,
> but some are
> > real regressions.
> > We need to be more strict in the code we're committing.
> >
> > I propose we configure our master to be a protected branch (
> > https://help.github.com/articles/about-protected-branches/).
> >
> > Thoughts?
> >
> > On 2017-08-28 22:41, sandeep krishnamurthy 
> wrote:
> > > Hello Committers and Contributors,>
> > >
> > > Due to unstable build pipelines, from past 1 week, PRs are
> being merged>
> > > after CR ignoring PR build status. Build pipeline is much 
more
> stable
> > than>
> > > last week and most of the build failures you see from now 
on,
> are likely
> > to>
> > > be a valid failure and hence, it is recommended to wait 
for PR
> builds,
> > see>
> > > the root cause of any build failures before proceeding 
with
> merges.>
> > >
> > > At this point of time, there are 2 intermittent issue yet 
to
> be fixed ->
> > > * Network error leading to GitHub requests throwing 404>
> > > * A conflict in artifacts generated between branches/PR -
> Cause unknown
> > yet.>
> > > These issues will be fixed soon.>
> > >
> > >
> > > -- >
> > > Sandeep Krishnamurthy>
> > >
> >
>
>
>
>
>
>







Podling Report Reminder - October 2017

2017-09-27 Thread johndament
Dear podling,

This email was sent by an automated system on behalf of the Apache
Incubator PMC. It is an initial reminder to give you plenty of time to
prepare your quarterly board report.

The board meeting is scheduled for Wed, 18 October 2017, 10:30 am PDT.
The report for your podling will form a part of the Incubator PMC
report. The Incubator PMC requires your report to be submitted 2 weeks
before the board meeting, to allow sufficient time for review and
submission (Wed, October 04).

Please submit your report with sufficient time to allow the Incubator
PMC, and subsequently board members to review and digest. Again, the
very latest you should submit your report is 2 weeks prior to the board
meeting.

Thanks,

The Apache Incubator PMC

Submitting your Report

--

Your report should contain the following:

*   Your project name
*   A brief description of your project, which assumes no knowledge of
the project or necessarily of its field
*   A list of the three most important issues to address in the move
towards graduation.
*   Any issues that the Incubator PMC or ASF Board might wish/need to be
aware of
*   How has the community developed since the last report
*   How has the project developed since the last report.
*   How does the podling rate their own maturity.

This should be appended to the Incubator Wiki page at:

https://wiki.apache.org/incubator/October2017

Note: This is manually populated. You may need to wait a little before
this page is created from a template.

Mentors
---

Mentors should review reports for their project(s) and sign them off on
the Incubator wiki page. Signing off reports shows that you are
following the project - projects that are not signed may raise alarms
for the Incubator PMC.

Incubator PMC


[BUILD FAILED] Branch master build 447

2017-09-27 Thread Apache Jenkins Server
Build for MXNet branch master has broken. Please view the build at 
https://builds.apache.org/job/incubator-mxnet/job/master/447/

[BUILD FAILED] Branch master build 446

2017-09-27 Thread Apache Jenkins Server
Build for MXNet branch master has broken. Please view the build at 
https://builds.apache.org/job/incubator-mxnet/job/master/446/

PR builds are currently failing due to a known issue

2017-09-27 Thread Meghna Baijal
Hi All, 
This is just to let everyone know that PR #8034 is breaking the Apache MXNet PR 
builds for the moment. The master branch is not affected by this. 

This PR makes changes to the Jenkinsfile and some script approvals are required 
from the Apache infra team. I have opened a JIRA ticket for the same -
https://issues.apache.org/jira/browse/INFRA-15176 - and we are in the process
of resolving it.

I will update this thread once the issue is fixed. 

Thanks,
Meghna Baijal



[BUILD FAILED] Branch master build 445

2017-09-27 Thread Apache Jenkins Server
Build for MXNet branch master has broken. Please view the build at 
https://builds.apache.org/job/incubator-mxnet/job/master/445/

Re: MXNet Build Services

2017-09-27 Thread Meghna Baijal
Hello Daniel, 
Thank you for reaching out to us with your concerns. I do understand your 
points and apologize for not following the correct procedures. 
We are making every attempt to address these issues. For starters:
1. We are directing our build-related queries to the bui...@apache.org mailing
list.
2. We are opening JIRA tickets to get help from the Apache Infra team.
3. We have created a “builds” channel on the MXNet Slack workspace
(apache-mxnet.slack.com). We now have all our build-related discussions on this
channel and anyone can subscribe to it. I hope this helps improve communication
between our teams.

 
Thanks,
Meghna

> On Sep 21, 2017, at 12:05 PM, Daniel Pono Takamori  wrote:
> 
> Hello MXNet Team,
> I wanted to send an email to check in on your build services and make
> sure we're on the same page when it comes to the MXNet project and
> Apache Infrastructure.  As I'm sure you are aware the ASF has over 200
> active projects and plenty more subprojects that we shepherd. Coming
> from the infra side of things, this can often make it overwhelming to
> understand and work with the multitude of needs these projects
> possess.
> 
> When MXNet was onboarding into the Incubator I tried my best to
> facilitate the transition and as such made myself readily available to
> your team members who asked for it.  This might have been a mistake as
> I didn't make it clear that I was giving priority to your project to
> get you up to speed.  As it stands now it seems that your team might
> think I'm the only infra member which can help them!  On the contrary
> we have a great team of 5 people who are equally if not more
> knowledgeable than I am.
> 
> The places to reach our team are on Hipchat, where we can try to be
> real time but given workloads sometimes that's not possible, email
> us...@infra.apache.org for general infrastructure questions and
> bui...@apache.org for specific questions.  But most important is the
> JIRA instance https://issues.apache.org/jira/browse/INFRA  This is
> where you can file tickets when you need help with things and we will
> be able to look at them and work on them as our workload enables us to
> (keep in mind we cannot respond instantly to each of 200 projects so
> you'll have to bear with us).
> 
> Now onto the more technical side of things.  As you are hosting your
> own build nodes and connecting them to our jenkins we can only do the
> adding and renaming, etc.  Recently with
> https://issues.apache.org/jira/browse/INFRA-15114  it came up that
> there is some backend tooling at Amazon that Apache Infra was
> completely unaware of.  It would definitely help all of us involved if
> you keyed us in on the overall design of your setup; that way it wouldn't
> seem like every single request in Hipchat was of utmost urgency.
> Browsing the dev@mxnet list I haven't been able to find much
> information about the build setup so I'm not sure where those
> conversations are happening.
> 
> To summarize:
> 1).  The ASF is a huge organization and cannot give preferred
> treatment to projects.
> 2).  I (Pono) am not the only Infra member that can help you
> 3). JIRA is the best place to get our team's attention for work items
> 4).  Discussion about your build system should be more transparent and
> certainly include the Infra team.
> 
> Thanks for listening and good luck Incubating!
> -Daniel Pono Takamori



[BUILD FAILED] Branch master build 443

2017-09-27 Thread Apache Jenkins Server
Build for MXNet branch master has broken. Please view the build at 
https://builds.apache.org/job/incubator-mxnet/job/master/443/

[BUILD FAILED] Branch master build 441

2017-09-27 Thread Apache Jenkins Server
Build for MXNet branch master has broken. Please view the build at 
https://builds.apache.org/job/incubator-mxnet/job/master/441/

Re: What's everyone working on?

2017-09-27 Thread Rahul Huilgol
Chao and I are working on compressing gradients to low-bit precision (2-bit
for now) to reduce communication costs and hence speed up training, especially
distributed training. The idea is to retain the compression error as a
residual and incorporate it into later iterations, so we don't see much (or
any?) loss in accuracy.
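
A toy sketch of the scheme (not the MXNet implementation), assuming a simple
threshold-based 2-bit encoding:

    import numpy as np

    def two_bit_compress(grad, residual, threshold=0.5):
        # Fold the previously retained error back into the gradient, then send
        # each value as +threshold, -threshold, or 0; keep the new error as the
        # residual for the next iteration.
        acc = grad + residual
        sent = np.where(acc >= threshold, threshold,
                        np.where(acc <= -threshold, -threshold, 0.0))
        return sent, acc - sent

    # The residual starts at zero and is carried across iterations.
    g = np.array([0.9, -0.2, 0.1, -0.7])
    sent, res = two_bit_compress(g, np.zeros_like(g))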

Regards,
Rahul

On Wed, 27 Sep 2017 at 07:20 kellen sunderland 
wrote:

> Pedro and I are focusing on a few use cases involving mobile and IoT device
> development.  At the moment we're trying to run machine translation and
> object detection models on a Jetson TX2 with reasonable performance.  We'll
> probably also look at a few different types of model compression at some
> point as well.  I think we're also happy to chip in with bug fixes where we
> can.
>
> -Kellen
>
> On Wed, Sep 27, 2017 at 3:58 PM, Dom Divakaruni <
> dominic.divakar...@gmail.com> wrote:
>
> > A couple of us are working on sparse support. Bhavin or Haibin, can you
> > fill in more detail?
> >
> > Regards,
> > Dom
> >
> >
> > > On Sep 26, 2017, at 4:35 PM, Nan Zhu  wrote:
> > >
> > > I am essentially doing the same thing as in xgboost-spark
> > >
> > > DF based ML integration, etc.
> > >
> > > Get Outlook for iOS
> > > 
> > > From: Naveen Swamy 
> > > Sent: Tuesday, September 26, 2017 4:20:25 PM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: What's everyone working on?
> > >
> > > Hi Nan Zhu,
> > >
> > > Thanks for the update. Curious to know what part of mxnet-spark are you
> > > working on?
> > >
> > > I am also evaluating the integration of MXNet with Spark, planning to
> > start
> > > with PySpark and also looking into spark-deep learning-pipelines
> > > .
> > >
> > > Thanks, Naveen
> > >
> > >> On Tue, Sep 26, 2017 at 4:06 PM, Nan Zhu 
> > wrote:
> > >>
> > >> working on mxnet-spark, fixing some limitations in ps-lite(busy
> for
> > >> daily job in these days, should be back next week)
> > >>
> > >>> On Tue, Sep 26, 2017 at 10:13 AM, YiZhi Liu 
> > wrote:
> > >>>
> > >>> Hi Dominic,
> > >>>
> > >>> I'm working on 0.11-snapshot and we will soon have one. While the
> > >>> stable release will be after that we change package name from
> > >>> 'ml.dmlc' to 'org.apache'.
> > >>>
> > >>> 2017-09-27 0:04 GMT+08:00 Dominic Divakaruni <
> > >> dominic.divakar...@gmail.com
> >  :
> >  That's great, YiZhi. Workday uses the Scala package and was looking
> > >> for a
> >  maven distro for v0.11. When do you think you'll have one up?
> > 
> >  On Tue, Sep 26, 2017 at 8:58 AM, YiZhi Liu 
> > >> wrote:
> > 
> > > I'm currently working on maven deploy for scala package.
> > >
> > > 2017-09-26 16:00 GMT+08:00 Zihao Zheng :
> > >> I’m working on standalone TensorBoard, https://github.com/dmlc/
> > > tensorboard , currently we’ve
> > > support several features in original TensorBoard from TensorFlow in
> > >> pure
> > > Python without any DL framework dependency.
> > >>
> > >> Recently I’m trying to bring more features to this standalone
> > >> version,
> > > but seems not very trivial as it depends on TensorFlow. Any advice
> > are
> > > welcomed and looking for help.
> > >>
> > >> Thanks,
> > >> Zihao
> > >>
> > >>> 在 2017年9月26日,下午1:58,sandeep krishnamurthy <
> > >>> sandeep.krishn...@gmail.com>
> > > 写道:
> > >>>
> > >>> I am currently working with Jiajie Chen (https://github.com/
> > >>> jiajiechen/)
> > > on
> > >>> building an automated periodic benchmarking framework to run
> > >> various
> > >>> standard MXNet training jobs with both Symbolic and Gluon
> > >> interface.
> > > This
> > >>> framework will run following standard training jobs on a nightly
> > >> and
> > > weekly
> > >>> basis helping us to track performance improvements or regression
> > >>> early
> > > in
> > >>> the development cycle of MXNet. Both CPU and GPU instances are
> used
> > >>> capturing various metrics like training accuracy, validation
> > >>> accuracy,
> > >>> convergence, memory consumption, speed.
> > >>>
> > >>> To start with, we will be running Resnet50, Resnet152 on CIFAR
> and
> > >>> Synthetic Dataset. And, few more RNN and Bidirectional LSTM
> > >> training
> > > jobs.
> > >>>
> > >>> Thanks,
> > >>> Sandeep
> > >>>
> > >>>
> > >>> On Mon, Sep 25, 2017 at 8:00 PM, Henri Yandell <
> bay...@apache.org>
> > > wrote:
> > >>>
> >  Getting an instance of github.com/amzn/oss-dashboard setup for
> > >>> mxnet.
> > 
> >  Hopefully useful to write custom metric analysis; like: "most
> pull
> > > requests
> > 

[BUILD FAILED] Branch master build 442

2017-09-27 Thread Apache Jenkins Server
Build for MXNet branch master has broken. Please view the build at 
https://builds.apache.org/job/incubator-mxnet/job/master/442/

Re: Status of Sparse Tensor Support in MXNet

2017-09-27 Thread Haibin Lin
I'm not sure why hyperlinks don't work well with the mailing list. Here's a
duplicate without the links.

I’ve been working on sparse tensor support in MXNet. I’d like to share a
bit regarding what I worked on and gather some inputs/feature requests from
the community.

Recently sparse tensor CPU support has been merged to MXNet master with:
- Two sparse data formats: Compressed Sparse Row(CSR, for sparse inputs)
and Row Sparse (for sparse gradients)
- Two data iterators for sparse data input: NDArrayIter and LibSVMIter
- Three optimizers for sparse gradient updates: Ftrl(@CNevd), SGD and Adam
- Sparse storage conversion, matrix-matrix product, matrix-vector product,
and sparse gradient aggregation operators (CPU @reminisce, GPU
@stefanhenneking)
- Many sparse element-wise CPU operators including arithmetic (e.g.
elemwise_add), rounding, trigonometric, hyperbolic, exponents, logarithms,
and power operators (mainly implemented for Row Sparse but not yet for CSR
@cjolivier01).
- Distributed kvstore with sparse push/pull (CPU only, 64-bit hashed keys
not supported for distributed training)
- Distributed linear regression example with sparse data
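
As a small illustration of the two storage formats listed above, here is a
minimal construction sketch (assuming the mx.nd.sparse API as merged on
master; the exact argument order is an assumption and may differ between
versions):

    import mxnet as mx

    # CSR is typical for sparse inputs, e.g. bag-of-words features;
    # (data, indices, indptr) follows the scipy-style layout (an assumption).
    csr = mx.nd.sparse.csr_matrix(([1.0, 2.0], [1, 0], [0, 1, 2]), shape=(2, 3))

    # Row Sparse is typical for sparse gradients: only a few rows are non-zero.
    rsp = mx.nd.sparse.row_sparse_array((mx.nd.ones((2, 3)), [0, 4]), shape=(5, 3))

    dense = csr.tostype('default')  # convert back to a dense NDArray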

There’re also some ongoing benchmarking efforts for matrix multiplication,
memory usage and distributed training within MXNet (@anirudh2290) and
tutorials regarding basic sparse operations (work in progress, comments are
welcome).

The future work I have in mind includes:
- Update document to reflect available sparse operators and benchmark
results
- Sparse embedding operator
- Adagrad optimizer for sparse gradient updates
- Reduce sum operator for CSR
- Gluon interface support
- Factorization machine example
- Noise contrastive estimation example

What sparse related features and operator support would you need and what
do you want to use it for? Do you want any item in the list of future work
to become available sooner? Any feedback is welcome. Thanks a lot.

Best,
Haibin


On Wed, Sep 27, 2017 at 10:12 AM, Haibin Lin 
wrote:

> (It looks like the previous email didn’t go through. Resending it)
>
>
>
> Hi everyone,
>
>
>
> I’ve been working on sparse tensor support in MXNet. I’d like to share a
> bit regarding what I worked on and gather some inputs/feature requests from
> the community.
>
>
>
> Recently sparse tensor CPU support has been merged to MXNet master with:
>
>    - Two sparse data formats: Compressed Sparse Row (CSR, for sparse inputs)
>      and Row Sparse (for sparse gradients)
>    - Two data iterators for sparse data input: NDArrayIter and LibSVMIter
>    - Three optimizers for sparse gradient updates: Ftrl (@CNevd), SGD and Adam
>    - Sparse storage conversion, matrix-matrix product, matrix-vector product,
>      and sparse gradient aggregation operators (CPU @reminisce, GPU
>      @stefanhenneking)
>    - Many sparse element-wise CPU operators including: arithmetic (e.g.
>      elemwise_add), rounding, trigonometric, hyperbolic, exponents,
>      logarithms, and power operators (mainly implemented for Row Sparse but
>      not yet for CSR @cjolivier01).
>    - Distributed kv-store with sparse push/pull (CPU only, 64-bit hashed keys
>      not supported for distributed training)
>    - Distributed linear regression example with sparse data
>
>
>
> There’re also some ongoing 

Status of Sparse Tensor Support in MXNet

2017-09-27 Thread Haibin Lin
(It looks like the previous email didn’t go through. Resending it)



Hi everyone,



I’ve been working on sparse tensor support in MXNet. I’d like to share a
bit regarding what I worked on and gather some inputs/feature requests from
the community.



Recently sparse tensor CPU support has been merged to MXNet master with:

   - Two sparse data formats: Compressed Sparse Row (CSR, for sparse inputs)
     and Row Sparse (for sparse gradients)
   - Two data iterators for sparse data input: NDArrayIter and LibSVMIter
   - Three optimizers for sparse gradient updates: Ftrl (@CNevd), SGD and Adam
   - Sparse storage conversion, matrix-matrix product, matrix-vector product,
     and sparse gradient aggregation operators (CPU @reminisce, GPU
     @stefanhenneking)
   - Many sparse element-wise CPU operators including: arithmetic (e.g.
     elemwise_add), rounding, trigonometric, hyperbolic, exponents,
     logarithms, and power operators (mainly implemented for Row Sparse but not
     yet for CSR @cjolivier01).
   - Distributed kv-store with sparse push/pull (CPU only, 64-bit hashed keys
     not supported for distributed training)
   - Distributed linear regression example with sparse data



There’re also some ongoing benchmarking efforts for matrix multiplication,
memory usage and distributed training within MXNet (@anirudh2290) and
tutorials regarding
basic sparse operations (work in progress, comments are welcome).



The future work I have in mind includes:

   - Update document to reflect available sparse operators and benchmark
   results
   - Sparse embedding operator
   - Adagrad optimizer for sparse gradient updates
   - Reduce sum operator for CSR
   - Gluon interface support
   - Factorization machine example
   - Noise contrastive estimation example



*What sparse related features and operator support would you need and what
do you want to use it for? Do you want any item in the list of future work
to become available sooner? Any feedback is welcome. Thanks a lot.*



Best,

Haibin


Re: CI problems

2017-09-27 Thread Chris Olivier
By the way, I am not referring to a few tests that are known to fail 1%-10%
or so of the time (i.e. test_batchnorm_training) and are being actively
worked on. I am referring to tests that fail 100% of the time and are still
merged into master, and thus propagate to all branches when sync'd from
master.

On Wed, Sep 27, 2017 at 8:43 AM, Chris Olivier 
wrote:

> How are so many broken unit tests getting into master?  Is stuff being
> merged without passing CI/unit testing?  I have been trying to get three
> PR's to build for over a week now.  Each time it's some broken test or
> another that has nothing to do with my code changes.  It's extremely
> frustrating -- I waste whole days on this, trying to figure out why my code
> is breaking strange things only to realize later it's broken in all
> branches.
>


CI problems

2017-09-27 Thread Chris Olivier
How are so many broken unit tests getting into master?  Is stuff being
merged without passing CI/unit testing?  I have been trying to get three
PR's to build for over a week now.  Each time it's some broken test or
another that has nothing to do with my code changes.  It's extremely
frustrating -- I waste whole days on this, trying to figure out why my code
is breaking strange things only to realize later it's broken in all
branches.


Re: What's everyone working on?

2017-09-27 Thread kellen sunderland
Pedro and I are focusing on a few use cases involving mobile and IoT device
development.  At the moment we're trying to run machine translation and
object detection models on a Jetson TX2 with reasonable performance.  We'll
probably also look at a few different types of model compression at some
point as well.  I think we're also happy to chip in with bug fixes where we
can.

-Kellen

On Wed, Sep 27, 2017 at 3:58 PM, Dom Divakaruni <
dominic.divakar...@gmail.com> wrote:

> A couple of us are working on sparse support. Bhavin or Haibin, can you
> fill in more detail?
>
> Regards,
> Dom
>
>
> > On Sep 26, 2017, at 4:35 PM, Nan Zhu  wrote:
> >
> > I am essentially doing the same thing as in xgboost-spark
> >
> > DF based ML integration, etc.
> >
> > Get Outlook for iOS
> > 
> > From: Naveen Swamy 
> > Sent: Tuesday, September 26, 2017 4:20:25 PM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: What's everyone working on?
> >
> > Hi Nan Zhu,
> >
> > Thanks for the update. Curious to know what part of mxnet-spark are you
> > working on?
> >
> > I am also evaluating the integration of MXNet with Spark, planning to
> start
> > with PySpark and also looking into spark-deep learning-pipelines
> > .
> >
> > Thanks, Naveen
> >
> >> On Tue, Sep 26, 2017 at 4:06 PM, Nan Zhu 
> wrote:
> >>
> >> working on mxnet-spark, fixing some limitations in ps-lite(busy for
> >> daily job in these days, should be back next week)
> >>
> >>> On Tue, Sep 26, 2017 at 10:13 AM, YiZhi Liu 
> wrote:
> >>>
> >>> Hi Dominic,
> >>>
> >>> I'm working on 0.11-snapshot and we will soon have one. While the
> >>> stable release will be after that we change package name from
> >>> 'ml.dmlc' to 'org.apache'.
> >>>
> >>> 2017-09-27 0:04 GMT+08:00 Dominic Divakaruni <
> >> dominic.divakar...@gmail.com
>  :
>  That's great, YiZhi. Workday uses the Scala package and was looking
> >> for a
>  maven distro for v0.11. When do you think you'll have one up?
> 
>  On Tue, Sep 26, 2017 at 8:58 AM, YiZhi Liu 
> >> wrote:
> 
> > I'm currently working on maven deploy for scala package.
> >
> > 2017-09-26 16:00 GMT+08:00 Zihao Zheng :
> >> I’m working on standalone TensorBoard, https://github.com/dmlc/
> > tensorboard , currently we’ve
> > support several features in original TensorBoard from TensorFlow in
> >> pure
> > Python without any DL framework dependency.
> >>
> >> Recently I’m trying to bring more features to this standalone
> >> version,
> > but seems not very trivial as it depends on TensorFlow. Any advice
> are
> > welcomed and looking for help.
> >>
> >> Thanks,
> >> Zihao
> >>
> >>> 在 2017年9月26日,下午1:58,sandeep krishnamurthy <
> >>> sandeep.krishn...@gmail.com>
> > 写道:
> >>>
> >>> I am currently working with Jiajie Chen (https://github.com/
> >>> jiajiechen/)
> > on
> >>> building an automated periodic benchmarking framework to run
> >> various
> >>> standard MXNet training jobs with both Symbolic and Gluon
> >> interface.
> > This
> >>> framework will run following standard training jobs on a nightly
> >> and
> > weekly
> >>> basis helping us to track performance improvements or regression
> >>> early
> > in
> >>> the development cycle of MXNet. Both CPU and GPU instances are used
> >>> capturing various metrics like training accuracy, validation
> >>> accuracy,
> >>> convergence, memory consumption, speed.
> >>>
> >>> To start with, we will be running Resnet50, Resnet152 on CIFAR and
> >>> Synthetic Dataset. And, few more RNN and Bidirectional LSTM
> >> training
> > jobs.
> >>>
> >>> Thanks,
> >>> Sandeep
> >>>
> >>>
> >>> On Mon, Sep 25, 2017 at 8:00 PM, Henri Yandell 
> > wrote:
> >>>
>  Getting an instance of github.com/amzn/oss-dashboard setup for
> >>> mxnet.
> 
>  Hopefully useful to write custom metric analysis; like: "most pull
> > requests
>  from non-committer" and "PRs without committer comment".
> 
>  Hen
> 
>  On Mon, Sep 25, 2017 at 11:24 Seb Kiureghian 
> > wrote:
> 
> > Hey dev@,
> >
> > In the spirit of bringing more activity to the mailing lists and
> > growing
> > the community, can everyone who is working on MXNet please share
> >>> what
> > you're working on?
> >
> > I'm working on
> > -Redesigning the website
> > .
> > -Setting up a forum for user support.
> >
> > Seb Kiureghian
> 

Re: CI system seems to be using python3 for python2 builds

2017-09-27 Thread Sunderland, Kellen
Many thanks Gautam.

On 9/26/17, 8:37 PM, "Kumar, Gautam"  wrote:

Hi Kellen, 

   This issue has been happening for the last 3-4 days along with a few other
test failures.
I am looking into it.  

-Gautam 

On 9/26/17, 7:45 AM, "Sunderland, Kellen"  wrote:

I’ve been noticing in a few failed builds that the stack trace 
indicates we’re actually running python 3.4 in the python 2 tests. I know the 
CI folks are working hard getting everything setup, is this a known issue for 
the CI team?

For example: 
https://builds.apache.org/blue/organizations/jenkins/incubator-mxnet/detail/PR-8026/3/pipeline/281

Steps Python2: MKLML-CPU

StackTrace:
Stack trace returned 10 entries:
[bt] (0) 
/workspace/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c)
 [0x7fadb8999aac]
[bt] (1) 
/workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal12GroupKVPairsISt4pairIPNS_7NDArrayES4_EZNS1_19GroupKVPairsPullRspERKSt6vectorIiSaIiEERKS7_IS6_SaIS6_EEPS9_PS7_ISD_SaISD_EEEUliRKS6_E_EEvSB_RKS7_IT_SaISN_EESG_PS7_ISP_SaISP_EERKT0_+0x56b)
 [0x7fadba32c01b]
[bt] (2) 
/workspace/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7kvstore12KVStoreLocal17PullRowSparseImplERKSt6vectorIiSaIiEERKS2_ISt4pairIPNS_7NDArrayES8_ESaISA_EEi+0xa6)
 [0x7fadba32c856]
[bt] (3) 
/workspace/python/mxnet/../../lib/libmxnet.so(MXKVStorePullRowSparse+0x245) 
[0x7fadba18f165]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) 
[0x7fadde26cadc]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) 
[0x7fadde26c40c]
[bt] (6) 
/usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(_ctypes_callproc+0x21d)
 [0x7fadde47e12d]
[bt] (7) 
/usr/lib/python3.4/lib-dynload/_ctypes.cpython-34m-x86_64-linux-gnu.so(+0xf6a3) 
[0x7fadde47e6a3]
[bt] (8) /usr/bin/python3(PyEval_EvalFrameEx+0x41d7) [0x48a487]
[bt] (9) /usr/bin/python3() [0x48f2df]

-Kellen
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B




Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B


Re: mxnet slack channel

2017-09-27 Thread Steffen Rochel
Done

On Tue, Sep 26, 2017 at 11:01 PM Johan Gudmundsson <
johan.gudmunds...@gmail.com> wrote:

> Hi
> Could I have access to the mxnet slack channel,
> I'm currently looking into mxnet for bayesian inference.
>
> Best
> Johan Gudmundsson
>


mxnet slack channel

2017-09-27 Thread Johan Gudmundsson
Hi
Could I have access to the mxnet slack channel,
I'm currently looking into mxnet for bayesian inference.

Best
Johan Gudmundsson