To point 7) I did a bit of measuring / profiling of our test runs a week
or two ago and came to the same conclusion.  I assumed the slowdowns were
mostly due to tests which had recently been added.  Quite a few Gluon
tests were added recently, for example, and I think they're fairly
resource intensive.
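
For anyone who wants to dig further, the sort of script below is roughly
what I'd use to rank per-test times.  It assumes the xunit-style XML report
that nose can emit with --with-xunit; the file name is just a placeholder,
so adjust it to whatever our CI actually writes out.

```python
# Rank tests by duration from a nose xunit-style XML report, e.g. one
# produced with `nosetests --with-xunit --xunit-file=nosetests.xml`.
# The report path is a placeholder; adjust it to the CI artifact location.
import sys
import xml.etree.ElementTree as ET

def slowest_tests(report_path, top_n=20):
    root = ET.parse(report_path).getroot()
    timings = []
    for case in root.iter("testcase"):
        name = "{}.{}".format(case.get("classname", ""), case.get("name", ""))
        timings.append((float(case.get("time", 0.0)), name))
    return sorted(timings, reverse=True)[:top_n]

if __name__ == "__main__":
    report = sys.argv[1] if len(sys.argv) > 1 else "nosetests.xml"
    for seconds, name in slowest_tests(report):
        print("{:8.2f}s  {}".format(seconds, name))
```

Comparing that output between two builds should make it obvious which
tests regressed versus which tests are simply new.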

On Wed, Nov 1, 2017 at 6:40 PM, kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Bhavin: I would add on point 5 that it doesn't always make sense to attach
> ownership of a broken integration test to the PR author.  We're planning
> extensive integration tests on a variety of hardware.  Some of these test
> failures won't be reproducible by most PR authors, and the effort to resolve
> them should be delegated to a test owner.  Agree with Pedro that this would
> be strictly fast-forward merging from one branch to another after the
> integration tests pass, so it shouldn't require much extra work beyond
> fixing failures.
>
> On Wed, Nov 1, 2017 at 6:35 PM, Pedro Larroy <pedro.larroy.li...@gmail.com
> > wrote:
>
>> Hi Bhavin
>>
>> Good suggestions.
>>
>> I wanted to respond to your point #5
>>
>> The promotion of integration to master would be done automatically by
>> Jenkins once a commit passes the nightly tests.  So it should not impose
>> any additional burden on the developers, as there is no manual step or
>> human gatekeeper involved.
>>
>> It would be equivalent to your suggestion with tags.  You can do the same
>> with branches; a git branch is just a pointer to some commit, so I think
>> we are talking about the same thing.
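>>
>> To make the automation concrete: the promotion step Jenkins would run is
>> essentially a fast-forward-only merge, something like the sketch below.
>> Branch names and the remote are placeholders -- nothing is decided yet.

```python
# Sketch of the promotion step a Jenkins job could run once the nightly
# tests on the integration branch pass.  Branch names are placeholders.
import subprocess

INTEGRATION_BRANCH = "integration"
MASTER_BRANCH = "master"

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

def promote():
    run("git", "fetch", "origin", INTEGRATION_BRANCH, MASTER_BRANCH)
    run("git", "checkout", MASTER_BRANCH)
    # --ff-only means master only ever moves forward to an already-tested
    # commit; the merge fails instead of creating a merge commit.
    run("git", "merge", "--ff-only", "origin/" + INTEGRATION_BRANCH)
    run("git", "push", "origin", MASTER_BRANCH)

if __name__ == "__main__":
    promote()
```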
>>
>> Pedro.
>>
>>
>>
>>
>> On Wed, Nov 1, 2017 at 5:41 PM, Bhavin Thaker <bhavintha...@gmail.com>
>> wrote:
>> > A few comments/suggestions:
>> >
>> > 1) Can we have this nice list of todo items on the Apache MXNet wiki
>> > page to track them better?
>> >
>> > 2) Can we have a set of owners for each set of tests and source code
>> > directory?  One of the problems I have observed is that when there is a
>> > test failure, it is difficult to find an owner who will take
>> > responsibility for fixing the test or identifying the culprit code
>> > promptly -- this causes the master to continue to fail for many days.
>> >
>> > 3) Specifically, we need an owner for the Windows setup -- nobody seems
>> > to know much about it -- please feel free to correct me if I'm wrong.
>> >
>> > 4) +1 to having a list of all feature requests in Jira or a similar,
>> > commonly used and easily accessible system.
>> >
>> > 5) -1 to the branching model.  At Informix I was the gatekeeper for
>> > merging database kernel code to master for around 9 months, on top of my
>> > day job as a database kernel engineer, and my experience is that a
>> > branching model just shifts the burden from one place to another.  We
>> > don't have a dedicated team to run a branching model.  If we really need
>> > a buildable master every day, then we could just tag every successful
>> > build on master as last_clean_build and use this tag to get a clean
>> > master at any time (a rough sketch of this is below, after this list).
>> > How many Apache projects are doing development on separate branches?
>> >
>> > 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR:
>> > https://github.com/apache/incubator-mxnet/pull/7109 and has added a test
>> > that fails for any warning found.  We can build on top of his work.
>> >
>> > 7) FYI: For the unit-test problems, Meghna identified that some of the
>> > unit-test run times have increased significantly in the recent builds.
>> > We need volunteers to help diagnose the root cause here:
>> >
>> > Unit Test Task       Build #337   Build #500   Build #556
>> > Python 2: GPU Win        25           38           40
>> > Python 3: GPU Win        15           38           46
>> > Python 2: CPU            25           35           80
>> > Python 3: CPU            14           28           72
>> > R: CPU                   20           34           24
>> > R: GPU                    5           24           24
>> >
>> >
>> > 8) Ensure that all PRs submitted have corresponding documentation on
>> > http://mxnet.io.  It may be fine for the documentation to follow the code
>> > changes, as long as someone owns getting it done in a timely manner.  For
>> > example, I have requested the Nvidia team to submit PRs to update the
>> > documentation on http://mxnet.io for the Volta changes to MXNet.
>> >
>> >
>> > 9) Ensure that mega-PRs have some level of design or architecture
>> > document(s) shared on the Apache MXNet wiki.  A mega-PR must have both
>> > unit tests and nightly/integration tests submitted to demonstrate a high
>> > level of quality.
>> >
>> >
>> > 10) Finally, how do we get ownership for code submitted to MXNet? When
>> > something fails in a code segment that only a small set of folks know
>> > about, what is the expected SLA for a response from them? When users
>> deploy
>> > MXNet in production environments, they will expect some form of SLA for
>> > support and a patch release.
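>> >
>> > (For point 5 above, a rough sketch of the tagging approach follows.  The
>> > tag name and the idea of driving it from CI are just illustrative.)

```python
# Sketch of the tag-based alternative from point 5: after every successful
# full build/test of master, move a well-known tag to that commit.  The
# tag name and the CI hook that would call this are illustrative only.
import subprocess

CLEAN_TAG = "last_clean_build"

def run(*cmd):
    subprocess.check_call(cmd)

def mark_clean(commit="HEAD"):
    # -f moves the tag if it already exists; the push is forced for the
    # same reason.
    run("git", "tag", "-f", CLEAN_TAG, commit)
    run("git", "push", "-f", "origin", CLEAN_TAG)

# Anyone who wants a known-good tree then just checks out the tag:
#   git checkout last_clean_build
if __name__ == "__main__":
    mark_clean()
```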
>> >
>> >
>> > Regards,
>> > Bhavin Thaker.
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy
>> > <pedro.larroy.li...@gmail.com> wrote:
>> >
>> >> +1  That would be great.
>> >>
>> >> On Mon, Oct 30, 2017 at 5:35 PM, Hen <bay...@apache.org> wrote:
>> >> > How about we ask for a new mxnet repo to store all the config in?
>> >> >
>> >> > On Fri, Oct 27, 2017 at 05:30 Pedro Larroy
>> >> > <pedro.larroy.li...@gmail.com> wrote:
>> >> >
>> >> >> Just to provide a high-level overview of the ideas and proposals,
>> >> >> coming from different sources, for the requirements for testing and
>> >> >> validation of builds:
>> >> >>
>> >> >> * Have Terraform files for the testing infrastructure: infrastructure
>> >> >> as code (IaC), with the exception of embedded hardware, which is
>> >> >> neither emulated nor cloud based ("single command" replication of the
>> >> >> testing infrastructure, no manual steps).
>> >> >>
>> >> >> * CI software based on Jenkins, unless someone thinks there's a better
>> >> >> alternative.
>> >> >>
>> >> >> * Use autoscaling groups and improve staggered build + test steps to
>> >> >> achieve higher parallelism and shorter feedback times.
>> >> >>
>> >> >> * Switch to a branching model based on a stable master + an
>> >> >> integration branch.  PRs are merged into dev/integration, which runs
>> >> >> extended nightly tests and is then merged into master, preferably in
>> >> >> an automated way after the extended tests succeed.  Master is always
>> >> >> tested and always buildable.  Release branches or tags in master as
>> >> >> usual for releases.
>> >> >>
>> >> >> * Build + test feedback time targeting less than 15 minutes.
>> >> >> (Currently a build on a 16-core machine takes 7m.)  This involves a
>> >> >> lot of refactoring of tests: moving expensive tests / big smoke tests
>> >> >> to nightlies on the integration branch (one way to mark them is
>> >> >> sketched after this list), plus tests on IoT devices / power and
>> >> >> performance regressions...
>> >> >>
>> >> >> * Add code coverage and other quality metrics.
>> >> >>
>> >> >> * Eliminate warnings and treat warnings as errors.  We have spent
>> >> >> time tracking down "undefined behaviour" bugs that could have been
>> >> >> caught by compiler warnings.  (A simple log-scanning gate is sketched
>> >> >> below as well.)
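>> >> >>
>> >> >> One way to split the expensive tests out of the PR builds without
>> >> >> moving files around, assuming we stay on nosetests, would be to mark
>> >> >> them with the nose attrib plugin; the 'nightly' label below is just a
>> >> >> suggestion.

```python
# Sketch: mark expensive tests so PR builds can skip them while nightly
# runs on the integration branch execute everything.  Uses nose's built-in
# attrib plugin; the 'nightly' attribute name is only a suggestion.
from nose.plugins.attrib import attr

@attr('nightly')
def test_big_smoke_model():
    # placeholder for an expensive end-to-end / smoke test
    assert True

def test_cheap_unit_case():
    # fast unit test, runs in every PR build
    assert 1 + 1 == 2

# PR builds:      nosetests -a '!nightly' tests/
# Nightly builds: nosetests tests/
```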
>> >> >>
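>> >> >> And for the warnings point, besides enabling -Werror on the
>> >> >> compilers, a belt-and-braces gate in CI could scan the captured build
>> >> >> log; the log path and the warning pattern below are assumptions, not
>> >> >> something we have today.

```python
# Sketch of a CI gate that fails when the captured build log contains
# compiler warnings.  The log path and the regex are assumptions;
# -Werror at compile time would remain the primary mechanism.
import re
import sys

WARNING_RE = re.compile(r"\bwarning:", re.IGNORECASE)

def count_warnings(log_path):
    with open(log_path) as f:
        return sum(1 for line in f if WARNING_RE.search(line))

if __name__ == "__main__":
    log = sys.argv[1] if len(sys.argv) > 1 else "build.log"
    n = count_warnings(log)
    if n:
        sys.exit("Build log contains {} compiler warning(s)".format(n))
    print("No compiler warnings found in", log)
```
>> >> >>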
>> >> >> Is there anything I'm missing, or anything else that comes to mind
>> >> >> that you would like to add?
>> >> >>
>> >> >> Pedro.
>> >> >>
>> >>
>>
>
>
