Hi Bhavin,

Good suggestions.

I wanted to respond to your point #5. The promotion of integration to master would be done automatically by Jenkins once a commit passes the nightly tests, so it should not impose any additional burden on the developers: there is no manual step and no human gatekeeper involved. It would be equivalent to your suggestion with tags. You can do the same with branches; a git branch is just a pointer to some commit, so I think we are talking about the same thing.
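As a rough sketch of what that automated promotion step could look like (the branch names and the way the tested commit SHA is handed to the script are assumptions, just to make the idea concrete, not how any existing job is wired up):

    # Hypothetical post-nightly promotion step. If the nightly suite passed on
    # the integration branch, fast-forward master to the exact commit that was
    # tested and push it. Branch names and invocation are assumptions.
    import subprocess
    import sys

    def git(*args):
        """Run a git command and return its stdout, raising if it fails."""
        return subprocess.run(["git", *args], check=True,
                              capture_output=True, text=True).stdout.strip()

    def promote(tested_commit, integration="integration", master="master"):
        git("fetch", "origin", integration, master)
        # Refuse to promote if integration gained new commits after the
        # nightly run started; the next nightly will pick them up.
        if git("rev-parse", "origin/" + integration) != tested_commit:
            sys.exit("integration has moved since the nightly run; skipping")
        # Reset the local master to the remote and fast-forward it.
        git("checkout", "-B", master, "origin/" + master)
        # --ff-only guarantees master only ever points at a fully tested commit.
        git("merge", "--ff-only", tested_commit)
        git("push", "origin", master)

    if __name__ == "__main__":
        promote(sys.argv[1])  # the commit SHA recorded by the nightly job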
Pedro.

On Wed, Nov 1, 2017 at 5:41 PM, Bhavin Thaker <bhavintha...@gmail.com> wrote:
> A few comments/suggestions:
>
> 1) Can we have this nice list of todo items on the Apache MXNet wiki page
> to track them better?
>
> 2) Can we have a set of owners for each set of tests and source code
> directory? One of the problems I have observed is that when there is a
> test failure, it is difficult to find an owner who will take the
> responsibility of fixing the test OR identifying the culprit code
> promptly -- this causes the master to continue to fail for many days.
>
> 3) Specifically, we need an owner for the Windows setup -- nobody seems
> to know much about it -- please feel free to correct me if required.
>
> 4) +1 to have a list of all feature requests on Jira or a similar
> commonly and easily accessible system.
>
> 5) -1 to the branching model -- I was the gatekeeper for the branching
> model at Informix for the database kernel code to be merged to master,
> alongside my day job as a database kernel engineer, for around 9 months,
> and hence have the opinion that a branching model just shifts the burden
> from one place to another. We don't have a dedicated team to run the
> branching model. If we really need a buildable master every day, then we
> could just tag every successful build as last_clean_build on master --
> use this tag to get a clean master at any time. How many Apache projects
> are doing development on separate branches?
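For comparison, a minimal sketch of the last_clean_build idea as I understand it (the hook that would call this is hypothetical; only the tag name comes from your description). It also shows why I think the tag approach and the integration-branch approach are essentially the same mechanism of moving a pointer to a tested commit:

    # Sketch of the tag-based alternative: after a successful build + test run
    # on master, move a floating "last_clean_build" tag to the tested commit.
    # The CI hook that would call this is hypothetical.
    import subprocess

    def git(*args):
        subprocess.run(["git", *args], check=True)

    def mark_clean(commit):
        # -f moves the tag if it already exists; force-push to update the remote.
        git("tag", "-f", "last_clean_build", commit)
        git("push", "-f", "origin", "last_clean_build")

    # Anyone who needs a known-good tree can then do:
    #   git fetch --force --tags origin && git checkout last_clean_build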
> 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR:
> https://github.com/apache/incubator-mxnet/pull/7109 and has added a test
> that fails for any warning found. We can build on top of his work.
>
> 7) FYI: For the unit-test problems, Meghna identified that some of the
> unit-test run times have increased significantly in the recent builds. We
> need volunteers to help diagnose the root cause here:
>
>     Unit Test Task        Build #337    Build #500    Build #556
>     Python 2: GPU Win         25            38            40
>     Python 3: GPU Win         15            38            46
>     Python 2: CPU             25            35            80
>     Python 3: CPU             14            28            72
>     R: CPU                    20            34            24
>     R: GPU                     5            24            24
>
> 8) Ensure that all PRs submitted have corresponding documentation on
> http://mxnet.io. It may be fine to have documentation follow the code
> changes, as long as there is ownership that this task will be done in a
> timely manner. For example, I have requested the Nvidia team to submit
> PRs to update the documentation on http://mxnet.io for the Volta changes
> to MXNet.
>
> 9) Ensure that mega-PRs have some level of design or architecture
> document(s) shared on the Apache MXNet wiki. A mega-PR must have both
> unit tests and nightly/integration tests submitted to demonstrate a high
> level of quality.
>
> 10) Finally, how do we get ownership for code submitted to MXNet? When
> something fails in a code segment that only a small set of folks know
> about, what is the expected SLA for a response from them? When users
> deploy MXNet in production environments, they will expect some form of
> SLA for support and a patch release.
>
> Regards,
> Bhavin Thaker.
>
> On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy <pedro.larroy.li...@gmail.com>
> wrote:
>
>> +1 That would be great.
>>
>> On Mon, Oct 30, 2017 at 5:35 PM, Hen <bay...@apache.org> wrote:
>> > How about we ask for a new mxnet repo to store all the config in?
>> >
>> > On Fri, Oct 27, 2017 at 05:30 Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > wrote:
>> >
>> >> Just to provide a high-level overview of the ideas and proposals
>> >> coming from different sources for the requirements for testing and
>> >> validation of builds:
>> >>
>> >> * Have Terraform files for the testing infrastructure: infrastructure
>> >> as code (IaC), except for the embedded hardware, which is neither
>> >> emulated nor cloud-based ("single command" replication of the testing
>> >> infrastructure, no manual steps).
>> >>
>> >> * CI software based on Jenkins, unless someone thinks there's a
>> >> better alternative.
>> >>
>> >> * Use autoscaling groups and improve staggered build + test steps to
>> >> achieve higher parallelism and shorter feedback times.
>> >>
>> >> * Switch to a branching model based on a stable master + an
>> >> integration branch. PRs are merged into dev/integration, which runs
>> >> extended nightly tests, and are then merged into master, preferably
>> >> in an automated way after successful extended testing. Master is
>> >> always tested and always buildable. Release branches or tags are cut
>> >> from master as usual for releases.
>> >>
>> >> * Build + test feedback time targeting less than 15 minutes.
>> >> (Currently a build on a 16-core machine takes 7 minutes.) This
>> >> involves a lot of refactoring of tests: moving expensive tests / big
>> >> smoke tests to nightlies on the integration branch, plus tests on IoT
>> >> devices and power and performance regressions...
>> >>
>> >> * Add code coverage and other quality metrics.
>> >>
>> >> * Eliminate warnings and treat warnings as errors. We have spent time
>> >> tracking down "undefined behaviour" bugs that could have been caught
>> >> by compiler warnings.
>> >>
>> >> Is there something I'm missing, or additional things that come to
>> >> your mind that you would wish to add?
>> >>
>> >> Pedro.
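On the warnings point (item 6 above and the last bullet of the quoted proposal), one possible shape for a check that fails the build on any compiler warning is sketched below. This is not the actual test from the linked PR, just an illustration of the idea; the log path argument and the warning pattern are assumptions:

    # Illustrative "fail on any compiler warning" check (not the test from
    # PR #7109): scan a captured build log and exit non-zero if any gcc- or
    # MSVC-style warning lines are present.
    import re
    import sys

    # Matches "warning:" (gcc/clang) and "warning C1234" (MSVC).
    WARNING_RE = re.compile(r"\bwarning\b\s*[:C]", re.IGNORECASE)

    def find_warnings(log_path):
        with open(log_path, errors="replace") as log:
            return [line.rstrip() for line in log if WARNING_RE.search(line)]

    if __name__ == "__main__":
        warnings = find_warnings(sys.argv[1])  # path to the saved build log
        if warnings:
            print("\n".join(warnings))
            sys.exit("%d compiler warning(s) found; failing the build" % len(warnings))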