Bhavin: I would add, on point 5, that it doesn't always make sense to assign ownership of a broken integration test to the PR author. We're planning extensive integration tests on a variety of hardware; some of these test failures won't be reproducible by most PR authors, and the effort to resolve them should be delegated to a test owner. I agree with Pedro that this would be strictly fast-forward merging from one branch to another after integration tests pass, so it shouldn't require much extra work beyond fixing failures.
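The fast-forward-only promotion discussed here can be sketched as a short git sequence. This is an illustrative sketch in a scratch repository, not MXNet's actual Jenkins job; the branch names ("master", "integration") and commit messages are assumptions:

```shell
set -euo pipefail

# Scratch repository to demonstrate fast-forward-only promotion of an
# integration branch into master (branch names are hypothetical).
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git checkout -q -b master     # pin the branch name regardless of git defaults
git -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "initial"

# PRs land on the integration branch, where the nightly tests run.
git checkout -q -b integration
git -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "PR merged after nightly tests"

# Promotion step: master fast-forwards to integration. --ff-only makes
# the merge abort (instead of creating a merge commit) if master has
# diverged, so the step needs no human gatekeeper.
git checkout -q master
git merge --ff-only -q integration
echo "master now at: $(git log -1 --format=%s)"
```

Because the final step is `--ff-only`, the "promotion" is just moving a pointer, which is why it is equivalent to the tag-based suggestion (`git tag -f last_clean_build <commit>`) raised elsewhere in the thread.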
On Wed, Nov 1, 2017 at 6:35 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> Hi Bhavin
>
> Good suggestions.
>
> I wanted to respond to your point #5.
>
> The promotion of integration to master would be done automatically by
> Jenkins once a commit passes the nightly tests, so it should not
> impose any additional burden on the developers: there is no manual
> step involved / human gatekeeper.
>
> It would be equivalent to your suggestion with tags. You can do the
> same with branches; a git branch is just a pointer to some commit,
> so I think we are talking about the same thing.
>
> Pedro.
>
> On Wed, Nov 1, 2017 at 5:41 PM, Bhavin Thaker <bhavintha...@gmail.com> wrote:
> > A few comments/suggestions:
> >
> > 1) Can we have this nice list of todo items on the Apache MXNet wiki page
> > to track them better?
> >
> > 2) Can we have a set of owners for each set of tests and source code
> > directory? One of the problems I have observed is that when there is a
> > test failure, it is difficult to find an owner who will take responsibility
> > for fixing the test OR identifying the culprit code promptly -- this causes
> > the master to continue to fail for many days.
> >
> > 3) Specifically, we need an owner for the Windows setup -- nobody seems
> > to know much about it -- please feel free to correct me if required.
> >
> > 4) +1 to having a list of all feature requests on Jira or a similar
> > commonly and easily accessible system.
> >
> > 5) -1 to the branching model -- I was the gatekeeper for the branching
> > model at Informix for the database kernel code to be merged to master,
> > alongside my day job of being a database kernel engineer, for around
> > 9 months, and hence have the opinion that a branching model just shifts
> > the burden from one place to another. We don't have a dedicated team to
> > run the branching model.
> > If we really need a buildable master every day, then we could just
> > tag every successful build as last_clean_build on master -- use this tag
> > to get a clean master at any time. How many Apache projects are doing
> > development on separate branches?
> >
> > 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR:
> > https://github.com/apache/incubator-mxnet/pull/7109 and has added a test
> > that fails for any warning found. We can build on top of his work.
> >
> > 7) FYI: For the unit-test problems, Meghna identified that some of the
> > unit-test run times have increased significantly in the recent builds. We
> > need volunteers to help diagnose the root cause here (times in minutes):
> >
> > Unit Test Task       Build #337   Build #500   Build #556
> > Python 2: GPU Win        25           38           40
> > Python 3: GPU Win        15           38           46
> > Python 2: CPU            25           35           80
> > Python 3: CPU            14           28           72
> > R: CPU                   20           34           24
> > R: GPU                    5           24           24
> >
> > 8) Ensure that all PRs submitted have corresponding documentation on
> > http://mxnet.io. It may be fine to have documentation follow the code
> > changes as long as there is ownership that this task will be done in a
> > timely manner. For example, I have requested the Nvidia team to submit
> > PRs to update documentation on http://mxnet.io for the Volta changes to
> > MXNet.
> >
> > 9) Ensure that mega-PRs have some level of design or architecture
> > document(s) shared on the Apache MXNet wiki. A mega-PR must have both
> > unit tests and nightly/integration tests submitted to demonstrate a
> > high quality level.
> >
> > 10) Finally, how do we get ownership for code submitted to MXNet? When
> > something fails in a code segment that only a small set of folks know
> > about, what is the expected SLA for a response from them?
> > When users deploy
> > MXNet in production environments, they will expect some form of SLA for
> > support and a patch release.
> >
> > Regards,
> > Bhavin Thaker.
> >
> > On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> >
> >> +1 That would be great.
> >>
> >> On Mon, Oct 30, 2017 at 5:35 PM, Hen <bay...@apache.org> wrote:
> >> > How about we ask for a new mxnet repo to store all the config in?
> >> >
> >> > On Fri, Oct 27, 2017 at 05:30, Pedro Larroy <pedro.larroy.li...@gmail.com>
> >> > wrote:
> >> >
> >> >> Just to provide a high-level overview of the ideas and proposals
> >> >> coming from different sources for the requirements for testing and
> >> >> validation of builds:
> >> >>
> >> >> * Have Terraform files for the testing infrastructure: Infrastructure
> >> >> as Code (IaC), except for the non-emulated, non-cloud embedded
> >> >> hardware ("single command" replication of the testing infrastructure,
> >> >> no manual steps).
> >> >>
> >> >> * CI software based on Jenkins, unless someone thinks there's a
> >> >> better alternative.
> >> >>
> >> >> * Use autoscaling groups and improve staggered build + test steps to
> >> >> achieve higher parallelism and shorter feedback times.
> >> >>
> >> >> * Switch to a branching model based on a stable master + an
> >> >> integration branch. PRs are merged into dev/integration, which runs
> >> >> extended nightly tests, and are then merged into master, preferably
> >> >> in an automated way after successful extended testing.
> >> >> Master is always tested and always buildable. Release branches or
> >> >> tags in master as usual for releases.
> >> >>
> >> >> * Build + test feedback time targeting less than 15 minutes.
> >> >> (Currently a build on a 16-core machine takes 7 minutes.)
> >> >> This involves a lot of
> >> >> refactoring of tests: moving expensive tests / big smoke tests to
> >> >> nightlies on the integration branch, plus tests on IoT devices and
> >> >> power and performance regressions...
> >> >>
> >> >> * Add code coverage and other quality metrics.
> >> >>
> >> >> * Eliminate warnings and treat warnings as errors. We have spent time
> >> >> tracking down "undefined behaviour" bugs that could have been caught
> >> >> by compiler warnings.
> >> >>
> >> >> Is there something I'm missing, or additional things that come to
> >> >> mind that you would wish to add?
> >> >>
> >> >> Pedro.