On point 7: I did a little bit of measuring / profiling of our test runs a week or two ago and came to the same conclusion. I assumed the slowdowns were mostly due to new tests which had recently been added. There were quite a few Gluon tests added recently, for example, and I think they're fairly resource intensive.
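For anyone who wants to help dig into the numbers in point 7 below: here is a rough sketch of how one could rank the slowest tests from a CI run and compare the lists between, say, builds #337 and #556. It assumes a JUnit-style XML report such as the one nosetests writes with --with-xunit; the report file name and the top-N cutoff are just placeholders, not something we have agreed on.

    # Sketch only: rank the slowest tests from a JUnit-style XML report
    # (e.g. the nosetests.xml produced by `nosetests --with-xunit`).
    import sys
    import xml.etree.ElementTree as ET

    def slowest_tests(report_path, top_n=20):
        root = ET.parse(report_path).getroot()
        cases = []
        for case in root.iter('testcase'):
            name = '%s.%s' % (case.get('classname', ''), case.get('name', ''))
            # the 'time' attribute is the per-test wall-clock time in seconds
            cases.append((float(case.get('time', 0.0)), name))
        for seconds, name in sorted(cases, reverse=True)[:top_n]:
            print('%8.2fs  %s' % (seconds, name))

    if __name__ == '__main__':
        slowest_tests(sys.argv[1] if len(sys.argv) > 1 else 'nosetests.xml')

Running this against the reports of two builds and diffing the two lists should show quickly whether the growth comes from a handful of new, expensive tests (e.g. the Gluon ones) or from a general slowdown across the suite.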
On Wed, Nov 1, 2017 at 6:40 PM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:

> Bhavin: I would add on point 5 that it doesn't always make sense to attach ownership for the broken integration test to the PR author. We're planning extensive integration tests on a variety of hardware. Some of these test failures won't be reproducible by most PR authors, and the effort to resolve these failures should be delegated to a test owner. Agree with Pedro that this would be strictly fast-fwd merging from one branch to another after integration tests pass, so it shouldn't require much extra work beyond fixing failures.
>
> On Wed, Nov 1, 2017 at 6:35 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>
>> Hi Bhavin
>>
>> Good suggestions.
>>
>> I wanted to respond to your point #5.
>>
>> The promotion of integration to master would be done automatically by Jenkins once a commit passes the nightly tests. So it should not impose any additional burden on the developers, as there is no manual step involved / human gatekeeper.
>>
>> It would be equivalent to your suggestion with tags. You can do the same with branches; a git branch is just a pointer to some commit, so I think we are talking about the same thing.
>>
>> Pedro.
>>
>> On Wed, Nov 1, 2017 at 5:41 PM, Bhavin Thaker <bhavintha...@gmail.com> wrote:
>>
>> > A few comments/suggestions:
>> >
>> > 1) Can we have this nice list of todo items on the Apache MXNet wiki page to track them better?
>> >
>> > 2) Can we have a set of owners for each set of tests and source code directory? One of the problems I have observed is that when there is a test failure, it is difficult to find an owner who will take the responsibility of fixing the test OR identifying the culprit code promptly -- this causes the master to continue to fail for many days.
>> >
>> > 3) Specifically, we need an owner for the Windows setup -- nobody seems to know much about it -- please feel free to correct me if required.
>> >
>> > 4) +1 to having a list of all feature requests on Jira or a similar commonly and easily accessible system.
>> >
>> > 5) -1 to the branching model -- I was the gatekeeper for the branching model at Informix for the database kernel code to be merged to master, alongside my day job as a database kernel engineer, for around 9 months, and hence have the opinion that a branching model just shifts the burden from one place to another. We don't have a dedicated team to run the branching model. If we really need a buildable master every day, then we could just tag every successful build as last_clean_build on master -- use this tag to get a clean master at any time. How many Apache projects are doing development on separate branches?
>> >
>> > 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR: https://github.com/apache/incubator-mxnet/pull/7109 and has added a test that fails for any warning found. We can build on top of his work.
>> >
>> > 7) FYI: For the unit-test problems, Meghna identified that some of the unit-test run times have increased significantly in the recent builds.
>> > We need volunteers to help diagnose the root-cause here:
>> >
>> > Unit Test Task       Build #337   Build #500   Build #556
>> > Python 2: GPU Win        25           38           40
>> > Python 3: GPU Win        15           38           46
>> > Python 2: CPU            25           35           80
>> > Python 3: CPU            14           28           72
>> > R: CPU                   20           34           24
>> > R: GPU                    5           24           24
>> >
>> > 8) Ensure that all PRs submitted have corresponding documentation on http://mxnet.io. It may be fine to have the documentation follow the code changes as long as there is ownership that this task will be done in a timely manner. For example, I have requested the Nvidia team to submit PRs to update documentation on http://mxnet.io for the Volta changes to MXNet.
>> >
>> > 9) Ensure that mega-PRs have some level of design or architecture document(s) shared on the Apache MXNet wiki. A mega-PR must have both unit tests and nightly/integration tests submitted to demonstrate a high level of quality.
>> >
>> > 10) Finally, how do we get ownership for code submitted to MXNet? When something fails in a code segment that only a small set of folks know about, what is the expected SLA for a response from them? When users deploy MXNet in production environments, they will expect some form of SLA for support and a patch release.
>> >
>> > Regards,
>> > Bhavin Thaker.
>> >
>> > On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>> >
>> >> +1 That would be great.
>> >>
>> >> On Mon, Oct 30, 2017 at 5:35 PM, Hen <bay...@apache.org> wrote:
>> >>
>> >> > How about we ask for a new mxnet repo to store all the config in?
>> >> >
>> >> > On Fri, Oct 27, 2017 at 05:30 Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>> >> >
>> >> >> Just to provide a high-level overview of the ideas and proposals coming from different sources for the requirements for testing and validation of builds:
>> >> >>
>> >> >> * Have Terraform files for the testing infrastructure: infrastructure as code (IaC), the exception being the embedded hardware that is neither emulated nor cloud based ("single command" replication of the testing infrastructure, no manual steps).
>> >> >>
>> >> >> * CI software based on Jenkins, unless someone thinks there's a better alternative.
>> >> >>
>> >> >> * Use autoscaling groups and improve staggered build + test steps to achieve higher parallelism and shorter feedback times.
>> >> >>
>> >> >> * Switch to a branching model based on a stable master + an integration branch. PRs are merged into dev/integration, which runs extended nightly tests, and are then merged into master, preferably in an automated way after successful extended testing. Master is always tested and always buildable. Release branches or tags in master as usual for releases.
>> >> >>
>> >> >> * Build + test feedback time targeting less than 15 minutes. (Currently a build on a 16-core machine takes 7 minutes.) This involves a lot of refactoring of tests, moving expensive tests / big smoke tests to nightlies on the integration branch, plus tests on IoT devices / power and performance regressions...
>> >> >> * Add code coverage and other quality metrics.
>> >> >>
>> >> >> * Eliminate warnings and treat warnings as errors. We have spent time tracking down "undefined behaviour" bugs that could have been caught by compiler warnings.
>> >> >>
>> >> >> Is there something I'm missing, or are there additional things that come to mind that you would wish to add?
>> >> >>
>> >> >> Pedro.
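One more thought on the branching-model bullet above and on the point Pedro and kellen made about the promotion being a strictly fast-forward, fully automated step: below is a minimal sketch of what that promotion job could look like. Branch names ("integration", "master") and the remote name are only placeholders, and in practice this would presumably live in the Jenkins job that runs after the nightly tests pass rather than in a standalone script.

    # Sketch only: promote the tested integration branch to master by
    # fast-forward, so master only ever moves to an already-tested commit.
    import subprocess

    def promote_integration_to_master(remote='origin'):
        def git(*args):
            subprocess.check_call(('git',) + args)

        git('fetch', remote)
        git('checkout', 'master')
        # --ff-only makes the job fail loudly instead of creating a merge
        # commit if master has somehow diverged from the tested commit.
        git('merge', '--ff-only', '%s/integration' % remote)
        git('push', remote, 'master')

    if __name__ == '__main__':
        promote_integration_to_master()

The --ff-only guard is the part that addresses Bhavin's gatekeeper concern: if the merge cannot fast-forward, the job simply fails and a human looks at why master moved, rather than anyone having to act as a manual gate on every promotion.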