Sounds good in theory. I think there are complex details with regards of resource sharing during parallel execution. Still I think both ways can be explored. I think some tests run for unreasonably long times for what they are doing. We already scale parts of the pipeline horizontally across workers.
On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivie...@apache.org> wrote: > +1 > > Rather than remove tests (which doesn’t scale as a solution), why not scale > them horizontally so that they finish more quickly? Across processes or > even on a pool of machines that aren’t necessarily the build machine? > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.ab...@gmail.com> > wrote: > > > With regards to time I rather prefer us spending a bit more time on > > maintenance than somebody running into an error that could've been caught > > with a test. > > > > I mean, our Publishing pipeline for Scala GPU has been broken for quite > > some time now, but nobody noticed that. Basically my stance on that > matter > > is that as soon as something is not blocking, you can also just > deactivate > > it since you don't have a forcing function in an open source project. > > People will rarely come back and fix the errors of some nightly test that > > they introduced. > > > > -Marco > > > > Carin Meier <carinme...@gmail.com> schrieb am Mi., 14. Aug. 2019, 21:59: > > > > > If a language binding test is failing for a not important reason, then > it > > > is too brittle and needs to be fixed (we have fixed some of these with > > the > > > Clojure package [1]). > > > But in general, if we thinking of the MXNet project as one project that > > is > > > across all the language bindings, then we want to know if some > > fundamental > > > code change is going to break a downstream package. > > > I can't speak for all the high level package binding maintainers, but > I'm > > > always happy to pitch in to provide code fixes to help the base PR get > > > green. > > > > > > The time costs to maintain such a large CI project obviously needs to > be > > > considered as well. > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579 > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy < > > pedro.larroy.li...@gmail.com > > > > > > > wrote: > > > > > > > From what I have seen Clojure is 15 minutes, which I think is > > reasonable. > > > > The only question is that when a binding such as R, Perl or Clojure > > > fails, > > > > some devs are a bit confused about how to fix them since they are not > > > > familiar with the testing tools and the language. > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com> > > > wrote: > > > > > > > > > Great idea Marco! Anything that you think would be valuable to > share > > > > would > > > > > be good. The duration of each node in the test stage sounds like a > > good > > > > > start. > > > > > > > > > > - Carin > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu < > > > marco.g.ab...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > we record a bunch of metrics about run statistics (down to the > > > duration > > > > > of > > > > > > every individual step). If you tell me which ones you're > > particularly > > > > > > interested in (probably total duration of each node in the test > > > stage), > > > > > I'm > > > > > > happy to provide them. > > > > > > > > > > > > Dimensions are (in hierarchical order): > > > > > > - job > > > > > > - branch > > > > > > - stage > > > > > > - node > > > > > > - step > > > > > > > > > > > > Unfortunately I don't have the possibility to export them since > we > > > > store > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw exports. > > > > > > > > > > > > Best regards, > > > > > > Marco > > > > > > > > > > > > Carin Meier <carinme...@gmail.com> schrieb am Mi., 14. Aug. > 2019, > > > > 19:43: > > > > > > > > > > > > > I would prefer to keep the language binding in the PR process. > > > > Perhaps > > > > > we > > > > > > > could do some analytics to see how much each of the language > > > bindings > > > > > is > > > > > > > contributing to overall run time. > > > > > > > If we have some metrics on that, maybe we can come up with a > > > > guideline > > > > > of > > > > > > > how much time each should take. Another possibility is leverage > > the > > > > > > > parallel builds more. > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy < > > > > > > pedro.larroy.li...@gmail.com > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Carin. > > > > > > > > > > > > > > > > That's a good point, all things considered would your > > preference > > > be > > > > > to > > > > > > > keep > > > > > > > > the Clojure tests as part of the PR process or in Nightly? > > > > > > > > Some options are having notifications here or in slack. But > if > > we > > > > > think > > > > > > > > breakages would go unnoticed maybe is not a good idea to > fully > > > > remove > > > > > > > > bindings from the PR process and just streamline the process. > > > > > > > > > > > > > > > > Pedro. > > > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier < > > > carinme...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > Before any binding tests are moved to nightly, I think we > > need > > > to > > > > > > > figure > > > > > > > > > out how the community can get proper notifications of > failure > > > and > > > > > > > success > > > > > > > > > on those nightly runs. Otherwise, I think that breakages > > would > > > go > > > > > > > > > unnoticed. > > > > > > > > > > > > > > > > > > -Carin > > > > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy < > > > > > > > > pedro.larroy.li...@gmail.com > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi > > > > > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose the > > > > following > > > > > > > > action > > > > > > > > > > items to remedy the situation and accelerate turn around > > > times > > > > in > > > > > > CI, > > > > > > > > > > reduce cost, complexity and probability of failure > blocking > > > PRs > > > > > and > > > > > > > > > > frustrating developers: > > > > > > > > > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017. > > The > > > > > > > > > > build_windows.py infrastructure should easily work with > the > > > new > > > > > > > > version. > > > > > > > > > > Currently some PRs are blocked by this: > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958 > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295 > > > > > > > > > > * Move non-python bindings tests to nightly. If a commit > is > > > > > > touching > > > > > > > > > other > > > > > > > > > > bindings, the reviewer should ask for a full run which > can > > be > > > > > done > > > > > > > > > locally, > > > > > > > > > > use the label bot to trigger a full CI build, or defer to > > > > > nightly. > > > > > > > > > > * Provide a couple of basic sanity performance tests on > > small > > > > > > models > > > > > > > > that > > > > > > > > > > are run on CI and can be echoed by the label bot as a > > comment > > > > for > > > > > > > PRs. > > > > > > > > > > * Address unit tests that take more than 10-20s, > streamline > > > > them > > > > > or > > > > > > > > move > > > > > > > > > > them to nightly if it can't be done. > > > > > > > > > > * Open sourcing the remaining CI infrastructure scripts > so > > > the > > > > > > > > community > > > > > > > > > > can contribute. > > > > > > > > > > > > > > > > > > > > I think our goal should be turnaround under 30min. > > > > > > > > > > > > > > > > > > > > I would also like to touch base with the community that > > some > > > > PRs > > > > > > are > > > > > > > > not > > > > > > > > > > being followed up by committers asking for changes. For > > > example > > > > > > this > > > > > > > PR > > > > > > > > > is > > > > > > > > > > importtant and is hanging for a long time. > > > > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051 > > > > > > > > > > > > > > > > > > > > This is another, less important but more trivial to > review: > > > > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940 > > > > > > > > > > > > > > > > > > > > I think comitters requesting changes and not folllowing > up > > in > > > > > > > > reasonable > > > > > > > > > > time is not healthy for the project. I suggest > configuring > > > > github > > > > > > > > > > Notifications for a good SNR and following up. > > > > > > > > > > > > > > > > > > > > Regards. > > > > > > > > > > > > > > > > > > > > Pedro. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >