Re: CI and PRs

Chris Olivier Wed, 14 Aug 2019 17:13:04 -0700

+1

Rather than remove tests (which doesn’t scale as a solution), why not scale
them horizontally so that they finish more quickly? Across processes or
even on a pool of machines that aren’t necessarily the build machine?


On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.ab...@gmail.com>
wrote:

> With regards to time I rather prefer us spending a bit more time on
> maintenance than somebody running into an error that could've been caught
> with a test.
>
> I mean, our Publishing pipeline for Scala GPU has been broken for quite
> some time now, but nobody noticed that. Basically my stance on that matter
> is that as soon as something is not blocking, you can also just deactivate
> it since you don't have a forcing function in an open source project.
> People will rarely come back and fix the errors of some nightly test that
> they introduced.
>
> -Marco
>
> Carin Meier <carinme...@gmail.com> schrieb am Mi., 14. Aug. 2019, 21:59:
>
> > If a language binding test is failing for a not important reason, then it
> > is too brittle and needs to be fixed (we have fixed some of these with
> the
> > Clojure package [1]).
> > But in general, if we thinking of the MXNet project as one project that
> is
> > across all the language bindings, then we want to know if some
> fundamental
> > code change is going to break a downstream package.
> > I can't speak for all the high level package binding maintainers, but I'm
> > always happy to pitch in to provide code fixes to help the base PR get
> > green.
> >
> > The time costs to maintain such a large CI project obviously needs to be
> > considered as well.
> >
> > [1] https://github.com/apache/incubator-mxnet/pull/15579
> >
> > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > From what I have seen Clojure is 15 minutes, which I think is
> reasonable.
> > > The only question is that when a binding such as R, Perl or Clojure
> > fails,
> > > some devs are a bit confused about how to fix them since they are not
> > > familiar with the testing tools and the language.
> > >
> > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com>
> > wrote:
> > >
> > > > Great idea Marco! Anything that you think would be valuable to share
> > > would
> > > > be good. The duration of each node in the test stage sounds like a
> good
> > > > start.
> > > >
> > > > - Carin
> > > >
> > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > marco.g.ab...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > we record a bunch of metrics about run statistics (down to the
> > duration
> > > > of
> > > > > every individual step). If you tell me which ones you're
> particularly
> > > > > interested in (probably total duration of each node in the test
> > stage),
> > > > I'm
> > > > > happy to provide them.
> > > > >
> > > > > Dimensions are (in hierarchical order):
> > > > > - job
> > > > > - branch
> > > > > - stage
> > > > > - node
> > > > > - step
> > > > >
> > > > > Unfortunately I don't have the possibility to export them since we
> > > store
> > > > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > > >
> > > > > Best regards,
> > > > > Marco
> > > > >
> > > > > Carin Meier <carinme...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> > > 19:43:
> > > > >
> > > > > > I would prefer to keep the language binding in the PR process.
> > > Perhaps
> > > > we
> > > > > > could do some analytics to see how much each of the language
> > bindings
> > > > is
> > > > > > contributing to overall run time.
> > > > > > If we have some metrics on that, maybe we can come up with a
> > > guideline
> > > > of
> > > > > > how much time each should take. Another possibility is leverage
> the
> > > > > > parallel builds more.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > pedro.larroy.li...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Carin.
> > > > > > >
> > > > > > > That's a good point, all things considered would your
> preference
> > be
> > > > to
> > > > > > keep
> > > > > > > the Clojure tests as part of the PR process or in Nightly?
> > > > > > > Some options are having notifications here or in slack. But if
> we
> > > > think
> > > > > > > breakages would go unnoticed maybe is not a good idea to fully
> > > remove
> > > > > > > bindings from the PR process and just streamline the process.
> > > > > > >
> > > > > > > Pedro.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > carinme...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Before any binding tests are moved to nightly, I think we
> need
> > to
> > > > > > figure
> > > > > > > > out how the community can get proper notifications of failure
> > and
> > > > > > success
> > > > > > > > on those nightly runs. Otherwise, I think that breakages
> would
> > go
> > > > > > > > unnoticed.
> > > > > > > >
> > > > > > > > -Carin
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > pedro.larroy.li...@gmail.com
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi
> > > > > > > > >
> > > > > > > > > Seems we are hitting some problems in CI. I propose the
> > > following
> > > > > > > action
> > > > > > > > > items to remedy the situation and accelerate turn around
> > times
> > > in
> > > > > CI,
> > > > > > > > > reduce cost, complexity and probability of failure blocking
> > PRs
> > > > and
> > > > > > > > > frustrating developers:
> > > > > > > > >
> > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017.
> The
> > > > > > > > > build_windows.py infrastructure should easily work with the
> > new
> > > > > > > version.
> > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > * Move non-python bindings tests to nightly. If a commit is
> > > > > touching
> > > > > > > > other
> > > > > > > > > bindings, the reviewer should ask for a full run which can
> be
> > > > done
> > > > > > > > locally,
> > > > > > > > > use the label bot to trigger a full CI build, or defer to
> > > > nightly.
> > > > > > > > > * Provide a couple of basic sanity performance tests on
> small
> > > > > models
> > > > > > > that
> > > > > > > > > are run on CI and can be echoed by the label bot as a
> comment
> > > for
> > > > > > PRs.
> > > > > > > > > * Address unit tests that take more than 10-20s, streamline
> > > them
> > > > or
> > > > > > > move
> > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > * Open sourcing the remaining CI infrastructure scripts so
> > the
> > > > > > > community
> > > > > > > > > can contribute.
> > > > > > > > >
> > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > >
> > > > > > > > > I would also like to touch base with the community that
> some
> > > PRs
> > > > > are
> > > > > > > not
> > > > > > > > > being followed up by committers asking for changes. For
> > example
> > > > > this
> > > > > > PR
> > > > > > > > is
> > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > >
> > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > >
> > > > > > > > > This is another, less important but more trivial to review:
> > > > > > > > >
> > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > >
> > > > > > > > > I think comitters requesting changes and not folllowing up
> in
> > > > > > > reasonable
> > > > > > > > > time is not healthy for the project. I suggest configuring
> > > github
> > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > >
> > > > > > > > > Regards.
> > > > > > > > >
> > > > > > > > > Pedro.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Reply via email to