With regards to time I rather prefer us spending a bit more time on
maintenance than somebody running into an error that could've been caught
with a test.

I mean, our Publishing pipeline for Scala GPU has been broken for quite
some time now, but nobody noticed that. Basically my stance on that matter
is that as soon as something is not blocking, you can also just deactivate
it since you don't have a forcing function in an open source project.
People will rarely come back and fix the errors of some nightly test that
they introduced.

-Marco

Carin Meier <carinme...@gmail.com> schrieb am Mi., 14. Aug. 2019, 21:59:

> If a language binding test is failing for a not important reason, then it
> is too brittle and needs to be fixed (we have fixed some of these with the
> Clojure package [1]).
> But in general, if we thinking of the MXNet project as one project that is
> across all the language bindings, then we want to know if some fundamental
> code change is going to break a downstream package.
> I can't speak for all the high level package binding maintainers, but I'm
> always happy to pitch in to provide code fixes to help the base PR get
> green.
>
> The time costs to maintain such a large CI project obviously needs to be
> considered as well.
>
> [1] https://github.com/apache/incubator-mxnet/pull/15579
>
> On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.li...@gmail.com
> >
> wrote:
>
> > From what I have seen Clojure is 15 minutes, which I think is reasonable.
> > The only question is that when a binding such as R, Perl or Clojure
> fails,
> > some devs are a bit confused about how to fix them since they are not
> > familiar with the testing tools and the language.
> >
> > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com>
> wrote:
> >
> > > Great idea Marco! Anything that you think would be valuable to share
> > would
> > > be good. The duration of each node in the test stage sounds like a good
> > > start.
> > >
> > > - Carin
> > >
> > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> marco.g.ab...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > we record a bunch of metrics about run statistics (down to the
> duration
> > > of
> > > > every individual step). If you tell me which ones you're particularly
> > > > interested in (probably total duration of each node in the test
> stage),
> > > I'm
> > > > happy to provide them.
> > > >
> > > > Dimensions are (in hierarchical order):
> > > > - job
> > > > - branch
> > > > - stage
> > > > - node
> > > > - step
> > > >
> > > > Unfortunately I don't have the possibility to export them since we
> > store
> > > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > >
> > > > Best regards,
> > > > Marco
> > > >
> > > > Carin Meier <carinme...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> > 19:43:
> > > >
> > > > > I would prefer to keep the language binding in the PR process.
> > Perhaps
> > > we
> > > > > could do some analytics to see how much each of the language
> bindings
> > > is
> > > > > contributing to overall run time.
> > > > > If we have some metrics on that, maybe we can come up with a
> > guideline
> > > of
> > > > > how much time each should take. Another possibility is leverage the
> > > > > parallel builds more.
> > > > >
> > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Carin.
> > > > > >
> > > > > > That's a good point, all things considered would your preference
> be
> > > to
> > > > > keep
> > > > > > the Clojure tests as part of the PR process or in Nightly?
> > > > > > Some options are having notifications here or in slack. But if we
> > > think
> > > > > > breakages would go unnoticed maybe is not a good idea to fully
> > remove
> > > > > > bindings from the PR process and just streamline the process.
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> carinme...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Before any binding tests are moved to nightly, I think we need
> to
> > > > > figure
> > > > > > > out how the community can get proper notifications of failure
> and
> > > > > success
> > > > > > > on those nightly runs. Otherwise, I think that breakages would
> go
> > > > > > > unnoticed.
> > > > > > >
> > > > > > > -Carin
> > > > > > >
> > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > pedro.larroy.li...@gmail.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi
> > > > > > > >
> > > > > > > > Seems we are hitting some problems in CI. I propose the
> > following
> > > > > > action
> > > > > > > > items to remedy the situation and accelerate turn around
> times
> > in
> > > > CI,
> > > > > > > > reduce cost, complexity and probability of failure blocking
> PRs
> > > and
> > > > > > > > frustrating developers:
> > > > > > > >
> > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > > > > > > > build_windows.py infrastructure should easily work with the
> new
> > > > > > version.
> > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > * Move non-python bindings tests to nightly. If a commit is
> > > > touching
> > > > > > > other
> > > > > > > > bindings, the reviewer should ask for a full run which can be
> > > done
> > > > > > > locally,
> > > > > > > > use the label bot to trigger a full CI build, or defer to
> > > nightly.
> > > > > > > > * Provide a couple of basic sanity performance tests on small
> > > > models
> > > > > > that
> > > > > > > > are run on CI and can be echoed by the label bot as a comment
> > for
> > > > > PRs.
> > > > > > > > * Address unit tests that take more than 10-20s, streamline
> > them
> > > or
> > > > > > move
> > > > > > > > them to nightly if it can't be done.
> > > > > > > > * Open sourcing the remaining CI infrastructure scripts so
> the
> > > > > > community
> > > > > > > > can contribute.
> > > > > > > >
> > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > >
> > > > > > > > I would also like to touch base with the community that some
> > PRs
> > > > are
> > > > > > not
> > > > > > > > being followed up by committers asking for changes. For
> example
> > > > this
> > > > > PR
> > > > > > > is
> > > > > > > > importtant and is hanging for a long time.
> > > > > > > >
> > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > >
> > > > > > > > This is another, less important but more trivial to review:
> > > > > > > >
> > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > >
> > > > > > > > I think comitters requesting changes and not folllowing up in
> > > > > > reasonable
> > > > > > > > time is not healthy for the project. I suggest configuring
> > github
> > > > > > > > Notifications for a good SNR and following up.
> > > > > > > >
> > > > > > > > Regards.
> > > > > > > >
> > > > > > > > Pedro.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to