Re: CI and PRs

Marco de Abreu Thu, 15 Aug 2019 21:23:43 -0700

It reruns as soon as that particular script has been modified. Since
the following steps depend on it, once step 4 has a cache mismatch,
steps 5-15 are no longer valid either.
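
A minimal sketch of why one changed step invalidates everything after it.
It is a simplified model, not Docker's actual implementation: it assumes
each step's cache key is a hash over the parent key, the instruction text
and the content of the script it runs. The names layer_keys and
first_cache_miss, the base image string and the step contents are made up
for illustration.

```python
import hashlib


def layer_keys(steps, base="ubuntu:16.04"):
    """Return one cache key per build step.

    Each key hashes the parent key together with the step's instruction and
    the content it depends on, so a change in one step alters every later key.
    """
    parent = hashlib.sha256(base.encode()).hexdigest()
    keys = []
    for instruction, content in steps:
        parent = hashlib.sha256(
            f"{parent}|{instruction}|{content}".encode()
        ).hexdigest()
        keys.append(parent)
    return keys


def first_cache_miss(cached, current):
    """Return the 1-based number of the first step to rebuild, or None."""
    for number, (old, new) in enumerate(zip(cached, current), start=1):
        if old != new:
            return number
    if len(current) > len(cached):
        return len(cached) + 1
    return None


# Hypothetical 15-step build in which only the script used by step 4 changes.
steps = [(f"RUN /work/script_{n}.sh", "v1") for n in range(1, 16)]
cached = layer_keys(steps)
steps[3] = (steps[3][0], "v2")
current = layer_keys(steps)

miss = first_cache_miss(cached, current)
stale = [n for n, (old, new) in enumerate(zip(cached, current), start=1)
         if old != new]
print(miss, stale)  # 4 [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
```

Steps 1-3 keep their keys and stay cached; every step from 4 onwards gets
a new key even though only one script changed.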


Our cache is always controlled by master. This means that the only thing
that matters is the diff between your branch and master, not whether the
branch has already been run once. A single Jenkins run juggles more than
100 GB of Docker images. If we held a cache that recorded every single
occurrence, the storage requirements and traffic would be very expensive.
Thus, the most efficient and least error-prone approach was to make master
the branch that defines the cache.
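
The consequence for a branch can be sketched like this. It is a toy model
under stated assumptions, not the actual CI code: MasterCache, publish and
steps_to_rebuild are hypothetical names, and the layer keys are plain
strings chosen for illustration.

```python
class MasterCache:
    """Toy cache that only ever stores the layers produced by master."""

    def __init__(self):
        self._layers = set()

    def publish(self, branch, layer_keys):
        """Store layers, but only when the build ran on master."""
        if branch != "master":
            return False  # runs on other branches never update the cache
        self._layers = set(layer_keys)
        return True

    def steps_to_rebuild(self, layer_keys):
        """Return the 1-based steps whose layers are not in the cache."""
        return [n for n, key in enumerate(layer_keys, start=1)
                if key not in self._layers]


cache = MasterCache()
cache.publish("master", ["base", "core-v1", "python-v1", "docs-v1"])

# A PR that modifies the script behind step 2; later layers change with it.
pr_layers = ["base", "core-v2", "python-v2", "docs-v2"]

first_run = cache.steps_to_rebuild(pr_layers)
cache.publish("my-branch", pr_layers)  # ignored: not master
second_run = cache.steps_to_rebuild(pr_layers)
print(first_run, second_run)  # [2, 3, 4] [2, 3, 4]
```

In this model the second run on the branch rebuilds the same steps as the
first; only after the change is merged and master publishes the new layers
does steps_to_rebuild return an empty list.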

-Marco

Aaron Markham <[email protected]> wrote on Fri, 16 Aug 2019, 04:06:

> When you create a new Dockerfile and use that on CI, it doesn't seem
> to cache some of the steps... like this:
>
> Step 13/15 : RUN /work/ubuntu_docs.sh
>  ---> Running in a1e522f3283b
> + echo 'Installing dependencies...'
> + apt-get update
> Installing dependencies.
>
> Or this....
>
> Step 4/13 : RUN /work/ubuntu_core.sh
>  ---> Running in e7882d7aa750
> + apt-get update
>
> I get if I was changing those scripts, but then I'd think it should
> cache after running it once... but, no.
>
>
> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <[email protected]>
> wrote:
> >
> > Do I understand it correctly that you are saying that the Docker cache
> > doesn't work properly and regularly reinstalls dependencies? Or do you
> > mean that you only have cache misses when you modify the dependencies -
> > which would be expected?
> >
> > -Marco
> >
> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <[email protected]>
> > wrote:
> >
> > > Many of the CI pipelines follow this pattern:
> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> > > repeat steps 1-3 over and over?
> > >
> > > Now, some tests use a stashed binary and docker cache. And I see this
> > > work locally, but for the most part, on CI, you're gonna sit through a
> > > dependency install.
> > >
> > > I noticed that almost all jobs use an ubuntu setup that is fully
> > > loaded. Without cache, it can take 10 or more minutes to build.  So I
> > > made a lite version. Takes only a few minutes instead.
> > >
> > > In some cases archiving worked great to share across pipelines, but as
> > > Marco mentioned we need a storage solution to make that happen. We
> > > can't archive every intermediate artifact for each PR.
> > >
> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <[email protected]>
> > > wrote:
> > >
> > > > Hi Aaron. Why does that speed things up? What's the difference?
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <[email protected]>
> > > > wrote:
> > > >
> > > > > The PRs Thomas and I are working on for the new docs and website
> > > > > share the mxnet binary in the new CI pipelines we made. Speeds
> > > > > things up a lot.
> > > > >
> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > I see it done daily now, and while I can’t share all the details,
> > > > > > it’s not an incredibly complex thing, and involves not much more
> > > > > > than nfs/efs sharing and remote ssh commands.  All it takes is a
> > > > > > little ingenuity and some imagination.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Sounds good in theory. I think there are complex details with
> > > > > > > regard to resource sharing during parallel execution. Still, I
> > > > > > > think both ways can be explored. I think some tests run for
> > > > > > > unreasonably long times for what they are doing. We already
> > > > > > > scale parts of the pipeline horizontally across workers.
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > Rather than remove tests (which doesn’t scale as a solution),
> > > > > > > > why not scale them horizontally so that they finish more
> > > > > > > > quickly? Across processes or even on a pool of machines that
> > > > > > > > aren’t necessarily the build machine?
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu
> > > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > With regard to time, I'd rather have us spend a bit more
> > > > > > > > > time on maintenance than have somebody run into an error
> > > > > > > > > that could've been caught with a test.
> > > > > > > > >
> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been
> > > > > > > > > broken for quite some time now, but nobody noticed that.
> > > > > > > > > Basically my stance on that matter is that as soon as
> > > > > > > > > something is not blocking, you can also just deactivate it
> > > > > > > > > since you don't have a forcing function in an open source
> > > > > > > > > project. People will rarely come back and fix the errors of
> > > > > > > > > some nightly test that they introduced.
> > > > > > > > >
> > > > > > > > > -Marco
> > > > > > > > >
> > > > > > > > > Carin Meier <[email protected]> wrote on Wed, 14 Aug
> > > > > > > > > 2019, 21:59:
> > > > > > > > >
> > > > > > > > > > If a language binding test is failing for an unimportant
> > > > > > > > > > reason, then it is too brittle and needs to be fixed (we
> > > > > > > > > > have fixed some of these with the Clojure package [1]).
> > > > > > > > > > But in general, if we think of the MXNet project as one
> > > > > > > > > > project that spans all the language bindings, then we
> > > > > > > > > > want to know if some fundamental code change is going to
> > > > > > > > > > break a downstream package.
> > > > > > > > > > I can't speak for all the high level package binding
> > > > > > > > > > maintainers, but I'm always happy to pitch in to provide
> > > > > > > > > > code fixes to help the base PR get green.
> > > > > > > > > >
> > > > > > > > > > The time costs to maintain such a large CI project
> > > > > > > > > > obviously need to be considered as well.
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy
> > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > From what I have seen Clojure is 15 minutes, which I
> > > > > > > > > > > think is reasonable. The only question is that when a
> > > > > > > > > > > binding such as R, Perl or Clojure fails, some devs
> > > > > > > > > > > are a bit confused about how to fix them since they
> > > > > > > > > > > are not familiar with the testing tools and the
> > > > > > > > > > > language.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier
> > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Great idea Marco! Anything that you think would be
> > > > > > > > > > > > valuable to share would be good. The duration of
> > > > > > > > > > > > each node in the test stage sounds like a good
> > > > > > > > > > > > start.
> > > > > > > > > > > >
> > > > > > > > > > > > - Carin
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu
> > > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > we record a bunch of metrics about run statistics
> > > > > > > > > > > > > (down to the duration of every individual step).
> > > > > > > > > > > > > If you tell me which ones you're particularly
> > > > > > > > > > > > > interested in (probably total duration of each
> > > > > > > > > > > > > node in the test stage), I'm happy to provide
> > > > > > > > > > > > > them.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > > > > - job
> > > > > > > > > > > > > - branch
> > > > > > > > > > > > > - stage
> > > > > > > > > > > > > - node
> > > > > > > > > > > > > - step
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unfortunately I don't have the possibility to
> > > > > > > > > > > > > export them since we store them in CloudWatch
> > > > > > > > > > > > > Metrics which afaik doesn't offer raw exports.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Marco
> > > > > > > > > > > > >
> > > > > > > > > > > > > Carin Meier <[email protected]> wrote on Wed,
> > > > > > > > > > > > > 14 Aug 2019, 19:43:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I would prefer to keep the language binding in
> > > > > > > > > > > > > > the PR process. Perhaps we could do some
> > > > > > > > > > > > > > analytics to see how much each of the language
> > > > > > > > > > > > > > bindings is contributing to overall run time.
> > > > > > > > > > > > > > If we have some metrics on that, maybe we can
> > > > > > > > > > > > > > come up with a guideline of how much time each
> > > > > > > > > > > > > > should take. Another possibility is to leverage
> > > > > > > > > > > > > > the parallel builds more.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy
> > > > > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's a good point. All things considered,
> > > > > > > > > > > > > > > would your preference be to keep the Clojure
> > > > > > > > > > > > > > > tests as part of the PR process or in Nightly?
> > > > > > > > > > > > > > > Some options are having notifications here or
> > > > > > > > > > > > > > > in Slack. But if we think breakages would go
> > > > > > > > > > > > > > > unnoticed, maybe it is not a good idea to
> > > > > > > > > > > > > > > fully remove bindings from the PR process; we
> > > > > > > > > > > > > > > could just streamline the process instead.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier
> > > > > > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Before any binding tests are moved to
> > > > > > > > > > > > > > > > nightly, I think we need to figure out how
> > > > > > > > > > > > > > > > the community can get proper notifications
> > > > > > > > > > > > > > > > of failure and success on those nightly
> > > > > > > > > > > > > > > > runs. Otherwise, I think that breakages
> > > > > > > > > > > > > > > > would go unnoticed.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -Carin
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy
> > > > > > > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I
> > > > > > > > > > > > > > > > > propose the following action items to remedy
> > > > > > > > > > > > > > > > > the situation and accelerate turnaround
> > > > > > > > > > > > > > > > > times in CI, reduce cost, complexity and
> > > > > > > > > > > > > > > > > probability of failure blocking PRs and
> > > > > > > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > * Upgrade Windows Visual Studio from VS 2015
> > > > > > > > > > > > > > > > > to VS 2017. The build_windows.py
> > > > > > > > > > > > > > > > > infrastructure should easily work with the
> > > > > > > > > > > > > > > > > new version. Currently some PRs are blocked
> > > > > > > > > > > > > > > > > by this:
> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
> > > > > > > > > > > > > > > > > Tracked at
> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > > > > * Move non-python bindings tests to nightly.
> > > > > > > > > > > > > > > > > If a commit is touching other bindings, the
> > > > > > > > > > > > > > > > > reviewer should ask for a full run which can
> > > > > > > > > > > > > > > > > be done locally, use the label bot to
> > > > > > > > > > > > > > > > > trigger a full CI build, or defer to
> > > > > > > > > > > > > > > > > nightly.
> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity
> > > > > > > > > > > > > > > > > performance tests on small models that are
> > > > > > > > > > > > > > > > > run on CI and can be echoed by the label bot
> > > > > > > > > > > > > > > > > as a comment for PRs.
> > > > > > > > > > > > > > > > > * Address unit tests that take more than
> > > > > > > > > > > > > > > > > 10-20s, streamline them or move them to
> > > > > > > > > > > > > > > > > nightly if it can't be done.
> > > > > > > > > > > > > > > > > * Open source the remaining CI
> > > > > > > > > > > > > > > > > infrastructure scripts so the community can
> > > > > > > > > > > > > > > > > contribute.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think our goal should be turnaround under
> > > > > > > > > > > > > > > > > 30min.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I would also like to touch base with the
> > > > > > > > > > > > > > > > > community: some PRs are not being followed
> > > > > > > > > > > > > > > > > up by the committers who asked for changes.
> > > > > > > > > > > > > > > > > For example, this PR is important and has
> > > > > > > > > > > > > > > > > been hanging for a long time.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This is another, less important but more
> > > > > > > > > > > > > > > > > trivial to review:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think committers requesting changes and
> > > > > > > > > > > > > > > > > not following up in reasonable time is not
> > > > > > > > > > > > > > > > > healthy for the project. I suggest
> > > > > > > > > > > > > > > > > configuring GitHub notifications for a good
> > > > > > > > > > > > > > > > > SNR and following up.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>
