Hi,

thanks a lot for these great notes! I'm happy to give my comments about
them :)

* Archiving is *very VERY* bad for the CI master's performance. It floods the
disk because archived data is persisted. We are now at the point
where we technically can't extend the volume any further (we exceeded the
4TB limit and had to delete old runs). Thus, stashing is the only option
that doesn't harm the system's performance.

* Yeah, agree. One way is to build an image from your own Dockerfile, push it
to your own Docker Hub account, and then start the MXNet Dockerfile with "FROM
yourdockerhub:blabla".

* We support the GitHub Multi-Branch Pipeline and use it across
all jobs. Adhering to that system means the git repository
within the workspace is already scoped to the correct branch. As a rule of
thumb, it's a red flag as soon as your payload calls anything git-related
(e.g. checking out a different branch, creating a commit, merging another
branch, etc.). Happy to help if you would like to have
that elaborated.

* Could you elaborate on "Publishing scripts seem to need a security
refactor, or we don't bother offering stand-alone access to them; running
local versus on Jenkins."? I don't really understand what you mean here.

* It's an S3 bucket with a TTL of 30 days that our CI slaves have
permission to push to. We just upload the entire folder that is
being created. Is there anything specific you're looking for?

* That's awesome!
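
To make the S3 point above a bit more concrete, here is a minimal sketch of
what a 30-day expiry rule could look like as a lifecycle configuration. The
prefix, the rule ID and the helper name are made up for illustration and are
not our actual setup. The dict is meant to follow the shape that boto3's
put_bucket_lifecycle_configuration accepts for its LifecycleConfiguration
argument; treat that as an assumption to verify against the AWS docs.

```python
def build_expiry_lifecycle(prefix: str = "", days: int = 30) -> dict:
    """Build an S3 lifecycle configuration that expires objects after `days` days.

    `prefix` limits the rule to keys starting with that string; an empty
    prefix applies the rule to the whole bucket.
    """
    if days <= 0:
        raise ValueError("days must be a positive number of days")
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical prefix; the real bucket layout is not described in this thread.
config = build_expiry_lifecycle(prefix="docs-artifacts/", days=30)
```

With something like this in place, the upload side does not have to care
about cleanup at all: it only pushes the folder, and the bucket drops old
objects by itself.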

Best regards,
Marco

On Thu, Aug 15, 2019 at 8:52 PM Aaron Markham <aaron.s.mark...@gmail.com>
wrote:

> I'll start a different thread about the website. Sure, there's a lot
> of overlap with CI. I learned a lot in the last few weeks having to
> iterate on 7 different docs packages and trying to streamline the
> build process in CI.
>
> Here are my notes:
>
> * Stash operations vs. archiving - recommendations in the docs suggest
> that large artifacts should be archived; stash is super slow; archived
> artifacts seem to be faster and can be used between pipelines. This
> is helpful for the MXNet binary and for the Scala package, both of
> which are used by various other docs packages. However, there's an
> implication with the master server. Archived artifacts are stored
> there, so if the pipeline is related to PR validation, this would be
> unwieldy. If related to publishing final artifacts for specific
> versions, well, that's probably ok.
>
> * It seems that efficiency in development and testing can be gained by
> checkpointing the docker containers after the dependencies are
> installed. I can't stress how much time is lost while watching
> `apt-get update` run for the millionth time when testing new CI
> routines. It sort of makes me crazy(er).
>
> * A version/branch parameter would be useful for the Jenkins pipelines
> for generating docs artifacts from different branches.
>
> * Publishing scripts seem to need a security refactor, or we don't
> bother offering stand-alone access to them; running local versus on
> Jenkins.
>
> * I don't see any documentation on the S3 publishing steps and how to test
> this.
>
> * After breaking out each docs package in its own pipeline, I see
> opportunities to use the GitHub API to check the PR payload and be
> selective about what tests to run.
>
>
> On Wed, Aug 14, 2019 at 10:03 PM Zhao, Patric <patric.z...@intel.com>
> wrote:
> >
> > Hi Aaron,
> >
> > We have recently been working on improving the CPU backend documentation
> > based on the current website.
> >
> > I saw there are several PRs updating the new website, which is really
> > great.
> >
> > Thus, I'd like to know when the new website will go online.
> > If that is very soon, we will switch our work to the new website.
> >
> > Thanks,
> >
> > --Patric
> >
> >
> > > -----Original Message-----
> > > From: Aaron Markham <aaron.s.mark...@gmail.com>
> > > Sent: Thursday, August 15, 2019 11:40 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: CI and PRs
> > >
> > > The PRs Thomas and I are working on for the new docs and website share
> > > the mxnet binary in the new CI pipelines we made. Speeds things up a
> lot.
> > >
> > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivie...@gmail.com>
> wrote:
> > >
> > > > I see it done daily now, and while I can’t share all the details,
> it’s
> > > > not an incredibly complex thing, and involves not much more than
> > > > nfs/efs sharing and remote ssh commands.  All it takes is a little
> > > > ingenuity and some imagination.
> > > >
> > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy
> > > > <pedro.larroy.li...@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Sounds good in theory. I think there are complex details with
> > > > > regards of resource sharing during parallel execution. Still I
> think
> > > > > both ways can
> > > > be
> > > > > explored. I think some tests run for unreasonably long times for
> > > > > what
> > > > they
> > > > > are doing. We already scale parts of the pipeline horizontally
> > > > > across workers.
> > > > >
> > > > >
> > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier
> > > > > <cjolivie...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Rather than remove tests (which doesn’t scale as a solution), why
> > > > > > not
> > > > > scale
> > > > > > them horizontally so that they finish more quickly? Across
> > > > > > processes or even on a pool of machines that aren’t necessarily
> the
> > > build machine?
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > > marco.g.ab...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > With regards to time I rather prefer us spending a bit more
> time
> > > > > > > on maintenance than somebody running into an error that
> could've
> > > > > > > been
> > > > > caught
> > > > > > > with a test.
> > > > > > >
> > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> > > > > > > for
> > > > quite
> > > > > > > some time now, but nobody noticed that. Basically my stance on
> > > > > > > that
> > > > > > matter
> > > > > > > is that as soon as something is not blocking, you can also just
> > > > > > deactivate
> > > > > > > it since you don't have a forcing function in an open source
> project.
> > > > > > > People will rarely come back and fix the errors of some nightly
> > > > > > > test
> > > > > that
> > > > > > > they introduced.
> > > > > > >
> > > > > > > -Marco
> > > > > > >
> > > > > > > Carin Meier <carinme...@gmail.com> schrieb am Mi., 14. Aug.
> > > > > > > 2019,
> > > > > 21:59:
> > > > > > >
> > > > > > > > > If a language binding test is failing for an unimportant
> > > > > > > > reason,
> > > > > then
> > > > > > it
> > > > > > > > is too brittle and needs to be fixed (we have fixed some of
> > > > > > > > these
> > > > > with
> > > > > > > the
> > > > > > > > Clojure package [1]).
> > > > > > > > > But in general, if we are thinking of the MXNet project as one
> > > > > > > > project
> > > > > that
> > > > > > > is
> > > > > > > > across all the language bindings, then we want to know if
> some
> > > > > > > fundamental
> > > > > > > > code change is going to break a downstream package.
> > > > > > > > I can't speak for all the high level package binding
> > > > > > > > maintainers,
> > > > but
> > > > > > I'm
> > > > > > > > always happy to pitch in to provide code fixes to help the
> > > > > > > > base PR
> > > > > get
> > > > > > > > green.
> > > > > > > >
> > > > > > > > The time costs to maintain such a large CI project obviously
> > > > > > > > needs
> > > > to
> > > > > > be
> > > > > > > > considered as well.
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > > > pedro.larroy.li...@gmail.com
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > From what I have seen Clojure is 15 minutes, which I think
> > > > > > > > > is
> > > > > > > reasonable.
> > > > > > > > > The only question is that when a binding such as R, Perl or
> > > > Clojure
> > > > > > > > fails,
> > > > > > > > > some devs are a bit confused about how to fix them since
> > > > > > > > > they are
> > > > > not
> > > > > > > > > familiar with the testing tools and the language.
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > > > carinme...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Great idea Marco! Anything that you think would be
> > > > > > > > > > valuable to
> > > > > > share
> > > > > > > > > would
> > > > > > > > > > be good. The duration of each node in the test stage
> > > > > > > > > > sounds
> > > > like
> > > > > a
> > > > > > > good
> > > > > > > > > > start.
> > > > > > > > > >
> > > > > > > > > > - Carin
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > > > > marco.g.ab...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > we record a bunch of metrics about run statistics (down
> > > > > > > > > > > to
> > > > the
> > > > > > > > duration
> > > > > > > > > > of
> > > > > > > > > > > every individual step). If you tell me which ones
> you're
> > > > > > > particularly
> > > > > > > > > > > interested in (probably total duration of each node in
> > > > > > > > > > > the
> > > > test
> > > > > > > > stage),
> > > > > > > > > > I'm
> > > > > > > > > > > happy to provide them.
> > > > > > > > > > >
> > > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > > - job
> > > > > > > > > > > - branch
> > > > > > > > > > > - stage
> > > > > > > > > > > - node
> > > > > > > > > > > - step
> > > > > > > > > > >
> > > > > > > > > > > Unfortunately I don't have the possibility to export
> > > > > > > > > > > them
> > > > since
> > > > > > we
> > > > > > > > > store
> > > > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer
> raw
> > > > > exports.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Marco
> > > > > > > > > > >
> > > > > > > > > > > Carin Meier <carinme...@gmail.com> schrieb am Mi.,
> 14. Aug.
> > > > > > 2019,
> > > > > > > > > 19:43:
> > > > > > > > > > >
> > > > > > > > > > > > I would prefer to keep the language binding in the PR
> > > > > process.
> > > > > > > > > Perhaps
> > > > > > > > > > we
> > > > > > > > > > > > could do some analytics to see how much each of the
> > > > language
> > > > > > > > bindings
> > > > > > > > > > is
> > > > > > > > > > > > contributing to overall run time.
> > > > > > > > > > > > If we have some metrics on that, maybe we can come up
> > > > > > > > > > > > with
> > > > a
> > > > > > > > > guideline
> > > > > > > > > > of
> > > > > > > > > > > > how much time each should take. Another possibility
> is
> > > > > leverage
> > > > > > > the
> > > > > > > > > > > > parallel builds more.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > > > > pedro.larroy.li...@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > > >
> > > > > > > > > > > > > That's a good point, all things considered would
> > > > > > > > > > > > > your
> > > > > > > preference
> > > > > > > > be
> > > > > > > > > > to
> > > > > > > > > > > > keep
> > > > > > > > > > > > > the Clojure tests as part of the PR process or in
> > > > Nightly?
> > > > > > > > > > > > > Some options are having notifications here or in
> slack.
> > > > But
> > > > > > if
> > > > > > > we
> > > > > > > > > > think
> > > > > > > > > > > > > > breakages would go unnoticed, maybe it is not a good
> > > > > > > > > > > > > idea to
> > > > > > fully
> > > > > > > > > remove
> > > > > > > > > > > > > bindings from the PR process and just streamline
> the
> > > > > process.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > > > > carinme...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Before any binding tests are moved to nightly, I
> > > > > > > > > > > > > > think
> > > > we
> > > > > > > need
> > > > > > > > to
> > > > > > > > > > > > figure
> > > > > > > > > > > > > > out how the community can get proper
> notifications
> > > > > > > > > > > > > > of
> > > > > > failure
> > > > > > > > and
> > > > > > > > > > > > success
> > > > > > > > > > > > > > on those nightly runs. Otherwise, I think that
> > > > breakages
> > > > > > > would
> > > > > > > > go
> > > > > > > > > > > > > > unnoticed.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Carin
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > > > > > > pedro.larroy.li...@gmail.com
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I
> > > > > > > > > > > > > > > propose
> > > > the
> > > > > > > > > following
> > > > > > > > > > > > > action
> > > > > > > > > > > > > > > items to remedy the situation and accelerate
> > > > > > > > > > > > > > > turn
> > > > > around
> > > > > > > > times
> > > > > > > > > in
> > > > > > > > > > > CI,
> > > > > > > > > > > > > > > reduce cost, complexity and probability of
> > > > > > > > > > > > > > > failure
> > > > > > blocking
> > > > > > > > PRs
> > > > > > > > > > and
> > > > > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to
> > > > > > > > > > > > > > > VS
> > > > > 2017.
> > > > > > > The
> > > > > > > > > > > > > > > build_windows.py infrastructure should easily
> > > > > > > > > > > > > > > work
> > > > with
> > > > > > the
> > > > > > > > new
> > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > > > > >
> > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
> Tracked
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > >
> > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > > * Move non-python bindings tests to nightly. If
> > > > > > > > > > > > > > > a
> > > > > commit
> > > > > > is
> > > > > > > > > > > touching
> > > > > > > > > > > > > > other
> > > > > > > > > > > > > > > bindings, the reviewer should ask for a full
> run
> > > > which
> > > > > > can
> > > > > > > be
> > > > > > > > > > done
> > > > > > > > > > > > > > locally,
> > > > > > > > > > > > > > > use the label bot to trigger a full CI build,
> or
> > > > defer
> > > > > to
> > > > > > > > > > nightly.
> > > > > > > > > > > > > > > * Provide a couple of basic sanity performance
> > > > > > > > > > > > > > > tests
> > > > on
> > > > > > > small
> > > > > > > > > > > models
> > > > > > > > > > > > > that
> > > > > > > > > > > > > > > are run on CI and can be echoed by the label
> bot
> > > > > > > > > > > > > > > as a
> > > > > > > comment
> > > > > > > > > for
> > > > > > > > > > > > PRs.
> > > > > > > > > > > > > > > * Address unit tests that take more than
> 10-20s,
> > > > > > streamline
> > > > > > > > > them
> > > > > > > > > > or
> > > > > > > > > > > > > move
> > > > > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> > > > scripts
> > > > > > so
> > > > > > > > the
> > > > > > > > > > > > > community
> > > > > > > > > > > > > > > can contribute.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think our goal should be turnaround under
> 30min.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would also like to touch base with the
> > > > > > > > > > > > > > > community
> > > > that
> > > > > > > some
> > > > > > > > > PRs
> > > > > > > > > > > are
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > being followed up by committers asking for
> changes.
> > > > For
> > > > > > > > example
> > > > > > > > > > > this
> > > > > > > > > > > > PR
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > important and has been hanging for a long time.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This is another, less important but more
> trivial
> > > > > > > > > > > > > > > to
> > > > > > review:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think committers requesting changes and not
> > > > following
> > > > > > up
> > > > > > > in
> > > > > > > > > > > > > reasonable
> > > > > > > > > > > > > > > time is not healthy for the project. I suggest
> > > > > > configuring
> > > > > > > > > github
> > > > > > > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
>
