To parallelize across machines: For GluonNLP we started submitting test
jobs to AWS Batch. Just adding a for-loop over the units in the
Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
Jenkins just waits for all jobs to finish and retrieves their status.
This works since AWS Batch added GPU support this April [3].

For MXNet, naively parallelizing over the files defining the test cases
that are in the longest running Pipeline stage may already help?

[1]: 
https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
[2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
[3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/

Marco de Abreu <marco.g.ab...@gmail.com> writes:

> The first start wrt parallelization could certainly be start adding
> parallel test execution in nosetests.
>
> -Marco
>
> Aaron Markham <aaron.s.mark...@gmail.com> schrieb am Do., 15. Aug. 2019,
> 05:39:
>
>> The PRs Thomas and I are working on for the new docs and website share the
>> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>>
>> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivie...@gmail.com> wrote:
>>
>> > I see it done daily now, and while I can’t share all the details, it’s
>> not
>> > an incredibly complex thing, and involves not much more than nfs/efs
>> > sharing and remote ssh commands.  All it takes is a little ingenuity and
>> > some imagination.
>> >
>> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
>> pedro.larroy.li...@gmail.com
>> > >
>> > wrote:
>> >
>> > > Sounds good in theory. I think there are complex details with regards
>> of
>> > > resource sharing during parallel execution. Still I think both ways can
>> > be
>> > > explored. I think some tests run for unreasonably long times for what
>> > they
>> > > are doing. We already scale parts of the pipeline horizontally across
>> > > workers.
>> > >
>> > >
>> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivie...@apache.org>
>> > > wrote:
>> > >
>> > > > +1
>> > > >
>> > > > Rather than remove tests (which doesn’t scale as a solution), why not
>> > > scale
>> > > > them horizontally so that they finish more quickly? Across processes
>> or
>> > > > even on a pool of machines that aren’t necessarily the build machine?
>> > > >
>> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
>> > marco.g.ab...@gmail.com
>> > > >
>> > > > wrote:
>> > > >
>> > > > > With regards to time I rather prefer us spending a bit more time on
>> > > > > maintenance than somebody running into an error that could've been
>> > > caught
>> > > > > with a test.
>> > > > >
>> > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
>> > quite
>> > > > > some time now, but nobody noticed that. Basically my stance on that
>> > > > matter
>> > > > > is that as soon as something is not blocking, you can also just
>> > > > deactivate
>> > > > > it since you don't have a forcing function in an open source
>> project.
>> > > > > People will rarely come back and fix the errors of some nightly
>> test
>> > > that
>> > > > > they introduced.
>> > > > >
>> > > > > -Marco
>> > > > >
>> > > > > Carin Meier <carinme...@gmail.com> schrieb am Mi., 14. Aug. 2019,
>> > > 21:59:
>> > > > >
>> > > > > > If a language binding test is failing for a not important reason,
>> > > then
>> > > > it
>> > > > > > is too brittle and needs to be fixed (we have fixed some of these
>> > > with
>> > > > > the
>> > > > > > Clojure package [1]).
>> > > > > > But in general, if we thinking of the MXNet project as one
>> project
>> > > that
>> > > > > is
>> > > > > > across all the language bindings, then we want to know if some
>> > > > > fundamental
>> > > > > > code change is going to break a downstream package.
>> > > > > > I can't speak for all the high level package binding maintainers,
>> > but
>> > > > I'm
>> > > > > > always happy to pitch in to provide code fixes to help the base
>> PR
>> > > get
>> > > > > > green.
>> > > > > >
>> > > > > > The time costs to maintain such a large CI project obviously
>> needs
>> > to
>> > > > be
>> > > > > > considered as well.
>> > > > > >
>> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
>> > > > > >
>> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
>> > > > > pedro.larroy.li...@gmail.com
>> > > > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > From what I have seen Clojure is 15 minutes, which I think is
>> > > > > reasonable.
>> > > > > > > The only question is that when a binding such as R, Perl or
>> > Clojure
>> > > > > > fails,
>> > > > > > > some devs are a bit confused about how to fix them since they
>> are
>> > > not
>> > > > > > > familiar with the testing tools and the language.
>> > > > > > >
>> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
>> > carinme...@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Great idea Marco! Anything that you think would be valuable
>> to
>> > > > share
>> > > > > > > would
>> > > > > > > > be good. The duration of each node in the test stage sounds
>> > like
>> > > a
>> > > > > good
>> > > > > > > > start.
>> > > > > > > >
>> > > > > > > > - Carin
>> > > > > > > >
>> > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
>> > > > > > marco.g.ab...@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi,
>> > > > > > > > >
>> > > > > > > > > we record a bunch of metrics about run statistics (down to
>> > the
>> > > > > > duration
>> > > > > > > > of
>> > > > > > > > > every individual step). If you tell me which ones you're
>> > > > > particularly
>> > > > > > > > > interested in (probably total duration of each node in the
>> > test
>> > > > > > stage),
>> > > > > > > > I'm
>> > > > > > > > > happy to provide them.
>> > > > > > > > >
>> > > > > > > > > Dimensions are (in hierarchical order):
>> > > > > > > > > - job
>> > > > > > > > > - branch
>> > > > > > > > > - stage
>> > > > > > > > > - node
>> > > > > > > > > - step
>> > > > > > > > >
>> > > > > > > > > Unfortunately I don't have the possibility to export them
>> > since
>> > > > we
>> > > > > > > store
>> > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
>> > > exports.
>> > > > > > > > >
>> > > > > > > > > Best regards,
>> > > > > > > > > Marco
>> > > > > > > > >
>> > > > > > > > > Carin Meier <carinme...@gmail.com> schrieb am Mi., 14.
>> Aug.
>> > > > 2019,
>> > > > > > > 19:43:
>> > > > > > > > >
>> > > > > > > > > > I would prefer to keep the language binding in the PR
>> > > process.
>> > > > > > > Perhaps
>> > > > > > > > we
>> > > > > > > > > > could do some analytics to see how much each of the
>> > language
>> > > > > > bindings
>> > > > > > > > is
>> > > > > > > > > > contributing to overall run time.
>> > > > > > > > > > If we have some metrics on that, maybe we can come up
>> with
>> > a
>> > > > > > > guideline
>> > > > > > > > of
>> > > > > > > > > > how much time each should take. Another possibility is
>> > > leverage
>> > > > > the
>> > > > > > > > > > parallel builds more.
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
>> > > > > > > > > pedro.larroy.li...@gmail.com
>> > > > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi Carin.
>> > > > > > > > > > >
>> > > > > > > > > > > That's a good point, all things considered would your
>> > > > > preference
>> > > > > > be
>> > > > > > > > to
>> > > > > > > > > > keep
>> > > > > > > > > > > the Clojure tests as part of the PR process or in
>> > Nightly?
>> > > > > > > > > > > Some options are having notifications here or in slack.
>> > But
>> > > > if
>> > > > > we
>> > > > > > > > think
>> > > > > > > > > > > breakages would go unnoticed maybe is not a good idea
>> to
>> > > > fully
>> > > > > > > remove
>> > > > > > > > > > > bindings from the PR process and just streamline the
>> > > process.
>> > > > > > > > > > >
>> > > > > > > > > > > Pedro.
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
>> > > > > > carinme...@gmail.com>
>> > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Before any binding tests are moved to nightly, I
>> think
>> > we
>> > > > > need
>> > > > > > to
>> > > > > > > > > > figure
>> > > > > > > > > > > > out how the community can get proper notifications of
>> > > > failure
>> > > > > > and
>> > > > > > > > > > success
>> > > > > > > > > > > > on those nightly runs. Otherwise, I think that
>> > breakages
>> > > > > would
>> > > > > > go
>> > > > > > > > > > > > unnoticed.
>> > > > > > > > > > > >
>> > > > > > > > > > > > -Carin
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
>> > > > > > > > > > > pedro.larroy.li...@gmail.com
>> > > > > > > > > > > > >
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose
>> > the
>> > > > > > > following
>> > > > > > > > > > > action
>> > > > > > > > > > > > > items to remedy the situation and accelerate turn
>> > > around
>> > > > > > times
>> > > > > > > in
>> > > > > > > > > CI,
>> > > > > > > > > > > > > reduce cost, complexity and probability of failure
>> > > > blocking
>> > > > > > PRs
>> > > > > > > > and
>> > > > > > > > > > > > > frustrating developers:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS
>> > > 2017.
>> > > > > The
>> > > > > > > > > > > > > build_windows.py infrastructure should easily work
>> > with
>> > > > the
>> > > > > > new
>> > > > > > > > > > > version.
>> > > > > > > > > > > > > Currently some PRs are blocked by this:
>> > > > > > > > > > > > >
>> > https://github.com/apache/incubator-mxnet/issues/13958
>> > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
>> > > > > > > > > > > > >
>> > https://github.com/apache/incubator-mxnet/issues/15295
>> > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a
>> > > commit
>> > > > is
>> > > > > > > > > touching
>> > > > > > > > > > > > other
>> > > > > > > > > > > > > bindings, the reviewer should ask for a full run
>> > which
>> > > > can
>> > > > > be
>> > > > > > > > done
>> > > > > > > > > > > > locally,
>> > > > > > > > > > > > > use the label bot to trigger a full CI build, or
>> > defer
>> > > to
>> > > > > > > > nightly.
>> > > > > > > > > > > > > * Provide a couple of basic sanity performance
>> tests
>> > on
>> > > > > small
>> > > > > > > > > models
>> > > > > > > > > > > that
>> > > > > > > > > > > > > are run on CI and can be echoed by the label bot
>> as a
>> > > > > comment
>> > > > > > > for
>> > > > > > > > > > PRs.
>> > > > > > > > > > > > > * Address unit tests that take more than 10-20s,
>> > > > streamline
>> > > > > > > them
>> > > > > > > > or
>> > > > > > > > > > > move
>> > > > > > > > > > > > > them to nightly if it can't be done.
>> > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
>> > scripts
>> > > > so
>> > > > > > the
>> > > > > > > > > > > community
>> > > > > > > > > > > > > can contribute.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I think our goal should be turnaround under 30min.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I would also like to touch base with the community
>> > that
>> > > > > some
>> > > > > > > PRs
>> > > > > > > > > are
>> > > > > > > > > > > not
>> > > > > > > > > > > > > being followed up by committers asking for changes.
>> > For
>> > > > > > example
>> > > > > > > > > this
>> > > > > > > > > > PR
>> > > > > > > > > > > > is
>> > > > > > > > > > > > > importtant and is hanging for a long time.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> https://github.com/apache/incubator-mxnet/pull/15051
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > This is another, less important but more trivial to
>> > > > review:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> https://github.com/apache/incubator-mxnet/pull/14940
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I think comitters requesting changes and not
>> > folllowing
>> > > > up
>> > > > > in
>> > > > > > > > > > > reasonable
>> > > > > > > > > > > > > time is not healthy for the project. I suggest
>> > > > configuring
>> > > > > > > github
>> > > > > > > > > > > > > Notifications for a good SNR and following up.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Regards.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Pedro.
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>

Reply via email to