Thanks Marco and Aaron for your input.

> Can you show by how much the duration will increase?

The average sanity build takes around 10 minutes, while the average
unix-cpu build takes about 2 hours. If we required both sanity and
unix-cpu to pass (running in parallel with each other) before triggering
the rest, the end-to-end pipeline time would increase by roughly the
2-hour unix-cpu duration; gating on sanity alone would add only about
10 minutes.
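
To make the arithmetic explicit, here's a quick sketch in Python. The
10 min / 2 h figures are the averages above; that the longest of the
remaining jobs also runs about 2 hours is my assumption:

    # Rough end-to-end pipeline durations, in minutes.
    SANITY = 10          # average sanity build
    UNIX_CPU = 120       # average unix-cpu build
    LONGEST_OTHER = 120  # assumed longest remaining job

    current = max(SANITY, UNIX_CPU, LONGEST_OTHER)       # all parallel: ~120
    both_gates = max(SANITY, UNIX_CPU) + LONGEST_OTHER   # ~240, i.e. +2 hours
    sanity_gate = SANITY + max(UNIX_CPU, LONGEST_OTHER)  # ~130, i.e. +10 min

    print(current, both_gates, sanity_gate)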

I took a look at the CloudWatch metrics we're saving for Jenkins jobs.
Below is the failure rate per job, based on builds triggered by PRs in the
past year. As you can see, the sanity failure rate is still fairly high,
so gating on it first would save a lot of unneeded downstream builds.

Job            Successful   Failed   Failure Rate
sanity               6900     2729         28.34%
unix-cpu             4268     4786         52.86%
unix-gpu             3686     5637         60.46%
centos-cpu           6777     2809         29.30%
centos-gpu           6318     3350         34.65%
clang                7879     1588         16.77%
edge                 7654     1933         20.16%
miscellaneous        8090     1510         15.73%
website              7226     2179         23.17%
windows-cpu          6084     3621         37.31%
windows-gpu          5191     4721         47.63%
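
For reference, numbers like these can be pulled from CloudWatch with a
boto3 script along the lines below. The namespace, metric names, and the
"Job" dimension are placeholders; the exact names depend on what Marco's
exporter emits:

    # Sketch: sum per-job success/failure counts from CloudWatch over the
    # past year. Namespace, metric names and the "Job" dimension are
    # placeholders -- substitute whatever the Jenkins exporter uses.
    from datetime import datetime, timedelta
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.utcnow()
    start = end - timedelta(days=365)

    def job_count(job, metric):
        resp = cloudwatch.get_metric_statistics(
            Namespace="JenkinsCI",                       # placeholder
            MetricName=metric,                           # placeholder
            Dimensions=[{"Name": "Job", "Value": job}],
            StartTime=start, EndTime=end,
            Period=86400, Statistics=["Sum"],
        )
        return int(sum(p["Sum"] for p in resp["Datapoints"]))

    for job in ("sanity", "unix-cpu", "unix-gpu"):
        ok = job_count(job, "BuildsSucceeded")
        bad = job_count(job, "BuildsFailed")
        print(f"{job}: {bad / (ok + bad):.2%} failure rate")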

We can start by requiring only the sanity job to pass before triggering
the rest, then collect data and decide from there whether gating on
unix-cpu as well makes sense. Any objections to this approach?

Thanks.
Joe


On Wed, Mar 25, 2020 at 9:35 AM Marco de Abreu <marco.g.ab...@gmail.com>
wrote:

> Back then I created a system which exports all Jenkins results to
> CloudWatch. It does not include individual test results, but rather stages
> and jobs. The data for the sanity check should be available there.
>
> Something I'd also be curious about is the percentage of failures within
> a single run. That is, if a commit failed, were multiple jobs failing
> (indicating an error in the code) or only one or two (indicating
> flakiness)? This should give us a proper understanding of how unnecessary
> these runs really are.
>
> -Marco
>
> Aaron Markham <aaron.s.mark...@gmail.com> wrote on Wed., Mar. 25, 2020,
> 16:53:
>
> > +1 for sanity check - that's fast.
> > -1 for unix-cpu - that's slow and can just hang.
> >
> > So my suggestion would be to break the data apart - what's the failure
> > rate on the sanity check versus on unix-cpu? Actually, can we get a
> > table of all of the tests with this data?!
> > If the sanity check fails... let's say 20% of the time, but only takes
> > a couple of minutes, then ya, let's stack it and do that one first.
> >
> > I think unix-cpu needs to be broken apart. It's too complex and fails
> > in multiple ways. Isolate the brittle parts. Then we can
> > restart/disable those as needed, while all of the other parts pass and
> > don't have to be rerun.
> >
> > On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu <marco.g.ab...@gmail.com>
> > wrote:
> > >
> > > We had this structure in the past and the community was bothered by CI
> > > taking more time, thus we moved to the current model with everything
> > > parallelized. We'd basically revert that then.
> > >
> > > Can you show by how much the duration will increase?
> > >
> > > Also, we have zero test parallelisation, that is, we are running one
> > > test at a time on 72-core machines (albeit across multiple workers).
> > > Wouldn't it be far more efficient to add parallelisation and thus
> > > heavily reduce the time spent on the tasks, instead of staggering?
> > >
> > > I'm concerned that these cost-saving measures are being paid for in the
> > > form of a worse user experience. I see big potential to save costs by
> > > increasing efficiency while actually improving the user experience,
> > > since CI would be faster.
> > >
> > > -Marco
> > >
> > > Joe Evans <joseph.ev...@gmail.com> wrote on Wed., Mar. 25, 2020,
> > > 04:58:
> > >
> > > > Hi,
> > > >
> > > >
> > > > First, I just wanted to introduce myself to the MXNet community. I’m
> > > > Joe and will be working with Chai and the AWS team to improve some
> > > > issues around MXNet CI. One of our goals is to reduce the costs
> > > > associated with running MXNet CI. The task I’m working on now is this
> > > > issue:
> > > >
> > > >
> > > > https://github.com/apache/incubator-mxnet/issues/17802
> > > >
> > > >
> > > > Proposal: Staggered Jenkins CI pipeline
> > > >
> > > >
> > > > Based on data collected from Jenkins, around 55% of the time when the
> > > > mxnet-validation CI build is triggered by a PR, either the sanity or
> > > > unix-cpu build fails. When either of these builds fails, it doesn’t
> > > > make sense to run the rest of the pipelines and consume all those
> > > > resources when we’ve already identified a build or unit test failure.
> > > >
> > > >
> > > > We are proposing to change the MXNet Jenkins CI pipeline to require
> > > > the *sanity* and *unix-cpu* builds to complete and pass their tests
> > > > before starting the other build pipelines (centos-cpu/gpu, unix-gpu,
> > > > windows-cpu/gpu, etc.). Once these builds complete successfully, the
> > > > remaining build pipelines will be triggered and run in parallel (as
> > > > they currently do). The purpose of this change is to identify faulty
> > > > code or compatibility issues early and prevent further execution of
> > > > CI builds. This will increase the time required to test a PR, but
> > > > will prevent unnecessary builds from running.
> > > >
> > > >
> > > > Does anyone have any concerns with this change or suggestions?
> > > >
> > > >
> > > > Thanks.
> > > >
> > > > Joe Evans
> > > >
> > > > joseph.ev...@gmail.com
> > > >
> >
>
