Hi Marco - I suggest retriggering the PRs, if needed in stages:
- pr-awaiting-merge
- pr-awaiting-review
That would cover 78 PRs. In any case, I would exclude pr-work-in-progress.
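For reference, the per-label counts are easy to pull from the GitHub search API. A minimal sketch (unauthenticated, so rate-limited; the label names are the ones above):

```python
# Count open MXNet PRs per CI label via the GitHub search API.
import requests

REPO = "apache/incubator-mxnet"
LABELS = ["pr-awaiting-merge", "pr-awaiting-review", "pr-work-in-progress"]

for label in LABELS:
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"repo:{REPO} is:pr is:open label:{label}"},
        timeout=10,
    )
    resp.raise_for_status()
    print(f"{label}: {resp.json()['total_count']} open PRs")
```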
Steffen

On Sat, Nov 24, 2018 at 9:11 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:

Hey Marco, I'm still having quite a few issues passing PRs. Would you be able to at least test a handful of PRs and make sure they pass/fail tests as you expect?

On Sat, Nov 24, 2018, 7:01 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:

Hello Steffen,

thank you for bringing up these PRs.

I had to abort the builds during the outage, which means the jobs didn't finish and not even the status propagation could complete (hence they show pending instead of failed or aborted).

Recently, we merged a PR that adds utility slaves. These ensure that status updates are always posted, no matter whether the main queue hangs or not. Statuses will then be properly reflected and there should be no hanging pending runs.

I could retrigger all PRs to kick off another round of validation, but this would result in 240 jobs (2 main pipelines times 120 open PRs). Since we are currently in the pre-release stage, I wanted to avoid putting the system under such heavy load.

Instead, I'd kindly like to ask the PR creators to make a new commit to trigger the pipelines. In order to merge a PR, only PR-merge has to pass, and I tried to retrigger all PRs that were aborted during the outage. It's possible that I missed a few.

Since it's still the weekend and there's not much going on, I can use the time to trigger all PRs. Please advise whether you think I should move forward (I expect the CI to finish all PRs within 6-10 hours) or whether it's fine to ask people to retrigger themselves.

Please excuse the inconvenience.

Best regards,
Marco

On Sun, Nov 25, 2018 at 3:48 AM Steffen Rochel <steffenroc...@gmail.com> wrote:

Thanks Marco for the updates and for resolving the issues.
However, I do see a number of PRs waiting to be merged with inconsistent PR validation status checks. E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9 pending checks being queued, but when you look at the details, the checks have either passed or failed (centos-cpu, edge, unix-cpu, windows-cpu and windows-gpu failed; the required pr-merge, which includes the edge and gpu tests, passed). The same applies to other PRs with the label pr-awaiting-merge (https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge).
Please advise on a resolution.

Regards,
Steffen

On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:

Thanks everybody, I really appreciate it!

Today was a good day: there were no incidents and everything appears to be stable. In the meantime, I did a deep dive on why we had such a significant performance decrease in our compilation jobs - which then clogged up the queue and resulted in 1000 jobs waiting to be scheduled.

The reason was the way we use ccache to speed up our compilation jobs. Usually, this yields a huge performance improvement (CPU openblas, for example, goes from 30 minutes down to ~3 minutes, ARMv7 from 30 minutes down to ~1.5 minutes, etc.). Unfortunately, in this case ccache was our limiting factor. Here's some background on how we operate our cache:

We use EFS to have a distributed ccache shared between all of our unrestricted-prod-slaves.
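Conceptually, the wiring is as simple as pointing every build at a ccache directory on the shared mount. A minimal sketch (paths and sizes are illustrative, not our exact production values):

```python
# Run a build with ccache backed by a shared EFS mount.
# /efs/ccache and the 100G cap are illustrative values.
import os
import subprocess

env = dict(os.environ)
env["CCACHE_DIR"] = "/efs/ccache"   # shared cache on the EFS mount
env["CCACHE_MAXSIZE"] = "100G"      # cap for the shared cache
env["CC"] = "ccache gcc"            # route compiler calls through ccache
env["CXX"] = "ccache g++"

subprocess.run(["make", "-j", str(os.cpu_count() or 2)], env=env, check=True)
```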
EFS is classified for almost unlimited scalability (being consumed by thousands of instances in parallel [1]), with a theoretical throughput of over 10 Gbps. One thing I didn't know when I designed this approach was how that throughput is granted. Similar to T2 CPU credits, EFS uses BurstCredits to grant you higher throughput (the default is 50 MiB/s) [2]. Due to the high load, we consumed all of our credits - here's a very interesting graph: [3].

To avoid similar incidents in the future, I have taken the following actions:
1. I switched EFS from burst mode to provisioned throughput at 300 MB/s (in the graph at [3] you can see how our IO immediately increases - and thus our CI gets faster - as soon as I added provisioned throughput).
2. I created internal follow-up tickets to add monitoring and automated actions.

First, we should be notified when we are running low on credits so we can kick off an investigation. Second (nice to have), we could have a Lambda function that listens for that event and automatically switches the EFS volume from burst mode to provisioned throughput during high-load times; the required throughput could be retrieved via CloudWatch and then multiplied by a factor. EFS allows you to downgrade the throughput mode 24 hours after the last change (to reduce capacity once the load is over) and always allows you to increase the provisioned capacity (if the load goes even higher). I've been looking for a pre-made CloudFormation template to facilitate that, but so far I haven't been able to find one.
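A rough boto3 sketch of what I have in mind (file system ID, SNS topic and threshold are placeholders):

```python
# 1) Alarm when EFS burst credits run low; 2) a Lambda handler that
# switches the file system to provisioned throughput when it fires.
import boto3

FILE_SYSTEM_ID = "fs-12345678"       # placeholder
ALARM_TOPIC_ARN = "arn:aws:sns:..."  # placeholder SNS topic

boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="efs-burst-credits-low",
    Namespace="AWS/EFS",
    MetricName="BurstCreditBalance",
    Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
    Statistic="Minimum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1_000_000_000_000,     # remaining credits in bytes
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[ALARM_TOPIC_ARN],
)

def on_low_credits(event, context):
    """Lambda handler: upgrade the volume to provisioned throughput."""
    boto3.client("efs").update_file_system(
        FileSystemId=FILE_SYSTEM_ID,
        ThroughputMode="provisioned",
        ProvisionedThroughputInMibps=300.0,
    )
```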
I'm now running additional load tests on our test CI environment to detect other potential bottlenecks.

Thanks a lot for your support!

Best regards,
Marco

[1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html
[2]: https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes
[3]: https://i.imgur.com/nboQLOn.png

On Thu, Nov 22, 2018 at 1:40 AM Qing Lan <lanking...@live.com> wrote:

Thanks for your effort and for helping to make CI a better place!

Qing

On 11/21/18, 4:38 PM, "Lin Yuan" <apefor...@gmail.com> wrote:

Thanks for your efforts, Marco!

On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <anirudh2...@gmail.com> wrote:

Thanks for the quick response and mitigation!

On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:

Hello,

today, CI had some issues and I had to cancel all jobs a few minutes ago. This was basically caused by the high load currently being put on our CI system due to the pre-release efforts for this Friday.

It's really unfortunate that we had outages of three core components within the last two days - sorry about that! To recap, we had the following outages (which are unrelated to the parallel refactor of the Jenkins pipeline):
- (yesterday evening) The Jenkins master ran out of disk space and thus processed requests at reduced capacity.
- (this morning) The Jenkins master got updated, which broke our autoscaling's upscaling capabilities.
- (new, this evening) The Jenkins API was unresponsive: due to the high number of jobs and a bad design in the Jenkins REST API, the time complexity of a simple create or delete request was quadratic, which resulted in all requests timing out (that was the current outage). This left our auto scaling unable to interface with the Jenkins master.

I have now made improvements to our REST API calls which reduced the complexity from O(N^2) to O(1). The cause was an underlying redirect loop in the Jenkins createNode and deleteNode REST APIs, combined with Jenkins unrolling the entire slave and job graph (which got quite huge under the heavy load) on every single request. Since we had about 150 registered slaves and 1000 jobs in the queue, the duration of a single REST API call rose to up to 45 seconds, and we execute up to a few hundred queries per auto scaling loop. This led to our auto scaling timing out.
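Simplified, the fix boils down to not following the redirect that triggers the expensive rendering. A sketch along these lines (host, credentials and node name are illustrative):

```python
# Create a Jenkins node without following the redirect back to the
# computer-list page, which is what unrolled the whole slave/job graph.
import requests

JENKINS_URL = "https://jenkins.example.com"  # illustrative host

session = requests.Session()
session.auth = ("ci-bot", "<api-token>")     # placeholder credentials

resp = session.post(
    f"{JENKINS_URL}/computer/doCreateItem",
    params={"name": "mxnetlinux-gpu-042", "type": "hudson.slaves.DumbSlave"},
    allow_redirects=False,  # the key change: skip the redirect target
    timeout=30,
)
resp.raise_for_status()
```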
Everything should be back to normal now. I'm closely observing the situation and I'll let you know if I encounter any additional issues.

Again, sorry for any inconvenience caused.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <gavin.max.b...@gmail.com> wrote:

Yes, let me add to the kudos, very nice work Marco.

"I'm trying real hard to be the shepherd." -Jules Winnfield

On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen <kell...@amazon.de.INVALID> wrote:

Appreciate the big effort in bringing the CI back so quickly. Thanks Marco.

On Nov 21, 2018 5:52 AM, Marco de Abreu <marco.g.ab...@googlemail.com.INVALID> wrote:

Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to that incident.

If somebody is interested in the details around the outage:

Due to required maintenance (a disk running full), we had to upgrade our Jenkins master because it was running on Ubuntu 17.04 (for an unknown reason; it used to be 16.04) and we needed to install some packages. Since support for Ubuntu 17.04 had been stopped, all package updates and installations failed because the repositories were taken offline. Due to the unavailable maintenance packages and other issues with the installed OpenJDK 8 version, we made the decision to upgrade the Jenkins master to Ubuntu 18.04 LTS in order to get back to a supported version with maintenance tools. During this upgrade, Jenkins was automatically updated by APT as part of the dist-upgrade process.

In the latest version of Jenkins, some messages that we depend on for our auto scaling have changed. To be more specific:
> Waiting for next available executor on mxnetlinux-gpu
has been changed to
> Waiting for next available executor on ‘mxnetlinux-gpu’
Notice the quote characters.

Unfortunately, Jenkins does not offer a better way than parsing these messages - there is no standardized way to express queue items. Since our parser expected the message without quote characters, the new message was discarded.

We support various queue reasons (5 of them, to be exact) that indicate resource starvation. If we run super low on capacity, the queue reason is different and we would still be able to scale up, but most cases would have printed the unsupported message. This resulted in reduced capacity (to be specific, the limit during that time was 1 slave per type).

We have now fixed our autoscaling to automatically strip these characters, and added that message to our test suite.
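In essence, the matcher now tolerates the quotes. A simplified sketch of the idea:

```python
# Match Jenkins queue reasons with or without the typographic quotes
# that newer Jenkins versions put around the label.
import re

QUEUE_REASON = re.compile(
    r"Waiting for next available executor on [‘'\"]?(?P<label>[\w-]+)[’'\"]?"
)

for message in (
    "Waiting for next available executor on mxnetlinux-gpu",
    "Waiting for next available executor on ‘mxnetlinux-gpu’",
):
    match = QUEUE_REASON.search(message)
    assert match and match.group("label") == "mxnetlinux-gpu"
```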
Best regards,
Marco

On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.mark...@gmail.com> wrote:

Marco, thanks for your hard work on this. I'm super excited about the new Jenkins jobs. This is going to be very helpful and improve sanity for our PRs and ourselves!

Cheers,
Aaron

On Wed, Nov 21, 2018, 05:37 Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:

Hello,

the CI is now back up and running. Auto scaling is working as expected and it passed our load tests.

Please excuse the inconvenience.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <marco.g.ab...@googlemail.com> wrote:

Hello,

I'd like to let you know that our CI was impaired and down for the last few hours. After getting the CI back up, I noticed that our auto scaling broke due to a silent update of Jenkins which broke our upscale detection. Manual scaling is currently not possible, and stopping the scaling won't help either because there are currently no p3 instances available, which means that all jobs will fail nonetheless. In a few hours, the auto scaling will have recycled all slaves through the down-scale mechanism and we will be out of capacity. This will lead to resource starvation and thus timeouts.

Your PRs will be properly registered by Jenkins, but please expect the jobs to time out and thus fail your PRs.

I will fix the auto scaling as soon as I'm awake again.

Sorry for the inconvenience caused.

Best regards,
Marco

P.S. Sorry for the brief email and my lack of further fixes, but it's 5:30 AM now and I've been working for 17 hours.