Hello,

Today, CI had some issues and I had to cancel all jobs a few minutes ago.
This was caused by the high load currently being put on our CI system by
the pre-release efforts for this Friday.

It's really unfortunate that we just had outages of three core components
within the last two days - sorry about that! To recap, we had the
following outages (which are unrelated to the parallel refactor of the
Jenkins pipeline):
- (yesterday evening) The Jenkins master ran out of disk space and thus
processed requests at reduced capacity
- (this morning) The Jenkins master got updated, which broke our auto
scaling's upscale capabilities.
- (new, this evening) The Jenkins API was unresponsive: due to the high
number of jobs and a bad API design in the Jenkins REST API, the time
complexity of a simple create or delete request was quadratic, which
resulted in all requests timing out (that was the current outage) and
left our auto scaling unable to interface with the Jenkins master.

I have now made improvements to our REST API calls which reduced the
complexity from O(N^2) to O(1). The reason was an underlying redirect loop
in the Jenkins createNode and deleteNode REST API in combination with
unrolling the entire slave and job graph (which got quite huge during
extensive load) upon every single request. Since we had about 150
registered slaves and 1000 jobs in the queue, the duration for a single
REST API call rose to up to 45 seconds (we execute up to a few hundred
queries per auto scaling loop). This led to our auto scaling timing out.
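
To give an idea of the fix, here is a minimal sketch of the new style of
call, assuming the redirect-skipping approach described above (the URL,
credentials and helper name are illustrative, not our exact code):

    import requests

    JENKINS_URL = "http://jenkins.example.com"  # illustrative master URL
    AUTH = ("user", "api-token")                # illustrative credentials

    def delete_node(name):
        """Deregister a slave without following Jenkins' redirect.

        Jenkins answers doDelete with a redirect whose target unrolls
        the entire slave and job graph. Treating the redirect itself
        as success keeps each call constant-time.
        """
        resp = requests.post(
            "{}/computer/{}/doDelete".format(JENKINS_URL, name),
            auth=AUTH,
            allow_redirects=False,  # do not chase the redirect loop
            timeout=10,
        )
        # 302 means Jenkins accepted the request and redirected
        return resp.status_code in (200, 302)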

Everything should be back to normal now. I'm closely observing the
situation and I'll let you know if I encounter any additional issues.

Again, sorry for any inconvenience caused.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <gavin.max.b...@gmail.com>
wrote:

> Yes, let me add to the kudos, very nice work Marco.
>
>
> "I'm trying real hard to be the shepherd." -Jules Winnfield
>
>
> > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
> > <kell...@amazon.de.INVALID> wrote:
> >
> > Appreciate the big effort in bringing the CI back so quickly. Thanks Marco.
> >
> > On Nov 21, 2018 5:52 AM, Marco de Abreu
> > <marco.g.ab...@googlemail.com.INVALID> wrote:
> > Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
> > that incident.
> >
> > If somebody is interested in the details around the outage:
> >
> > Due to required maintenance (the disk running full), we had to upgrade
> > our Jenkins master because it was running on Ubuntu 17.04 (for an
> > unknown reason, it used to be 16.04) and we needed to install some
> > packages. Since support for Ubuntu 17.04 had ended, all package updates
> > and installations failed because the repositories were taken offline.
> > Due to the unavailable maintenance package and other issues with the
> > installed OpenJDK8 version, we decided to upgrade the Jenkins master to
> > Ubuntu 18.04 LTS in order to get back to a supported version with
> > maintenance tools. During this upgrade, Jenkins was automatically
> > updated by APT as part of the dist-upgrade process.
> >
> > In the latest version of Jenkins, some messages that we depend on for
> > our auto scaling have been changed. To be more specific:
> >> Waiting for next available executor on mxnetlinux-gpu
> > has been changed to
> >> Waiting for next available executor on ‘mxnetlinux-gpu’
> > Notice the quote characters.
> >
> > Unfortunately, Jenkins does not offer a better way than parsing these
> > messages - there's no standardized way to express queue items. Since
> > our parser expected the above message without quote characters, the
> > message was discarded.
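> >
> > To illustrate, a quote-tolerant parser might look roughly like this
> > (the regex and function name are illustrative, not our exact code):
> >
> >     import re
> >
> >     # Jenkins may or may not wrap the label in typographic quotes,
> >     # so accept both variants and capture the bare label.
> >     QUEUE_RE = re.compile(
> >         r"Waiting for next available executor on "
> >         r"[\u2018']?(?P<label>[\w-]+)[\u2019']?$"
> >     )
> >
> >     def parse_queue_reason(message):
> >         """Return the starved label, or None for other queue items."""
> >         match = QUEUE_RE.search(message)
> >         return match.group("label") if match else None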
> >
> > We support various queue reasons (5 of them, to be exact) that indicate
> > resource starvation. If we run very low on capacity, the queue reason is
> > different and we would still have been able to scale up, but most cases
> > would have printed the unsupported message. This resulted in reduced
> > capacity (to be specific, the limit during that time was 1 slave per
> > type).
> >
> > We have now fixed our auto scaling to automatically strip these characters
> > and added that message to our test suite.
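> >
> > For illustration, a regression test covering both message variants
> > could look like this (module and test names are hypothetical):
> >
> >     import unittest
> >
> >     # parse_queue_reason as in the sketch above; module name is made up
> >     from autoscaler import parse_queue_reason
> >
> >     class TestQueueReasonParsing(unittest.TestCase):
> >         def test_label_with_typographic_quotes(self):
> >             # Post-update Jenkins wraps the label in quotes
> >             msg = "Waiting for next available executor on ‘mxnetlinux-gpu’"
> >             self.assertEqual(parse_queue_reason(msg), "mxnetlinux-gpu")
> >
> >         def test_label_without_quotes(self):
> >             # The pre-update message must keep working
> >             msg = "Waiting for next available executor on mxnetlinux-gpu"
> >             self.assertEqual(parse_queue_reason(msg), "mxnetlinux-gpu")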
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.mark...@gmail.com>
> > wrote:
> >
> >> Marco, thanks for your hard work on this. I'm super excited about the
> >> new Jenkins jobs. This is going to be very helpful and improve sanity
> >> for our PRs and ourselves!
> >>
> >> Cheers,
> >> Aaron
> >>
> >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> >> <marco.g.ab...@googlemail.com.invalid> wrote:
> >>
> >>> Hello,
> >>>
> >>> the CI is now back up and running. Auto scaling is working as
> >>> expected and it passed our load tests.
> >>>
> >>> Please excuse the inconvenience caused.
> >>>
> >>> Best regards,
> >>> Marco
> >>>
> >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu
> >>> <marco.g.ab...@googlemail.com> wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I'd like to let you know that our CI was impaired and down for the
> >>>> last few hours. After getting the CI back up, I noticed that our auto
> >>>> scaling was broken by a silent update of Jenkins that affected our
> >>>> upscale detection. Manual scaling is currently not possible, and
> >>>> stopping the scaling won't help either because there are currently no
> >>>> p3 instances available, which means that all jobs will fail
> >>>> nonetheless. In a few hours, the auto scaling will have recycled all
> >>>> slaves through the down-scale mechanism and we will be out of
> >>>> capacity. This will lead to resource starvation and thus timeouts.
> >>>>
> >>>> Your PRs will be properly registered by Jenkins, but please expect the
> >>>> jobs to time out and thus fail your PRs.
> >>>>
> >>>> I will fix the auto scaling as soon as I'm awake again.
> >>>>
> >>>> Sorry for the inconvenience caused.
> >>>>
> >>>> Best regards,
> >>>> Marco
> >>>>
> >>>>
> >>>> P.S. Sorry for the brief email and my lack of further fixes, but it's
> >>>> 5:30AM now and I've been working for 17 hours.
> >>>>
> >>>
> >>
>
