Yes, let me add to the kudos, very nice work Marco. 

"I'm trying real hard to be the shepherd." -Jules Winnfield


> On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen <kell...@amazon.de.INVALID> 
> wrote:
> 
> Appreciate the big effort in bringing the CI back so quickly.  Thanks Marco.
> 
> On Nov 21, 2018 5:52 AM, Marco de Abreu 
> <marco.g.ab...@googlemail.com.INVALID> wrote:
> Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
> that incident.
> 
> If somebody is interested in the details around the outage:
> 
> Due to required maintenance (a disk running full), we had to upgrade our
> Jenkins master because it was running on Ubuntu 17.04 (for an unknown
> reason; it used to be 16.04) and we needed to install some packages. Since
> support for Ubuntu 17.04 had ended, all package updates and installations
> failed because the repositories had been taken offline. Due to the
> unavailable maintenance package and other issues with the installed
> OpenJDK8 version, we decided to upgrade the Jenkins master to Ubuntu 18.04
> LTS in order to get back to a supported version with maintenance tools.
> During this upgrade, Jenkins was automatically updated by APT as part of
> the dist-upgrade process.
> 
> In the latest version of Jenkins, some messages that we depend on for our
> auto scaling have been changed. To be more specific:
>> Waiting for next available executor on mxnetlinux-gpu
> has been changed to
>> Waiting for next available executor on ‘mxnetlinux-gpu’
> Notice the quote characters.
> 
> Unfortunately, Jenkins does not offer a better way than parsing these
> messages - there's no standardized way to express queue items. Since our
> parser expected the above message without quote characters, the new message
> was discarded.
> 
> We support various queue reasons (5 of them to be exact) that indicate
> resource starvation. If we run super low on capacity, the queue reason is
> different and we would still be able to scale up, but most of the cases
> would have printed the unsupported message. This resulted in reduced
> capacity (to be specific, the limit during that time was 1 slave per type).
> 
> We have now fixed our autoscaling to automatically strip these characters
> and added that message to our test suite.
> 
> Best regards,
> Marco
> 
> On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.mark...@gmail.com>
> wrote:
> 
>> Marco, thanks for your hard work on this. I'm super excited about the new
>> Jenkins jobs. This is going to be very helpful and improve sanity for our
>> PRs and ourselves!
>> 
>> Cheers,
>> Aaron
>> 
>> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
>> <marco.g.ab...@googlemail.com.invalid> wrote:
>> 
>>> Hello,
>>> 
>>> the CI is now back up and running. Auto scaling is working as expected and
>>> it passed our load tests.
>>> 
>>> Please excuse the inconvenience caused.
>>> 
>>> Best regards,
>>> Marco
>>> 
>>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
>>> marco.g.ab...@googlemail.com>
>>> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I'd like to let you know that our CI was impaired and down for the last
>>>> few hours. After getting the CI back up, I noticed that our auto scaling
>>>> broke due to a silent update of Jenkins which broke our upscale-detection.
>>>> Manual scaling is currently not possible and stopping the scaling won't
>>>> help either because there are currently no p3 instances available, which
>>>> means that all jobs will fail nonetheless. In a few hours, the auto
>>>> scaling will have recycled all slaves through the down-scale mechanism and
>>>> we will be out of capacity. This will lead to resource starvation and thus
>>>> timeouts.
>>>> 
>>>> Your PRs will be properly registered by Jenkins, but please expect the
>>>> jobs to time out and thus fail your PRs.
>>>> 
>>>> I will fix the auto scaling as soon as I'm awake again.
>>>> 
>>>> Sorry for the inconvenience caused.
>>>> 
>>>> Best regards,
>>>> Marco
>>>> 
>>>> 
>>>> P.S. Sorry for the brief email and my lack of further fixes, but it's
>>>> 5:30AM now and I've been working for 17 hours.
>>>> 
>>> 
>> 
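
For anyone curious what the quote-stripping fix described in Marco's
write-up might look like, here is a minimal sketch. It is not the actual
autoscaling code, which is not shown in this thread: the function name
label_from_queue_reason and the set of quote characters are assumptions
made for illustration, and only the one "Waiting for next available
executor" message quoted above is handled, not the other queue reasons.

```python
# Hypothetical sketch; names and the quote-character set are assumptions.
# Typographic and ASCII quotes that a newer Jenkins may wrap around a label.
_QUOTE_CHARS = "\u2018\u2019'\"`"
_PREFIX = "Waiting for next available executor on "


def label_from_queue_reason(reason):
    """Return the node label named in a queue reason, or None if unsupported."""
    reason = reason.strip()
    if not reason.startswith(_PREFIX):
        # Not the starvation message this sketch knows about.
        return None
    # Drop the prefix, then strip surrounding whitespace and quote characters.
    label = reason[len(_PREFIX):].strip().strip(_QUOTE_CHARS)
    return label or None
```

Both the old and the new form of the message then yield the same label, so
a change in quoting alone would no longer cause the queue item to be
discarded.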
