Yes, let me add to the kudos, very nice work Marco.
"I'm trying real hard to be the shepherd." -Jules Winnfield

> On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen <kell...@amazon.de.INVALID> wrote:
>
> Appreciate the big effort in bringing the CI back so quickly. Thanks Marco.
>
> On Nov 21, 2018 5:52 AM, Marco de Abreu <marco.g.ab...@googlemail.com.INVALID> wrote:
> Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
> that incident.
>
> If somebody is interested in the details around the outage:
>
> Due to required maintenance (disk running full), we had to upgrade our
> Jenkins master because it was running on Ubuntu 17.04 (for an unknown
> reason, it used to be 16.04) and we needed to install some packages. Since
> the support for Ubuntu 17.04 was stopped, this caused all package
> updates and installations to fail because the repositories were taken
> offline. Due to the unavailable maintenance package and other issues with
> the installed OpenJDK8 version, we made the decision to upgrade the Jenkins
> master to Ubuntu 18.04 LTS in order to get back to a supported version with
> maintenance tools. During this upgrade, Jenkins was automatically updated
> by APT as part of the dist-upgrade process.
>
> In the latest version of Jenkins, some labels have been changed which we
> depend on for our auto scaling. To be more specific:
>> Waiting for next available executor on mxnetlinux-gpu
> has been changed to
>> Waiting for next available executor on ‘mxnetlinux-gpu’
> Notice the quote characters.
>
> Jenkins does not offer a better way than to parse these messages
> unfortunately - there's no standardized way to express queue items. Since
> our parser expected the above message without quote signs, this message was
> discarded.
>
> We support various queue reasons (5 of them to be exact) that indicate
> resource starvation. If we run super low on capacity, the queue reason is
> different and we would still be able to scale up, but most of the cases
> would have printed the unsupported message. This resulted in reduced
> capacity (to be specific, the limit during that time was 1 slave per type).
>
> We have now fixed our autoscaling to automatically strip these characters
> and added that message to our test suite.
>
> Best regards,
> Marco
>
> On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.mark...@gmail.com>
> wrote:
>
>> Marco, thanks for your hard work on this. I'm super excited about the new
>> Jenkins jobs. This is going to be very helpful and improve sanity for our
>> PRs and ourselves!
>>
>> Cheers,
>> Aaron
>>
>> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
>> <marco.g.ab...@googlemail.com.invalid wrote:
>>
>>> Hello,
>>>
>>> the CI is now back up and running. Auto scaling is working as expected
>>> and it passed our load tests.
>>>
>>> Please excuse the caused inconveniences.
>>>
>>> Best regards,
>>> Marco
>>>
>>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu
>>> <marco.g.ab...@googlemail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'd like to let you know that our CI was impaired and down for the last
>>>> few hours. After getting the CI back up, I noticed that our auto scaling
>>>> broke due to a silent update of Jenkins which broke our upscale-detection.
>>>> Manual scaling is currently not possible and stopping the scaling won't
>>>> help either because there are currently no p3 instances available, which
>>>> means that all jobs will fail nonetheless. In a few hours, the auto
>>>> scaling will have recycled all slaves through the down-scale mechanism and
>>>> we will be out of capacity. This will lead to resource starvation and thus
>>>> timeouts.
>>>>
>>>> Your PRs will be properly registered by Jenkins, but please expect the
>>>> jobs to time out and thus fail your PRs.
>>>>
>>>> I will fix the auto scaling as soon as I'm awake again.
>>>>
>>>> Sorry for the caused inconveniences.
>>>>
>>>> Best regards,
>>>> Marco
>>>>
>>>>
>>>> P.S. Sorry for the brief email and my lack of further fixes, but it's
>>>> 5:30AM now and I've been working for 17 hours.
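
For readers wondering what the fix described in the thread (stripping the quote characters from the queue message before matching it) might look like, here is a minimal sketch in Python. It is not the actual MXNet autoscaling code: the function name `label_from_queue_reason`, the regular expression, and the set of quote characters are assumptions made for illustration. Only the two message strings come from the thread.

```python
import re
from typing import Optional

# Assumed set of quote characters that may wrap the label:
# typographic single quotes plus the ASCII single and double quote.
_QUOTE_CHARS = "\u2018\u2019'\""

# Hypothetical pattern for the one queue reason quoted in the thread.
_WAITING_RE = re.compile(
    r"^Waiting for next available executor on (?P<label>.+)$"
)


def label_from_queue_reason(reason: str) -> Optional[str]:
    """Return the node label named in a queue reason, or None if unrecognized."""
    match = _WAITING_RE.match(reason.strip())
    if match is None:
        return None
    # Strip surrounding quote characters so both the old and the new
    # Jenkins wording yield the same label.
    label = match.group("label").strip().strip(_QUOTE_CHARS)
    return label or None


old_style = "Waiting for next available executor on mxnetlinux-gpu"
new_style = "Waiting for next available executor on \u2018mxnetlinux-gpu\u2019"
print(label_from_queue_reason(old_style))
print(label_from_queue_reason(new_style))
```

Adding both message variants to a test suite, as the thread mentions, would catch a future wording change of the same kind before it silently disables upscaling.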