Hi Marco - I suggest retriggering the PRs, if needed in stages:
- pr-awaiting-merge
- pr-awaiting-review
That would cover 78 PRs. In any case, I would exclude pr-work-in-progress.
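For reference, the per-label counts are easy to pull from the GitHub search API. A minimal sketch (unauthenticated, so rate-limited; the label names are the ones above):

```python
# Count open MXNet PRs per CI label via the GitHub search API.
import requests

REPO = "apache/incubator-mxnet"
LABELS = ["pr-awaiting-merge", "pr-awaiting-review", "pr-work-in-progress"]

for label in LABELS:
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"repo:{REPO} is:pr is:open label:{label}"},
        timeout=10,
    )
    resp.raise_for_status()
    print(f"{label}: {resp.json()['total_count']} open PRs")
```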
Steffen

On Sat, Nov 24, 2018 at 9:11 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:

Hey Marco, I'm still having quite a few issues passing PRs. Would you be able to at least test a handful of PRs and make sure they pass/fail tests as you expect?

On Sat, Nov 24, 2018, 7:01 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:

Hello Steffen,

thank you for bringing up these PRs.

I had to abort the builds during the outage, which means the jobs didn't finish and not even the status propagation could complete (hence they show pending instead of failed or aborted).

Recently, we merged a PR that adds utility slaves. These ensure that status updates are always posted, no matter whether the main queue hangs or not. Statuses will then be properly reflected and there should be no hanging pending runs.

I could retrigger all PRs to kick off another round of validation, but this would result in 240 jobs (2 main pipelines times 120 open PRs). Since we are currently in the pre-release stage, I wanted to avoid putting the system under such heavy load.

Instead, I'd kindly like to ask the PR creators to make a new commit to trigger the pipelines. In order to merge a PR, only PR-merge has to pass, and I tried to retrigger all PRs that were aborted during the outage. It's possible that I missed a few.

Since it's still the weekend and there's not much going on, I can use the time to trigger all PRs. Please advise whether you think I should move forward (I expect the CI to finish all PRs within 6-10 hours) or whether it's fine to ask people to retrigger themselves.

Please excuse the inconvenience.

Best regards,
Marco

On Sun, Nov 25, 2018 at 3:48 AM Steffen Rochel <steffenroc...@gmail.com> wrote:

Thanks Marco for the updates and for resolving the issues.
However, I do see a number of PRs waiting to be merged with inconsistent PR validation status checks. E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9 pending checks being queued, but when you look at the details, the checks have either passed or failed (centos-cpu, edge, unix-cpu, windows-cpu and windows-gpu failed; the required pr-merge, which includes the edge and gpu tests, passed). The same applies to other PRs with the label pr-awaiting-merge (https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge).
Please advise on a resolution.

Regards,
Steffen

On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:

Thanks everybody, I really appreciate it!

Today was a good day: there were no incidents and everything appears to be stable. In the meantime, I did a deep dive on why we had such a significant performance decrease in our compilation jobs - which then clogged up the queue and resulted in 1000 jobs waiting to be scheduled.

The reason was the way we use ccache to speed up our compilation jobs. Usually, this yields a huge performance improvement (CPU openblas, for example, goes from 30 minutes down to ~3 minutes, ARMv7 from 30 minutes down to ~1.5 minutes, etc.). Unfortunately, in this case ccache was our limiting factor. Here's some background on how we operate our cache:

We use EFS to have a distributed ccache shared between all of our unrestricted-prod-slaves.
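Conceptually, the wiring is as simple as pointing every build at a ccache directory on the shared mount. A minimal sketch (paths and sizes are illustrative, not our exact production values):

```python
# Run a build with ccache backed by a shared EFS mount.
# /efs/ccache and the 100G cap are illustrative values.
import os
import subprocess

env = dict(os.environ)
env["CCACHE_DIR"] = "/efs/ccache"   # shared cache on the EFS mount
env["CCACHE_MAXSIZE"] = "100G"      # cap for the shared cache
env["CC"] = "ccache gcc"            # route compiler calls through ccache
env["CXX"] = "ccache g++"

subprocess.run(["make", "-j", str(os.cpu_count() or 2)], env=env, check=True)
```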
EFS is classified for almost unlimited scalability (being consumed by thousands of instances in parallel [1]), with a theoretical throughput of over 10 Gbps. One thing I didn't know when I designed this approach was how that throughput is granted. Similar to T2 CPU credits, EFS uses BurstCredits to grant you higher throughput (the default is 50 MiB/s) [2]. Due to the high load, we consumed all of our credits - here's a very interesting graph: [3].

To avoid similar incidents in the future, I have taken the following actions:
1. I switched EFS from burst mode to provisioned throughput at 300 MB/s (in the graph at [3] you can see how our IO immediately increases - and thus our CI gets faster - as soon as I added provisioned throughput).
2. I created internal follow-up tickets to add monitoring and automated actions.

First, we should be notified when we are running low on credits so we can kick off an investigation. Second (nice to have), we could have a Lambda function that listens for that event and automatically switches the EFS volume from burst mode to provisioned throughput during high-load times; the required throughput could be retrieved via CloudWatch and then multiplied by a factor. EFS allows you to downgrade the throughput mode 24 hours after the last change (to reduce capacity once the load is over) and always allows you to increase the provisioned capacity (if the load goes even higher). I've been looking for a pre-made CloudFormation template to facilitate that, but so far I haven't been able to find one.
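A rough boto3 sketch of what I have in mind (file system ID, SNS topic and threshold are placeholders):

```python
# 1) Alarm when EFS burst credits run low; 2) a Lambda handler that
# switches the file system to provisioned throughput when it fires.
import boto3

FILE_SYSTEM_ID = "fs-12345678"       # placeholder
ALARM_TOPIC_ARN = "arn:aws:sns:..."  # placeholder SNS topic

boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="efs-burst-credits-low",
    Namespace="AWS/EFS",
    MetricName="BurstCreditBalance",
    Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
    Statistic="Minimum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1_000_000_000_000,     # remaining credits in bytes
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[ALARM_TOPIC_ARN],
)

def on_low_credits(event, context):
    """Lambda handler: upgrade the volume to provisioned throughput."""
    boto3.client("efs").update_file_system(
        FileSystemId=FILE_SYSTEM_ID,
        ThroughputMode="provisioned",
        ProvisionedThroughputInMibps=300.0,
    )
```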
I'm now running additional load tests on our test CI environment to detect other potential bottlenecks.

Thanks a lot for your support!

Best regards,
Marco

[1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html
[2]: https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes
[3]: https://i.imgur.com/nboQLOn.png

On Thu, Nov 22, 2018 at 1:40 AM Qing Lan <lanking...@live.com> wrote:

Thanks for your effort and for helping to make CI a better place!

Qing

On 11/21/18, 4:38 PM, "Lin Yuan" <apefor...@gmail.com> wrote:

Thanks for your efforts, Marco!

On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <anirudh2...@gmail.com> wrote:

Thanks for the quick response and mitigation!

On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:

Hello,

today, CI had some issues and I had to cancel all jobs a few minutes ago. This was basically caused by the high load currently being put on our CI system due to the pre-release efforts for this Friday.

It's really unfortunate that we had outages of three core components within the last two days - sorry about that! To recap, we had the following outages (which are unrelated to the parallel refactor of the Jenkins pipeline):
- (yesterday evening) The Jenkins master ran out of disk space and thus processed requests at reduced capacity.
- (this morning) The Jenkins master got updated, which broke our autoscaling's upscaling capabilities.
- (new, this evening) The Jenkins API was unresponsive: due to the high number of jobs and a bad design in the Jenkins REST API, the time complexity of a simple create or delete request was quadratic, which resulted in all requests timing out (that was the current outage). This left our auto scaling unable to interface with the Jenkins master.

I have now made improvements to our REST API calls which reduced the complexity from O(N^2) to O(1). The cause was an underlying redirect loop in the Jenkins createNode and deleteNode REST APIs, combined with Jenkins unrolling the entire slave and job graph (which got quite huge under the heavy load) on every single request. Since we had about 150 registered slaves and 1000 jobs in the queue, the duration of a single REST API call rose to up to 45 seconds, and we execute up to a few hundred queries per auto scaling loop. This led to our auto scaling timing out.
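Simplified, the fix boils down to not following the redirect that triggers the expensive rendering. A sketch along these lines (host, credentials and node name are illustrative):

```python
# Create a Jenkins node without following the redirect back to the
# computer-list page, which is what unrolled the whole slave/job graph.
import requests

JENKINS_URL = "https://jenkins.example.com"  # illustrative host

session = requests.Session()
session.auth = ("ci-bot", "<api-token>")     # placeholder credentials

resp = session.post(
    f"{JENKINS_URL}/computer/doCreateItem",
    params={"name": "mxnetlinux-gpu-042", "type": "hudson.slaves.DumbSlave"},
    allow_redirects=False,  # the key change: skip the redirect target
    timeout=30,
)
resp.raise_for_status()
```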
Everything should be back to normal now. I'm closely observing the situation and I'll let you know if I encounter any additional issues.

Again, sorry for any inconvenience caused.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <gavin.max.b...@gmail.com> wrote:

Yes, let me add to the kudos, very nice work Marco.

"I'm trying real hard to be the shepherd." -Jules Winnfield

On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen <kell...@amazon.de.INVALID> wrote:

Appreciate the big effort in bringing the CI back so quickly. Thanks Marco.

On Nov 21, 2018 5:52 AM, Marco de Abreu <marco.g.ab...@googlemail.com.INVALID> wrote:

Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to that incident.

If somebody is interested in the details around the outage:

Due to required maintenance (a disk running full), we had to upgrade our Jenkins master because it was running on Ubuntu 17.04 (for an unknown reason; it used to be 16.04) and we needed to install some packages. Since support for Ubuntu 17.04 had been stopped, all package updates and installations failed because the repositories were taken offline. Due to the unavailable maintenance packages and other issues with the installed OpenJDK 8 version, we made the decision to upgrade the Jenkins master to Ubuntu 18.04 LTS in order to get back to a supported version with maintenance tools. During this upgrade, Jenkins was automatically updated by APT as part of the dist-upgrade process.

In the latest version of Jenkins, some messages that we depend on for our auto scaling have changed. To be more specific:
> Waiting for next available executor on mxnetlinux-gpu
has been changed to
> Waiting for next available executor on ‘mxnetlinux-gpu’
Notice the quote characters.

Unfortunately, Jenkins does not offer a better way than parsing these messages - there is no standardized way to express queue items. Since our parser expected the message without quote characters, the new message was discarded.

We support various queue reasons (5 of them, to be exact) that indicate resource starvation. If we run super low on capacity, the queue reason is different and we would still be able to scale up, but most cases would have printed the unsupported message. This resulted in reduced capacity (to be specific, the limit during that time was 1 slave per type).

We have now fixed our autoscaling to automatically strip these characters, and added that message to our test suite.
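In essence, the matcher now tolerates the quotes. A simplified sketch of the idea:

```python
# Match Jenkins queue reasons with or without the typographic quotes
# that newer Jenkins versions put around the label.
import re

QUEUE_REASON = re.compile(
    r"Waiting for next available executor on [‘'\"]?(?P<label>[\w-]+)[’'\"]?"
)

for message in (
    "Waiting for next available executor on mxnetlinux-gpu",
    "Waiting for next available executor on ‘mxnetlinux-gpu’",
):
    match = QUEUE_REASON.search(message)
    assert match and match.group("label") == "mxnetlinux-gpu"
```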
Best regards,
Marco

On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.mark...@gmail.com> wrote:

Marco, thanks for your hard work on this. I'm super excited about the new Jenkins jobs. This is going to be very helpful and improve sanity for our PRs and ourselves!

Cheers,
Aaron

On Wed, Nov 21, 2018, 05:37 Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:

Hello,

the CI is now back up and running. Auto scaling is working as expected and it passed our load tests.

Please excuse the inconvenience.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <marco.g.ab...@googlemail.com> wrote:

Hello,

I'd like to let you know that our CI was impaired and down for the last few hours. After getting the CI back up, I noticed that our auto scaling broke due to a silent update of Jenkins which broke our upscale detection. Manual scaling is currently not possible, and stopping the scaling won't help either because there are currently no p3 instances available, which means that all jobs will fail nonetheless. In a few hours, the auto scaling will have recycled all slaves through the down-scale mechanism and we will be out of capacity. This will lead to resource starvation and thus timeouts.

Your PRs will be properly registered by Jenkins, but please expect the jobs to time out and thus fail your PRs.

I will fix the auto scaling as soon as I'm awake again.

Sorry for the inconvenience caused.

Best regards,
Marco

P.S. Sorry for the brief email and my lack of further fixes, but it's 5:30 AM now and I've been working for 17 hours.