Re: CI impaired
Thanks for the update, Marco, and for all the hard work put into the CI!

On Sat, Dec 1, 2018 at 1:21 PM Marco de Abreu wrote:
> [...]
Re: CI impaired
Hello everyone,

The move has just been completed, and the old big pipeline as well as its corresponding job have been disabled. From now on, you will see the detailed status messages below your PRs.

Some people wanted to make modifications to the Jenkinsfiles recently. In that case, your PR will show a merge conflict. The new Jenkinsfiles are available at [1].

Yesterday, I indexed all PRs with our CI system to make sure that each one gets properly validated and our merge processes don't get impaired. Everything looks good so far, but due to the flakiness of our tests, it's quite unlikely that every single test has passed. If your particular PR shows a failure for a certain test, please follow the usual procedure and retrigger it by pushing another commit. From now on, you can also trigger partial runs of the CI: just ask a committer, and they will be happy to trigger that specific job on your behalf.

If somebody in the community is interested, we would also be happy to collaborate on a bot that allows controlling CI runs: retriggering certain jobs, requesting additional non-PR jobs to run (e.g. when you made changes to nightly), etc.

Thanks, everybody, for being so patient and collaborative during this transition. I'm looking forward to everybody's contributions.

Best regards,
Marco

[1]: https://github.com/apache/incubator-mxnet/tree/master/ci/jenkins

On Sat, Dec 1, 2018 at 4:27 AM Marco de Abreu wrote:
> [...]
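The "retrigger it by pushing another commit" step above does not require touching any code. A minimal sketch, assuming your PR branch is called `my-pr-branch` (a placeholder name):

```shell
# Retrigger CI on an open PR by pushing an empty commit.
# "my-pr-branch" is a placeholder; use your actual PR branch name.
git checkout my-pr-branch
git commit --allow-empty -m "Empty commit to retrigger CI"
git push origin my-pr-branch
```

An empty commit changes the head SHA, which is what causes Jenkins to schedule a fresh run, without cluttering the diff.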
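For anyone in the community who wants to pick up the bot idea, the comment-parsing core could start as small as this sketch. The `@mxnet-bot` trigger phrase and the job names are hypothetical, and a real bot would additionally need to call the Jenkins and GitHub APIs; this only shows how requested jobs might be extracted from a PR comment:

```python
import re

# Hypothetical set of job names the bot would accept.
KNOWN_JOBS = {"unix-cpu", "unix-gpu", "windows-cpu", "windows-gpu", "nightly"}

def parse_ci_command(comment):
    """Parse a PR comment like '@mxnet-bot run ci [unix-cpu, unix-gpu]'.

    Returns the list of requested jobs, or None if the comment is not
    addressed to the bot. Unknown job names raise ValueError so the bot
    can reply with a helpful message instead of silently ignoring them.
    """
    match = re.match(r"@mxnet-bot\s+run\s+ci\s+\[([^\]]*)\]", comment.strip())
    if not match:
        return None
    jobs = [job.strip() for job in match.group(1).split(",") if job.strip()]
    unknown = [job for job in jobs if job not in KNOWN_JOBS]
    if unknown:
        raise ValueError("unknown jobs: %s" % unknown)
    return jobs
```

Keeping the parser a pure function makes it easy to unit-test separately from the webhook and Jenkins plumbing.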
Re: CI impaired
Thanks, Naveen and Gavin!

#1 has been completed, and every job has finished its processing.

#2 is the ticket with Infra:
https://issues.apache.org/jira/browse/INFRA-17346

I'm now waiting for their response.

-Marco

On Fri, Nov 30, 2018 at 8:25 PM Naveen Swamy wrote:
> [...]
Re: CI impaired
Hi Marco/Gavin,

Thanks for the clarification. I was not aware that it had been tested in a separate test environment (this is what I was suggesting: make the changes in a more controlled manner). Last time the change was made, many PRs were left dangling and developers had to go retrigger them; I retriggered mine at least five times before it succeeded today.

Appreciate all the hard work to make CI better.

-Naveen

On Fri, Nov 30, 2018 at 8:50 AM Gavin M. Bell wrote:
> [...]
Re: CI impaired
Hey folks,

Marco has been running this change in dev, with flying colors, for some time. This is not an experiment but a rollout that was announced. We also decided to make this change after the release cut to limit the blast radius on any critical obligations to the community. Marco is accountable for this work and will address any issues that may occur, as he has been put on-call. We have, to the best of our ability, mitigated as much risk as possible, and now it is time to pull the trigger. The community will enjoy a bit more visibility and clarity into the test process, which will be advantageous, as well as allowing us to extend our infrastructure in a way that affords us more flexibility.

No pending PRs will be impacted.

Thank you for your support as we evolve this system to better serve the community.

-Gavin

On Fri, Nov 30, 2018 at 5:23 PM Marco de Abreu wrote:

> Hello Naveen, this is not an experiment. Everything has been tested in our
> test system and is considered working 100%. This is not a test but actually
> the move into production - the merge into master happened a week ago. We
> now just have to put all PRs into the catalogue, which means that all PRs
> have to be analyzed with the new pipelines - the only thing that will be
> noticeable is that the CI is under higher load.
>
> The pending PRs will not be impacted. The existing pipeline is still
> running in parallel and everything will behave as before.
>
> -Marco
>
> On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy wrote:
>
> > Marco, run your experiments on a branch - set up, test it well and then
> > bring it to the master.
> >
> > [...]

--
Sincerely,
Gavin M. Bell

"Never mistake a clear view for a short distance." -Paul Saffo
Re: CI impaired
There are still pending PRs that need to be merged and cherry-picked to the branch.

On Nov 30, 2018, at 6:53 AM, Marco de Abreu wrote:
> [...]
Re: CI impaired
Hello,

I'm now moving forward with #1. I will try to get to #3 as soon as possible to reduce parallel jobs in our CI. You might notice some unfinished jobs; I will let you know as soon as this process has been completed. Until then, please bear with me, since we have hundreds of jobs to run in order to validate all PRs.

Best regards,
Marco

On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu wrote:
> [...]
Re: CI impaired
Hello,

Since the release branch has now been cut, I would like to move forward with the CI improvements for the master branch. This would include the following actions:
1. Re-enable the new Jenkins job
2. Request Apache Infra to move the protected-branch check from the main pipeline to our new ones
3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this finalizes the deprecation process

If nobody objects, I would like to start with #1 soon. Mentors, could you please assist in creating the Apache Infra ticket? I would then take it from there and talk to Infra.

Best regards,
Marco

On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
> [...]
shortly after the 1.4 > > release. > > > > This means that from now until the 1.4 release, in order to reduce > > complexity MXNet developers should only see a single Jenkins verification > > check, and a single Travis check. > > > > >
Re: CI impaired
Sorry, [1] meant to reference https://issues.jenkins-ci.org/browse/JENKINS-37984 . On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland < kellen.sunderl...@gmail.com> wrote: > Marco and I ran into another urgent issue over the weekend that was > causing builds to fail. This issue was unrelated to any feature > development work, or other CI fixes applied recently, but it did require > quite a bit of work from Marco (and a little from me) to fix. > > We spent enough time on the problem that it caused us to take a step back > and consider how we could both fix issues in CI and support the 1.4 release > with the least impact possible on MXNet devs. Marco had planned to make a > significant change to the CI to fix a long-standing Jenkins error [1], but > we feel that most developers would prioritize having a stable build > environment for the next few weeks over having this fix in place. > > To properly introduce a new CI system the intent was to do a gradual > blue/green roll out of the fix. To manage this rollout would have taken > operational effort and double compute load as we run systems in parallel. > This risks outages due to scaling limits, and we’d rather make this change > during a period of low-developer activity, i.e. shortly after the 1.4 > release. > > This means that from now until the 1.4 release, in order to reduce > complexity MXNet developers should only see a single Jenkins verification > check, and a single Travis check. > >
Re: CI impaired
Marco and I ran into another urgent issue over the weekend that was causing builds to fail. This issue was unrelated to any feature development work or other CI fixes applied recently, but it did require quite a bit of work from Marco (and a little from me) to fix.

We spent enough time on the problem that it caused us to take a step back and consider how we could both fix issues in CI and support the 1.4 release with the least possible impact on MXNet devs. Marco had planned to make a significant change to the CI to fix a long-standing Jenkins error [1], but we feel that most developers would prioritize having a stable build environment for the next few weeks over having this fix in place.

To properly introduce a new CI system, the intent was to do a gradual blue/green rollout of the fix. Managing this rollout would have taken operational effort and double the compute load while we ran both systems in parallel. This risks outages due to scaling limits, and we’d rather make this change during a period of low developer activity, i.e. shortly after the 1.4 release.

This means that from now until the 1.4 release, in order to reduce complexity, MXNet developers should only see a single Jenkins verification check and a single Travis check.
Re: CI impaired
Hi Marco - I suggest retriggering PRs, if needed in stages:
- pr-awaiting-merge
- pr-awaiting-review

That would cover 78 PRs. In any case I would exclude pr-work-in-progress.

Steffen

On Sat, Nov 24, 2018 at 9:11 PM kellen sunderland < kellen.sunderl...@gmail.com> wrote: > Hey Marco, I'm still having quite a few issues passing PRs. Would you be > able to at least test a handful of PRs and make sure they pass/fail tests > as you expect? > > On Sat, Nov 24, 2018, 7:01 PM Marco de Abreu > > > Hello Steffen, > > > > thank you for bringing up these PRs. > > > > I had to abort the builds during the outage which means that the jobs > > didn't finish and not even the status propagation could have finished > > (hence they show pending instead of failure or aborted). > > > > Recently, we merged a PR that adds utility slaves. This will ensure that > > status updates will always be posted, no matter whether the main queue > > hangs or not. This means that the status would then be properly reflected > > and there should be no hanging pending runs. > > > > I could retrigger all PRs to kick off another round of validation, but > this > > would result in 240 jobs (2 main pipelines times 120 open PRs) to run. > > Since we are currently in the pre-release stage, I wanted to avoid > putting > > the system under such heavy load. > > > > Instead, I'd kindly like to request the PR creators to make a new commit > to > > trigger the pipelines. In order to merge a PR, only PR-merge has to pass > > and I tried to retrigger all PRs that have been aborted during the > outage. > > It might have been possible that I missed a few. > > > > Since it's still the weekend and there's not much going on, I can use the > > time to trigger all PRs. Please advise whether you think I should move > > forward (I expect the CI to finish all PRs within 6-10 hours) or if it's > > fine to ask people to retrigger themselves. > > > > Please excuse the caused inconveniences.
> > > > Best regards, > > Marco > > > > > > Am So., 25. Nov. 2018, 03:48 hat Steffen Rochel > > > geschrieben: > > > > > Thanks Marco for the updates and resolving the issues. > > > However, I do see a number of PR waiting to be merged with inconsistent > > PR > > > validation status check. > > > E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9 > > pending > > > checks being queued. However, when you look at the details, either the > > > checks have passed or failed (centos-cpu, edge, unix-cpu, window-cpu, > > > windows-gpu failed, required pr-merge which includes edge, gpu tests > > > passed). > > > Similar also for other PR with label pr-awaiting-merge ( > > > > > > > > > https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge > > > ) > > > Please advice on resolution. > > > > > > Regards, > > > Steffen > > > > > > On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu > > > wrote: > > > > > > > Thanks everybody, I really appreciate it! > > > > > > > > Today was a good day, there were no incidents and everything appears > to > > > be > > > > stable. In the meantime I did a deep dive on why we has such a > > > significant > > > > performance decrease with of our compilation jobs - which then > clogged > > up > > > > the queue and resulted in 1000 jobs waiting to be scheduled. > > > > > > > > The reason was the way how we use ccache to speed up our compilation > > > jobs. > > > > Usually, this yields us a huge performance improvement (CPU openblas, > > for > > > > example, goes from 30 minutes down to ~3min, ARMv7 from 30 minutes > down > > > to > > > > ~1.5min, etc.). Unfortunately in this case, ccache was our limiting > > > factor. > > > > Here's some background about how we operate our cache: > > > > > > > > We use EFS to have a distributed ccache between all of our > > > > unrestricted-prod-slaves. 
EFS is classified for almost unlimited > > > > scalability (being consumed by thousands of instances in parallel > [1]) > > > with > > > > a theoretical throughput of over 10Gbps. One thing I didn't know > when I > > > > designed this approach was the method how throughput is being > granted. > > > > Similar to T2-CPU-Credits, EFS uses BurstCredits to allow you higher > > > > throughput (default is 50MiB/s) [2]. Due to the high load, we > consumed > > > all > > > > of our credits - here's a very interesting graph: [3]. > > > > > > > > To avoid similar incidents in future, I have taken the following > > actions: > > > > 1. I switched EFS from burst-mode to provisioned throughput with > > 300MB/s > > > > (in the graph at [3] you can see how our IO immediately increases - > and > > > > thus our CI gets faster - as soon as I added provisioned throughput). > > > > 2. I created internal follow-up tickets to add monitoring and > automated > > > > actions. > > > > > > > > First, we should be notified if we are running low on credits to > > kick-off > > > > an investigation. Second (nice to have), we could have a > > lambda-function
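The staged retrigger Steffen proposes could be driven from the GitHub search API. A minimal sketch, assuming the repository name from the thread and GitHub's standard issue-search qualifiers; actually retriggering each PR (e.g. by pushing an empty commit) is left to the caller:

```python
# Sketch: build the GitHub issue-search query that selects the open PRs a
# staged retrigger would target, one label stage at a time (hypothetical
# helper, not the CI's actual tooling).

def retrigger_query(labels, repo="apache/incubator-mxnet"):
    """Return the GitHub search-API `q` value for open PRs carrying `labels`."""
    parts = [f"repo:{repo}", "is:pr", "is:open"]
    parts += [f"label:{label}" for label in labels]
    return "+".join(parts)

# Stage 1 and 2 of the proposed staged retrigger:
stage1 = retrigger_query(["pr-awaiting-merge"])
stage2 = retrigger_query(["pr-awaiting-review"])
```

The resulting string can be passed as the `q` parameter of `GET https://api.github.com/search/issues` to enumerate the PRs in each stage before touching them.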
Re: CI impaired
Hey Marco, I'm still having quite a few issues passing PRs. Would you be able to at least test a handful of PRs and make sure they pass/fail tests as you expect? On Sat, Nov 24, 2018, 7:01 PM Marco de Abreu Hello Steffen, > > thank you for bringing up these PRs. > > I had to abort the builds during the outage which means that the jobs > didn't finish and not even the status propagation could have finished > (hence they show pending instead of failure or aborted). > > Recently, we merged a PR that adds utility slaves. This will ensure that > status updates will always be posted, no matter whether the main queue > hangs or not. This means that the status would then be properly reflected > and there should be no hanging pending runs. > > I could retrigger all PRs to kick off another round of validation, but this > would result in 240 jobs (2 main pipelines times 120 open PRs) to run. > Since we are currently in the pre-release stage, I wanted to avoid putting > the system under such heavy load. > > Instead, I'd kindly like to request the PR creators to make a new commit to > trigger the pipelines. In order to merge a PR, only PR-merge has to pass > and I tried to retrigger all PRs that have been aborted during the outage. > It might have been possible that I missed a few. > > Since it's still the weekend and there's not much going on, I can use the > time to trigger all PRs. Please advise whether you think I should move > forward (I expect the CI to finish all PRs within 6-10 hours) or if it's > fine to ask people to retrigger themselves. > > Please excuse the caused inconveniences. > > Best regards, > Marco > > > Am So., 25. Nov. 2018, 03:48 hat Steffen Rochel > geschrieben: > > > Thanks Marco for the updates and resolving the issues. > > However, I do see a number of PR waiting to be merged with inconsistent > PR > > validation status check. > > E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9 > pending > > checks being queued. 
However, when you look at the details, either the > > checks have passed or failed (centos-cpu, edge, unix-cpu, window-cpu, > > windows-gpu failed, required pr-merge which includes edge, gpu tests > > passed). > > Similar also for other PR with label pr-awaiting-merge ( > > > > > https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge > > ) > > Please advice on resolution. > > > > Regards, > > Steffen > > > > On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu > > wrote: > > > > > Thanks everybody, I really appreciate it! > > > > > > Today was a good day, there were no incidents and everything appears to > > be > > > stable. In the meantime I did a deep dive on why we has such a > > significant > > > performance decrease with of our compilation jobs - which then clogged > up > > > the queue and resulted in 1000 jobs waiting to be scheduled. > > > > > > The reason was the way how we use ccache to speed up our compilation > > jobs. > > > Usually, this yields us a huge performance improvement (CPU openblas, > for > > > example, goes from 30 minutes down to ~3min, ARMv7 from 30 minutes down > > to > > > ~1.5min, etc.). Unfortunately in this case, ccache was our limiting > > factor. > > > Here's some background about how we operate our cache: > > > > > > We use EFS to have a distributed ccache between all of our > > > unrestricted-prod-slaves. EFS is classified for almost unlimited > > > scalability (being consumed by thousands of instances in parallel [1]) > > with > > > a theoretical throughput of over 10Gbps. One thing I didn't know when I > > > designed this approach was the method how throughput is being granted. > > > Similar to T2-CPU-Credits, EFS uses BurstCredits to allow you higher > > > throughput (default is 50MiB/s) [2]. Due to the high load, we consumed > > all > > > of our credits - here's a very interesting graph: [3]. 
> > > > > > To avoid similar incidents in future, I have taken the following > actions: > > > 1. I switched EFS from burst-mode to provisioned throughput with > 300MB/s > > > (in the graph at [3] you can see how our IO immediately increases - and > > > thus our CI gets faster - as soon as I added provisioned throughput). > > > 2. I created internal follow-up tickets to add monitoring and automated > > > actions. > > > > > > First, we should be notified if we are running low on credits to > kick-off > > > an investigation. Second (nice to have), we could have a > lambda-function > > > which listens for that event and automatically switches the EFS volume > > from > > > burst-mode to provisioned throughput during high-load-times. The > required > > > throughput could be retrieved via CloudWatch and then multiplied by a > > > factor. EFS allows to downgrade the throughput mode 24h after the last > > > changes (to reduce capacity if the load is over) and always allows to > > > upgrade the provisioned capacity (if the load goes even higher). I've > > been > > > looking for a pre-made
Re: CI impaired
Hello Steffen,

thank you for bringing up these PRs.

I had to abort the builds during the outage, which means that the jobs didn't finish and not even the status propagation could have finished (hence they show pending instead of failure or aborted).

Recently, we merged a PR that adds utility slaves. This will ensure that status updates will always be posted, no matter whether the main queue hangs or not. This means that the status would then be properly reflected and there should be no hanging pending runs.

I could retrigger all PRs to kick off another round of validation, but this would result in 240 jobs (2 main pipelines times 120 open PRs) to run. Since we are currently in the pre-release stage, I wanted to avoid putting the system under such heavy load.

Instead, I'd kindly like to request the PR creators to make a new commit to trigger the pipelines. In order to merge a PR, only PR-merge has to pass, and I tried to retrigger all PRs that were aborted during the outage. I may have missed a few.

Since it's still the weekend and there's not much going on, I can use the time to trigger all PRs. Please advise whether you think I should move forward (I expect the CI to finish all PRs within 6-10 hours) or if it's fine to ask people to retrigger themselves.

Please excuse the inconvenience.

Best regards,
Marco

On Sun, Nov 25, 2018 at 03:48 Steffen Rochel wrote: > Thanks Marco for the updates and resolving the issues. > However, I do see a number of PR waiting to be merged with inconsistent PR > validation status check. > E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9 pending > checks being queued. However, when you look at the details, either the > checks have passed or failed (centos-cpu, edge, unix-cpu, window-cpu, > windows-gpu failed, required pr-merge which includes edge, gpu tests > passed).
> Similar also for other PR with label pr-awaiting-merge ( > > https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge > ) > Please advice on resolution. > > Regards, > Steffen > > On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu > wrote: > > > Thanks everybody, I really appreciate it! > > > > Today was a good day, there were no incidents and everything appears to > be > > stable. In the meantime I did a deep dive on why we has such a > significant > > performance decrease with of our compilation jobs - which then clogged up > > the queue and resulted in 1000 jobs waiting to be scheduled. > > > > The reason was the way how we use ccache to speed up our compilation > jobs. > > Usually, this yields us a huge performance improvement (CPU openblas, for > > example, goes from 30 minutes down to ~3min, ARMv7 from 30 minutes down > to > > ~1.5min, etc.). Unfortunately in this case, ccache was our limiting > factor. > > Here's some background about how we operate our cache: > > > > We use EFS to have a distributed ccache between all of our > > unrestricted-prod-slaves. EFS is classified for almost unlimited > > scalability (being consumed by thousands of instances in parallel [1]) > with > > a theoretical throughput of over 10Gbps. One thing I didn't know when I > > designed this approach was the method how throughput is being granted. > > Similar to T2-CPU-Credits, EFS uses BurstCredits to allow you higher > > throughput (default is 50MiB/s) [2]. Due to the high load, we consumed > all > > of our credits - here's a very interesting graph: [3]. > > > > To avoid similar incidents in future, I have taken the following actions: > > 1. I switched EFS from burst-mode to provisioned throughput with 300MB/s > > (in the graph at [3] you can see how our IO immediately increases - and > > thus our CI gets faster - as soon as I added provisioned throughput). > > 2. 
I created internal follow-up tickets to add monitoring and automated > > actions. > > > > First, we should be notified if we are running low on credits to kick-off > > an investigation. Second (nice to have), we could have a lambda-function > > which listens for that event and automatically switches the EFS volume > from > > burst-mode to provisioned throughput during high-load-times. The required > > throughput could be retrieved via CloudWatch and then multiplied by a > > factor. EFS allows to downgrade the throughput mode 24h after the last > > changes (to reduce capacity if the load is over) and always allows to > > upgrade the provisioned capacity (if the load goes even higher). I've > been > > looking for a pre-made CloudFormation template to facilitate that, but so > > far, I haven't been able to find it. > > > > I'm now running additional load tests on our test CI environment to > detect > > other potential bottlenecks. > > > > Thanks a lot for your support! > > > > Best regards, > > Marco > > > > [1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html > > [2]: > > >
Re: CI impaired
Thanks Marco for the updates and resolving the issues. However, I do see a number of PRs waiting to be merged with inconsistent PR validation status checks. E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9 pending checks being queued; however, when you look at the details, the checks have either passed or failed (centos-cpu, edge, unix-cpu, window-cpu, windows-gpu failed; the required pr-merge, which includes the edge and gpu tests, passed). The same applies to other PRs with the label pr-awaiting-merge ( https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge ). Please advise on resolution.

Regards,
Steffen

On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu wrote: > Thanks everybody, I really appreciate it! > > Today was a good day, there were no incidents and everything appears to be > stable. In the meantime I did a deep dive on why we has such a significant > performance decrease with of our compilation jobs - which then clogged up > the queue and resulted in 1000 jobs waiting to be scheduled. > > The reason was the way how we use ccache to speed up our compilation jobs. > Usually, this yields us a huge performance improvement (CPU openblas, for > example, goes from 30 minutes down to ~3min, ARMv7 from 30 minutes down to > ~1.5min, etc.). Unfortunately in this case, ccache was our limiting factor. > Here's some background about how we operate our cache: > > We use EFS to have a distributed ccache between all of our > unrestricted-prod-slaves. EFS is classified for almost unlimited > scalability (being consumed by thousands of instances in parallel [1]) with > a theoretical throughput of over 10Gbps. One thing I didn't know when I > designed this approach was the method how throughput is being granted. > Similar to T2-CPU-Credits, EFS uses BurstCredits to allow you higher > throughput (default is 50MiB/s) [2]. Due to the high load, we consumed all > of our credits - here's a very interesting graph: [3].
> > To avoid similar incidents in future, I have taken the following actions: > 1. I switched EFS from burst-mode to provisioned throughput with 300MB/s > (in the graph at [3] you can see how our IO immediately increases - and > thus our CI gets faster - as soon as I added provisioned throughput). > 2. I created internal follow-up tickets to add monitoring and automated > actions. > > First, we should be notified if we are running low on credits to kick-off > an investigation. Second (nice to have), we could have a lambda-function > which listens for that event and automatically switches the EFS volume from > burst-mode to provisioned throughput during high-load-times. The required > throughput could be retrieved via CloudWatch and then multiplied by a > factor. EFS allows to downgrade the throughput mode 24h after the last > changes (to reduce capacity if the load is over) and always allows to > upgrade the provisioned capacity (if the load goes even higher). I've been > looking for a pre-made CloudFormation template to facilitate that, but so > far, I haven't been able to find it. > > I'm now running additional load tests on our test CI environment to detect > other potential bottlenecks. > > Thanks a lot for your support! > > Best regards, > Marco > > [1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html > [2]: > https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes > [3]: https://i.imgur.com/nboQLOn.png > > On Thu, Nov 22, 2018 at 1:40 AM Qing Lan wrote: > > > Appreciated for your effort and help to make CI a better place! > > > > Qing > > > > On 11/21/18, 4:38 PM, "Lin Yuan" wrote: > > > > Thanks for your efforts, Marco! > > > > On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian < > > anirudh2...@gmail.com> > > wrote: > > > > > Thanks for the quick response and mitigation! 
> > > > > > On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu > > > wrote: > > > > > > > Hello, > > > > > > > > today, CI had some issues and I had to cancel all jobs a few > > minutes ago. > > > > This was basically caused by the high load that is currently > being > > put on > > > > our CI system due to the pre-release efforts for this Friday. > > > > > > > > It's really unfortunate that we just had outages of three core > > components > > > > within the last two days - sorry about that!. To recap, we had > the > > > > following outages (which are unrelated to the parallel refactor > of > > the > > > > Jenkins pipeline): > > > > - (yesterday evening) The Jenkins master ran out of disk space > and > > thus > > > > processed requests at reduced capacity > > > > - (this morning) The Jenkins master got updated which broke our > > > > autoscalings upscaling capabilities. > > > > - (new, this evening) Jenkins API was irresponsive: Due to the > high > > > number > > > > of jobs and a bad API design in the Jenkins REST API, the > > time-complexity > > > >
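The thread does not spell out how the quadratic REST calls were reduced to O(1); one standard Jenkins technique for avoiding full graph serialization is the `tree` query parameter, which asks the server for only the fields you need. A hedged sketch (the host name is a placeholder, not the project's actual CI endpoint):

```python
from urllib.parse import urlencode

def queue_url(base="https://jenkins.example.com"):
    """URL for the build queue that returns only item ids and queue reasons."""
    # Without `tree`, /queue/api/json serializes every queue item in full;
    # restricting the fields keeps the response - and the server-side work -
    # small even with ~1000 queued jobs.
    return f"{base}/queue/api/json?{urlencode({'tree': 'items[id,why]'})}"
```

An auto-scaling loop would fetch this URL once per iteration instead of issuing per-node requests that unroll the whole slave/job graph.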
Re: CI impaired
Thanks everybody, I really appreciate it!

Today was a good day, there were no incidents and everything appears to be stable. In the meantime I did a deep dive on why we had such a significant performance decrease in our compilation jobs - which then clogged up the queue and resulted in 1000 jobs waiting to be scheduled.

The reason was the way we use ccache to speed up our compilation jobs. Usually, this yields us a huge performance improvement (CPU openblas, for example, goes from 30 minutes down to ~3min, ARMv7 from 30 minutes down to ~1.5min, etc.). Unfortunately, in this case ccache was our limiting factor. Here's some background about how we operate our cache:

We use EFS to have a distributed ccache between all of our unrestricted-prod-slaves. EFS is classified for almost unlimited scalability (being consumed by thousands of instances in parallel [1]) with a theoretical throughput of over 10Gbps. One thing I didn't know when I designed this approach was how throughput is granted. Similar to T2 CPU credits, EFS uses BurstCredits to allow you higher throughput (default is 50MiB/s) [2]. Due to the high load, we consumed all of our credits - here's a very interesting graph: [3].

To avoid similar incidents in the future, I have taken the following actions:
1. I switched EFS from burst-mode to provisioned throughput with 300MB/s (in the graph at [3] you can see how our IO immediately increases - and thus our CI gets faster - as soon as I added provisioned throughput).
2. I created internal follow-up tickets to add monitoring and automated actions.

First, we should be notified if we are running low on credits to kick off an investigation. Second (nice to have), we could have a lambda function which listens for that event and automatically switches the EFS volume from burst-mode to provisioned throughput during high-load times. The required throughput could be retrieved via CloudWatch and then multiplied by a factor.
EFS allows to downgrade the throughput mode 24h after the last changes (to reduce capacity if the load is over) and always allows to upgrade the provisioned capacity (if the load goes even higher). I've been looking for a pre-made CloudFormation template to facilitate that, but so far, I haven't been able to find it. I'm now running additional load tests on our test CI environment to detect other potential bottlenecks. Thanks a lot for your support! Best regards, Marco [1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html [2]: https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes [3]: https://i.imgur.com/nboQLOn.png On Thu, Nov 22, 2018 at 1:40 AM Qing Lan wrote: > Appreciated for your effort and help to make CI a better place! > > Qing > > On 11/21/18, 4:38 PM, "Lin Yuan" wrote: > > Thanks for your efforts, Marco! > > On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian < > anirudh2...@gmail.com> > wrote: > > > Thanks for the quick response and mitigation! > > > > On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu > > wrote: > > > > > Hello, > > > > > > today, CI had some issues and I had to cancel all jobs a few > minutes ago. > > > This was basically caused by the high load that is currently being > put on > > > our CI system due to the pre-release efforts for this Friday. > > > > > > It's really unfortunate that we just had outages of three core > components > > > within the last two days - sorry about that!. To recap, we had the > > > following outages (which are unrelated to the parallel refactor of > the > > > Jenkins pipeline): > > > - (yesterday evening) The Jenkins master ran out of disk space and > thus > > > processed requests at reduced capacity > > > - (this morning) The Jenkins master got updated which broke our > > > autoscalings upscaling capabilities. 
> > > - (new, this evening) Jenkins API was irresponsive: Due to the high > > number > > > of jobs and a bad API design in the Jenkins REST API, the > time-complexity > > > of a simple create or delete request was quadratic which resulted > in all > > > requests timing out (that was the current outage). This resulted > in our > > > auto scaling to be unable to interface with the Jenkins master. > > > > > > I have now made improvements to our REST API calls which reduced > the > > > complexity from O(N^2) to O(1). The reason was an underlying > redirect > > loop > > > in the Jenkins createNode and deleteNode REST API in combination > with > > > unrolling the entire slave and job graph (which got quite huge > during > > > extensive load) upon every single request. Since we had about 150 > > > registered slaves and 1000 jobs in the queue, the duration for a > single > > > REST API call rose to up to 45 seconds (we execute up to a few > hundred > > > queries per auto scaling loop). This lead to our auto scaling > timing
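The mitigation Marco describes - alarm on a low BurstCreditBalance, then flip the volume to provisioned throughput - could be sketched with boto3 roughly as follows. The filesystem id and the low-water mark are assumptions; the 300 MiB/s figure is taken from the incident description:

```python
def pick_throughput_mode(burst_credit_balance, low_water_mark=1.0e12):
    """Decide the EFS mode from the CloudWatch BurstCreditBalance metric (bytes).

    The low-water mark is an assumed threshold, not the CI's actual value.
    """
    return "provisioned" if burst_credit_balance < low_water_mark else "bursting"

def apply_mode(filesystem_id, mode, mibps=300):
    # The real AWS call; imported inside the function so the sketch stays
    # importable without boto3 installed.
    import boto3
    efs = boto3.client("efs")
    if mode == "provisioned":
        efs.update_file_system(
            FileSystemId=filesystem_id,
            ThroughputMode="provisioned",
            ProvisionedThroughputInMibps=mibps,
        )
    else:
        efs.update_file_system(FileSystemId=filesystem_id, ThroughputMode="bursting")
```

A Lambda subscribed to the CloudWatch alarm could call `apply_mode(fs_id, pick_throughput_mode(balance))`; note that, as the thread mentions, EFS only permits lowering the throughput mode again 24h after the last change.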
Re: CI impaired
Appreciate your effort and help in making CI a better place! Qing On 11/21/18, 4:38 PM, "Lin Yuan" wrote: Thanks for your efforts, Marco! On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian wrote: > Thanks for the quick response and mitigation! > > On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu > wrote: > > > Hello, > > > > today, CI had some issues and I had to cancel all jobs a few minutes ago. > > This was basically caused by the high load that is currently being put on > > our CI system due to the pre-release efforts for this Friday. > > > > It's really unfortunate that we just had outages of three core components > > within the last two days - sorry about that!. To recap, we had the > > following outages (which are unrelated to the parallel refactor of the > > Jenkins pipeline): > > - (yesterday evening) The Jenkins master ran out of disk space and thus > > processed requests at reduced capacity > > - (this morning) The Jenkins master got updated which broke our > > autoscalings upscaling capabilities. > > - (new, this evening) Jenkins API was irresponsive: Due to the high > number > > of jobs and a bad API design in the Jenkins REST API, the time-complexity > > of a simple create or delete request was quadratic which resulted in all > > requests timing out (that was the current outage). This resulted in our > > auto scaling to be unable to interface with the Jenkins master. > > > > I have now made improvements to our REST API calls which reduced the > > complexity from O(N^2) to O(1). The reason was an underlying redirect > loop > > in the Jenkins createNode and deleteNode REST API in combination with > > unrolling the entire slave and job graph (which got quite huge during > > extensive load) upon every single request. Since we had about 150 > > registered slaves and 1000 jobs in the queue, the duration for a single > > REST API call rose to up to 45 seconds (we execute up to a few hundred > > queries per auto scaling loop). 
This lead to our auto scaling timing out. > > > > Everything should be back to normal now. I'm closely observing the > > situation and I'll let you know if I encounter any additional issues. > > > > Again, sorry for any caused inconveniences. > > > > Best regards, > > Marco > > > > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell > > wrote: > > > > > Yes, let me add to the kudos, very nice work Marco. > > > > > > > > > "I'm trying real hard to be the shepherd." -Jules Winnfield > > > > > > > > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen > > > wrote: > > > > > > > > Appreciate the big effort in bring the CI back so quickly. Thanks > > Marco. > > > > > > > > On Nov 21, 2018 5:52 AM, Marco de Abreu < > marco.g.ab...@googlemail.com > > .INVALID> > > > wrote: > > > > Thanks Aaron! Just for the record, the new Jenkins jobs were > unrelated > > to > > > > that incident. > > > > > > > > If somebody is interested in the details around the outage: > > > > > > > > Due to a required maintenance (disk running full), we had to upgrade > > our > > > > Jenkins master because it was running on Ubuntu 17.04 (for an unknown > > > > reason, it used to be 16.04) and we needed to install some packages. > > > Since > > > > the support for Ubuntu 17.04 was stopped, this resulted in all > package > > > > updates and installations to fail because the repositories were taken > > > > offline. Due to the unavailable maintenance package and other issues > > with > > > > the installed OpenJDK8 version, we made the decision to upgrade the > > > Jenkins > > > > master to Ubuntu 18.04 LTS in order to get back to a supported > version > > > with > > > > maintenance tools. During this upgrade, Jenkins was automatically > > updated > > > > by APT as part of the dist-upgrade process. > > > > > > > > In the latest version of Jenkins, some labels have been changed which > > we > > > > depend on for our auto scaling. 
To be more specific: > > > >> Waiting for next available executor on mxnetlinux-gpu > > > > has been changed to > > > >> Waiting for next available executor on ‘mxnetlinux-gpu’ > > > > Notice the quote characters. > > > > > > > > Jenkins does not offer a better way than to parse these messages > > > > unfortunately - there's no standardized way to express queue items. > > Since > > > > our parser expected the above message without quote signs, this > message > > > was > > > > discarded. > > > > > > > > We support various queue reasons (5 of them to be exact) that > indicate > > > > resource starvation. If we run super low on capacity,
Re: CI impaired
Hello,

today CI had some issues and I had to cancel all jobs a few minutes ago. This was caused by the high load currently being put on our CI system due to the pre-release efforts for this Friday. It's really unfortunate that we had outages of three core components within the last two days - sorry about that!

To recap, we had the following outages (all unrelated to the parallel refactor of the Jenkins pipeline):

- (yesterday evening) The Jenkins master ran out of disk space and thus processed requests at reduced capacity.
- (this morning) The Jenkins master got updated, which broke the upscaling capability of our auto scaling.
- (new, this evening) The Jenkins REST API was unresponsive: due to the high number of jobs and a bad design in the Jenkins REST API, the time complexity of a simple create or delete request was quadratic, which caused all requests to time out (that was the current outage). This left our auto scaling unable to interface with the Jenkins master.

I have now made improvements to our REST API calls which reduced the complexity from O(N^2) to O(1). The root cause was an underlying redirect loop in the Jenkins createNode and deleteNode REST API, in combination with unrolling the entire slave and job graph (which got quite huge under heavy load) on every single request. Since we had about 150 registered slaves and 1,000 jobs in the queue, a single REST API call took up to 45 seconds (and we execute up to a few hundred queries per auto-scaling loop). This led to our auto scaling timing out.

Everything should be back to normal now. I'm closely observing the situation and I'll let you know if I encounter any additional issues.

Again, sorry for any inconvenience caused.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell wrote:
> Yes, let me add to the kudos, very nice work Marco.
>
> "I'm trying real hard to be the shepherd."
-Jules Winnfield
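For anyone curious what the O(N^2) -> O(1) fix above looks like in practice, here is a minimal, hypothetical sketch of the pattern (function and field names are illustrative, not the actual MXNet CI code): instead of re-fetching the entire slave/job graph for every per-node decision, fetch one snapshot per scaling loop and answer all lookups from it.

```python
# Hypothetical sketch of the complexity fix described above. fetch_graph()
# stands in for the expensive Jenkins REST call that unrolls the entire
# slave and job graph on every request.

def naive_idle_slaves(fetch_graph, slave_names):
    """One full-graph fetch per slave: N requests, each O(N) to serve."""
    idle, fetches = [], 0
    for name in slave_names:
        graph = fetch_graph()  # expensive call repeated on every iteration
        fetches += 1
        if graph[name]["idle"]:
            idle.append(name)
    return idle, fetches

def snapshot_idle_slaves(fetch_graph, slave_names):
    """One fetch per scaling loop; all lookups hit the cached snapshot."""
    graph = fetch_graph()
    idle = [name for name in slave_names if graph[name]["idle"]]
    return idle, 1
```

With roughly 150 registered slaves, the naive variant issues 150 expensive requests per scaling loop while the snapshot variant issues one - the kind of reduction described above.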
Re: CI impaired
Yes, let me add to the kudos, very nice work Marco.

"I'm trying real hard to be the shepherd." -Jules Winnfield

> On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen wrote:
>
> Appreciate the big effort in bringing the CI back so quickly. Thanks Marco.
Re: CI impaired
Appreciate the big effort in bringing the CI back so quickly. Thanks Marco.

On Nov 21, 2018 5:52 AM, Marco de Abreu wrote:
> Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
> that incident.
Re: CI impaired
Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to that incident.

If somebody is interested in the details around the outage:

Due to required maintenance (a disk running full), we had to upgrade our Jenkins master because it was running on Ubuntu 17.04 (for an unknown reason - it used to be 16.04) and we needed to install some packages. Since support for Ubuntu 17.04 had ended, all package updates and installations failed because the repositories had been taken offline. Due to the unavailable maintenance packages and other issues with the installed OpenJDK 8 version, we decided to upgrade the Jenkins master to Ubuntu 18.04 LTS in order to get back to a supported version with maintenance tools. During this upgrade, Jenkins was automatically updated by APT as part of the dist-upgrade process.

In the latest version of Jenkins, some messages we depend on for our auto scaling have changed. To be more specific:

> Waiting for next available executor on mxnetlinux-gpu

has been changed to

> Waiting for next available executor on ‘mxnetlinux-gpu’

Notice the quote characters.

Unfortunately, Jenkins does not offer a better way than parsing these messages - there's no standardized way to express queue items. Since our parser expected the message without quote characters, the new message was discarded.

We support various queue reasons (five of them, to be exact) that indicate resource starvation. If we run very low on capacity, the queue reason is different and we would still be able to scale up, but most cases would have printed the unsupported message. This resulted in reduced capacity (to be specific, the limit during that time was one slave per type).

We have now fixed our auto scaling to automatically strip these characters and added that message to our test suite.

Best regards,
Marco

On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham wrote:
> Marco, thanks for your hard work on this. I'm super excited about the new
> Jenkins jobs.
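The parsing fix described above - stripping the new quote characters before matching the label - can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual CI code; the regex and function names are illustrative.

```python
import re

# Hypothetical sketch of the queue-reason parsing fix: strip ASCII and
# typographic quotes around the label so both the old and the new Jenkins
# message resolve to the same node label.

QUEUE_RE = re.compile(r"Waiting for next available executor on (.+)$")
QUOTE_CHARS = "'\"`\u2018\u2019\u201c\u201d"  # ', ", `, and curly quotes

def starved_label(queue_reason):
    """Return the node label a queue item is waiting for, or None."""
    match = QUEUE_RE.search(queue_reason)
    if match is None:
        return None
    return match.group(1).strip().strip(QUOTE_CHARS)
```

Both message variants then yield the same label (`mxnetlinux-gpu`), and an unrelated queue reason yields `None`, so the other supported starvation messages are unaffected.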
Re: CI impaired
Marco, thanks for your hard work on this. I'm super excited about the new Jenkins jobs. This is going to be very helpful and improve sanity for our PRs and ourselves!

Cheers,
Aaron

On Wed, Nov 21, 2018, 05:37 Marco de Abreu wrote:
> Hello,
>
> the CI is now back up and running.
Re: CI impaired
Hello,

the CI is now back up and running. Auto scaling is working as expected and it passed our load tests.

Please excuse any inconvenience caused.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu wrote:
> Hello,
>
> I'd like to let you know that our CI was impaired and down for the last
> few hours.
CI impaired
Hello,

I'd like to let you know that our CI was impaired and down for the last few hours. After getting the CI back up, I noticed that our auto scaling broke due to a silent update of Jenkins which broke our upscale detection. Manual scaling is currently not possible, and stopping the scaling won't help either because there are currently no p3 instances available, which means that all jobs will fail nonetheless. In a few hours, the auto scaling will have recycled all slaves through the down-scale mechanism and we will be out of capacity. This will lead to resource starvation and thus timeouts.

Your PRs will be properly registered by Jenkins, but please expect the jobs to time out and thus fail your PRs.

I will fix the auto scaling as soon as I'm awake again.

Sorry for any inconvenience caused.

Best regards,
Marco

P.S. Sorry for the brief email and my lack of further fixes, but it's 5:30 AM now and I've been working for 17 hours.