Thanks a lot! The following numbers are based on our experience in the test environment.

Best case: ~1:50h (unchanged) (0:01 + 0:38 + 0:39 + 0:33 + 0:03)
- conditions: no instances have to be provisioned and caches are primed

Average case: 2:10h (1:50h + 0:10 for instance startup + 0:10 for cache loading)
- conditions: Windows instances are available (they get turned off less frequently), Ubuntu instances have to be provisioned and the cache is not present

Worst case: 3:06h (1:50h + 0:02 + 0:50 + 0:20 + 0:02 + 0:02)
- conditions: no available instances

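For anyone who wants to re-check the totals above, here is a small Python sketch. The helper names (to_minutes, fmt, total) are made up for illustration and are not part of our CI tooling; the durations are copied from the figures above. Note that the individual best-case stages add up to 1:54, which the "~1:50h" figure rounds.

```python
def to_minutes(h_mm: str) -> int:
    """Convert an 'H:MM' duration string into whole minutes."""
    hours, minutes = h_mm.split(":")
    return int(hours) * 60 + int(minutes)


def fmt(total_minutes: int) -> str:
    """Format whole minutes back into 'H:MM'."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


def total(parts) -> int:
    """Sum a list of 'H:MM' durations, in minutes."""
    return sum(to_minutes(p) for p in parts)


BASE = "1:50"  # rounded best-case duration quoted in the mail

# Per-stage durations as listed for the best case.
best_stages = ["0:01", "0:38", "0:39", "0:33", "0:03"]
# Base + instance startup + cache loading.
average_case = [BASE, "0:10", "0:10"]
# Base + the five overheads listed for the worst case.
worst_case = [BASE, "0:02", "0:50", "0:20", "0:02", "0:02"]

print("best (sum of stages):", fmt(total(best_stages)))
print("average:", fmt(total(average_case)))
print("worst:", fmt(total(worst_case)))
```
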
The bottleneck in the worst case is the Windows instances. They take about 20 minutes to start, and the unprimed MSVC cache adds about 30 minutes to build times. To balance this out, we're running a less aggressive downscaling policy for Windows and using larger buffers. At the same time, we're working on persistent build caches. Reserved instances could be an additional option.

We will observe the situation and then assess the required next steps. For now, we want to make sure everything is running stably and no builds are getting interrupted.

Best regards,
Marco

On Wed, May 16, 2018 at 3:47 AM, Thomas DELTEIL <[email protected]> wrote:

> Great news :) thanks Marco!
>
> On Tue, May 15, 2018, 18:29 Steffen Rochel <[email protected]> wrote:
>
> > Thanks Marco, good step forward.
> > What is the expected, typical and worst case TAT time for PR checks?
> >
> > Steffen
> >
> > On Tue, May 15, 2018 at 10:49 AM Marco de Abreu <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I'd like to announce the deployment of auto scaling for our CI system
> > > (see [1] for reference, setup documentation at [2]) for today at
> > > 11:00PM PST 05/15/18. I expect no downtime since these changes are
> > > happening outside of Jenkins.
> > >
> > > This system will increase the flexibility of our system to be able to
> > > handle the increasing load, being a result of the growing number of
> > > great contributions! In future, our CI will automatically adapt to the
> > > current load and will thus support big tasks like the to-be-migrated
> > > nightly tests or increased load before a release. Additionally, we're
> > > now able to provide scalable p3.2xlarge instances and have the
> > > possibility to add new instance types without much effort.
> > >
> > > In future, you will see that new slaves are being started up as the
> > > queue grows and stopped if they go into idle. Your tasks will
> > > automatically be picked up and our system makes sure every PR gets
> > > processed as fast as possible.
> > >
> > > If you encounter any issues in the next week, please don't hesitate to
> > > reach out to me. I'm looking forward to everyone's feedback!
> > >
> > > Best regards,
> > > Marco
> > >
> > > [1]: https://cwiki.apache.org/confluence/display/MXNET/Proposal%3A+Auto+Scaling
> > > [2]: https://cwiki.apache.org/confluence/display/MXNET/Setup
