[
https://issues.apache.org/jira/browse/AURORA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Lambert updated AURORA-350:
---------------------------------
Sprint: Q2 Sprint 1, Q2 Sprint 2 (was: Q2 Sprint 1)
> Parallelize updates to speed up deploys
> ---------------------------------------
>
> Key: AURORA-350
> URL: https://issues.apache.org/jira/browse/AURORA-350
> Project: Aurora
> Issue Type: Story
> Components: Client
> Reporter: Maxim Khutornenko
> Assignee: Maxim Khutornenko
>
> The way aurora deploy works inherently limits deploy speed.
> Aurora deploy, like cap/TCU, uses the "batch" model. You have 100 things, and you
> loop over them in batches of N at a time. You restart N things all at once, those N
> things come back online all at once (cold), you wait for all of them to
> become available, and repeat.
> Disadvantages:
> - You can proceed no faster than the slowest instance in the batch. If one
> instance is "stuck" or slow, the whole deploy slows down.
> - The speed of your deploy is bounded by your success rate, which is
> bounded by the number of instances currently online but serving below par due
> to warmup (because, computers). The batch methodology maximizes this effect
> because the restarted shards tend to come back online all at the same time.
> Let's say a full cycle of shutdown, reschedule, restart,
> wait-for-online-and-good takes 2 minutes, but the "bad time" is only 15
> seconds. If we do these 8 at a time, we have a period where 8 boxes are bad
> for 15 seconds. That's a big hit to the success rate. What if we were able to do
> 8 of these in parallel such that only one of them is bad at any given moment? It's
> the same speed (all other things being equal) but the impact is much less. We
> could leverage that to make the deploy go even faster.
> It's easy to see that we could speed deploys up by 2x or more by using an
> algorithm which minimizes the number of instances starting at any given time
> but still proceeds quickly in parallel.
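The timing argument above can be checked with a small model. This is an illustrative sketch, not Aurora code: the cycle length (120 s), warm-up window (15 s), and worker count (8) are the example numbers from the description, and `max_concurrent_bad` is a hypothetical helper that counts overlapping warm-up windows.

```python
CYCLE = 120   # seconds per shutdown/reschedule/restart/wait-for-healthy cycle
BAD = 15      # seconds an instance is online but serving below par (warm-up)
WORKERS = 8   # batch size / parallelism

def max_concurrent_bad(start_times):
    """Peak number of instances inside their warm-up window at once."""
    events = []
    for s in start_times:
        events.append((s, 1))         # warm-up begins
        events.append((s + BAD, -1))  # warm-up ends
    peak = cur = 0
    for _, delta in sorted(events):   # ties: an end (-1) sorts before a start (+1)
        cur += delta
        peak = max(peak, cur)
    return peak

# Batch model: 8 instances restart together at the top of each cycle.
batch_starts = [b * CYCLE for b in range(4) for _ in range(WORKERS)]

# Staggered model: same throughput, but one restart dispatched every
# CYCLE / WORKERS = 15 s, so warm-up windows never overlap.
stagger_starts = [i * (CYCLE // WORKERS) for i in range(4 * WORKERS)]

print(max_concurrent_bad(batch_starts))    # → 8 (all 8 warm up together)
print(max_concurrent_bad(stagger_starts))  # → 1 (at most one bad at a time)
```

Same number of restarts per cycle in both cases, but the staggered schedule caps the impact at one below-par instance at any moment.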
> Aurora should be rewritten to use a thread-based deploy model. You have 100
> things and N threads. The main thread dispatches (in a blocking fashion if no
> threads are ready) restart tasks to each thread in a user-set rate-limited
> fashion (e.g. no more than one per 15 seconds) which is defined by your per
> instance warmup time (the time an instance is listening/serving but slow).
> Each thread then restarts one instance, waits for it to come back healthy, and
> reports done/failure/etc. Continue until the list is exhausted.
> This way you have a steady stream of single instances coming online with no
> clumping of restarts, and if any one gets hung up or slow, it doesn't
> significantly impact the speed of the deploy (you can "overprovision" the
> number of threads). You can also retain most of the current deploy semantics
> around failure counts, retry intervals, etc.
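A minimal sketch of the proposed thread-based model, assuming hypothetical names throughout: `restart_instance`, `WARMUP_SECONDS`, and the health-check stand-in are illustrations, not Aurora client APIs.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

WARMUP_SECONDS = 0.01  # per-instance warm-up time; this sets the dispatch rate
NUM_THREADS = 8        # "overprovisioned" so one slow instance can't stall the rest

results = {}
results_lock = threading.Lock()

def restart_instance(instance_id):
    """Placeholder: restart one instance, wait for it to come back healthy."""
    time.sleep(WARMUP_SECONDS)  # stand-in for restart + health-check wait
    with results_lock:
        results[instance_id] = "ok"

def deploy(instances):
    # Main thread dispatches restart tasks to a pool of worker threads,
    # blocking implicitly when all workers are busy.
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        for inst in instances:
            pool.submit(restart_instance, inst)
            # Rate-limit dispatch to one restart per warm-up window, so
            # warm-up windows of different instances don't overlap.
            time.sleep(WARMUP_SECONDS)
    # Exiting the with-block waits for every in-flight restart to finish.

deploy(range(20))
```

The rate limit (one dispatch per warm-up window) is what produces the steady stream of single instances coming online; failure counts and retry intervals would hang off the per-task result reporting.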
--
This message was sent by Atlassian JIRA
(v6.2#6252)