[ 
https://issues.apache.org/jira/browse/AURORA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Lambert updated AURORA-350:
---------------------------------

    Sprint: Sprint 1

> Parallelize updates to speed up deploys
> ---------------------------------------
>
>                 Key: AURORA-350
>                 URL: https://issues.apache.org/jira/browse/AURORA-350
>             Project: Aurora
>          Issue Type: Story
>          Components: Client
>            Reporter: Maxim Khutornenko
>            Assignee: Maxim Khutornenko
>
> The way aurora deploy works inherently contributes to depressed deploy speeds.
> Aurora deploy, like cap/TCU, uses the "batch" model. You have 100 things, you 
> loop over them in batches of N at a time. You restart N things all at once, 
> those N things come back online all at once (cold), you wait for all of them 
> to become available, and repeat.
> Disadvantages:
> - You can proceed no faster than the slowest guy in the batch. If one 
> instance is "stuck" or slow, the whole deploy slows down.
> - The speed of your deploy is bounded by your success rate, which is 
> bounded by the number of instances currently online but serving below par due 
> to warmup (because, computers). The batch methodology maximizes this effect 
> because the restarted shards tend to come back online all at the same time.
> Let's say a full cycle of shutdown, reschedule, restart, 
> wait-for-online-and-good takes 2 minutes, but the "bad time" is only 15 
> seconds. If we do these 8 at a time, we have a period where 8 boxes are bad 
> for 15 seconds. That's a big success rate dip. What if we were able to do 8 
> of these in parallel, staggered such that only one of them is bad at any 
> given moment? It's the same speed (all other things being equal) but the 
> impact is much less. We could leverage that to make the deploy go even faster.
> It's easy to see that we could speed deploys up by 2x or more by using an 
> algorithm which minimizes the number of instances starting at any given time 
> but still proceeds quickly in parallel.
> Aurora should be rewritten to use a thread-based deploy model. You have 100 
> things and N threads. The main thread dispatches (in a blocking fashion if no 
> threads are ready) restart tasks to each thread in a user-set rate-limited 
> fashion (e.g. no more than one per 15 seconds) which is defined by your per 
> instance warmup time (the time an instance is listening/serving but slow). 
> Each thread then restarts one instance, waits for it to come back healthy, and 
> reports done/failure/etc. Continue until the list is exhausted.
> This way you have a steady stream of single instances coming online with no 
> clumping of restarts, and if any one gets hung up or slow, it doesn't 
> significantly impact the speed of the deploy (you can "overprovision" the 
> number of threads). You can also retain most of the current deploy semantics 
> around failure counts, retry intervals, etc.
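
The thread-based model described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Aurora client's actual code: rolling_restart, restart_fn, num_threads, min_interval_secs and max_failures are hypothetical names, and restart_fn stands in for whatever restarts one instance and blocks until it is healthy (returning True on success).

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def rolling_restart(instances, restart_fn, num_threads, min_interval_secs,
                    max_failures=0, clock=time.monotonic, sleep=time.sleep):
    """Dispatch restarts to a pool of workers, starting at most one restart
    every min_interval_secs. Returns (succeeded, failed) lists.

    restart_fn is a hypothetical callback: it restarts one instance, waits
    for it to become healthy, and returns True on success.
    """
    succeeded, failed = [], []
    lock = threading.Lock()
    slots = threading.Semaphore(num_threads)  # blocks dispatch when no worker is free
    abort = threading.Event()                 # set once too many restarts have failed

    def worker(instance):
        try:
            try:
                ok = bool(restart_fn(instance))
            except Exception:
                ok = False
            with lock:
                (succeeded if ok else failed).append(instance)
                if len(failed) > max_failures:
                    abort.set()
        finally:
            slots.release()  # hand the slot back only after recording the result

    next_start = clock()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for instance in instances:
            slots.acquire()               # wait for a free worker
            delay = next_start - clock()
            if delay > 0:
                sleep(delay)              # rate limit: space out restart starts
            if abort.is_set():
                slots.release()
                break
            next_start = clock() + min_interval_secs
            pool.submit(worker, instance)
    # Leaving the "with" block waits for in-flight restarts to finish.
    return succeeded, failed
```

With min_interval_secs set to the per-instance warmup time (15 seconds in the example above) and num_threads set somewhat higher than cycle time divided by warmup time, a slow instance only ties up one worker while the rest of the deploy keeps moving.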



--
This message was sent by Atlassian JIRA
(v6.2#6252)
