Maxim Khutornenko created AURORA-350:
----------------------------------------

             Summary: Parallelize updates to speed up deploys
                 Key: AURORA-350
                 URL: https://issues.apache.org/jira/browse/AURORA-350
             Project: Aurora
          Issue Type: Story
          Components: Client
            Reporter: Maxim Khutornenko
            Assignee: Maxim Khutornenko


The way Aurora's deploy works inherently slows deploys down.

Aurora deploy, like cap/TCU, uses the "batch" model: you have 100 things and you 
loop over them in batches of N at a time. You restart N things all at once, those 
N things come back online all at once (cold), you wait for all of them to become 
available, and repeat.
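A minimal sketch of that batch model (the function names here are illustrative stand-ins, not the actual Aurora client API):

```python
def batch_restart(instances, batch_size, restart, wait_healthy):
    """Restart instances batch_size at a time, waiting for the entire
    batch to come back healthy before starting the next one."""
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        for inst in batch:
            restart(inst)       # all N go down (and come back cold) together
        for inst in batch:
            wait_healthy(inst)  # gated on the slowest instance in the batch
```

Note how the second loop makes every batch as slow as its slowest member.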

Disadvantages:
- You can proceed no faster than the slowest instance in the batch. If one 
instance is "stuck" or slow, the whole deploy slows down.
- The speed at which your deploy can proceed is bounded by your success rate, 
which is in turn bounded by the number of instances currently online but serving 
below par due to warmup (because, computers). The batch model maximizes this 
effect because the restarted shards tend to come back online all at the same 
time.

Let's say a full cycle of shutdown, reschedule, restart, 
wait-for-online-and-good takes 2 minutes, but the "bad time" is only 15 
seconds. If we do these 8 at a time, we have a period where 8 boxes are bad for 
15 seconds. That's a big hit to the success rate. What if we were able to do 8 
of these in parallel, staggered, such that only one of them is bad at any given 
moment? It's the same speed (all other things being equal) but the impact is 
much smaller. We could leverage that to make the deploy go even faster.
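To make the arithmetic above concrete, here's a small illustration (numbers taken from the example; this is a sketch, not Aurora code) that counts how many instances sit inside their warmup window at once under each scheme:

```python
def max_concurrent_bad(start_times, bad_s):
    """Peak number of instances that are inside their warmup ("bad")
    window at the same moment, given each restart's start time."""
    events = []
    for t in start_times:
        events.append((t, 1))            # instance enters its bad window
        events.append((t + bad_s, -1))   # instance becomes fully healthy
    events.sort()                        # ends sort before starts at a tie
    cur = peak = 0
    for _, delta in events:
        cur += delta
        peak = max(peak, cur)
    return peak

BAD_S = 15
# Batch model: 8 instances come back cold at the same instant.
print(max_concurrent_bad([0] * 8, BAD_S))                        # -> 8
# Staggered: one restart every 15 s; at most one is bad at a time.
print(max_concurrent_bad([i * BAD_S for i in range(8)], BAD_S))  # -> 1
```

Same 8-wide parallelism either way; only the overlap of the bad windows differs.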

It's easy to see that we could speed deploys up by 2x or more by using an 
algorithm which minimizes the number of instances starting at any given time 
but still proceeds quickly in parallel.

The Aurora client should be rewritten to use a thread-based deploy model: you 
have 100 things and N threads. The main thread dispatches restart tasks to the 
worker threads (blocking if no thread is ready) at a user-set rate limit (e.g. 
no more than one per 15 seconds), where the rate is chosen from your 
per-instance warmup time (the time an instance is listening/serving but slow). 
Each thread then restarts one instance, waits for it to come back healthy, and 
reports done/failure/etc. Continue until the list is exhausted.
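A sketch of that model under stated assumptions (restart and wait_healthy are hypothetical stand-ins for the client's restart and health-check calls):

```python
import threading
import time

def threaded_restart(instances, num_threads, min_interval,
                     restart, wait_healthy):
    """Dispatch one restart at most every min_interval seconds to a pool
    of num_threads workers; each worker restarts a single instance,
    waits for it to come back healthy, and records the result."""
    slots = threading.Semaphore(num_threads)
    lock = threading.Lock()
    results = {}
    threads = []

    def worker(inst):
        try:
            restart(inst)
            ok = wait_healthy(inst)
            with lock:
                results[inst] = ok   # done/failure per instance
        finally:
            slots.release()

    for inst in instances:
        slots.acquire()              # block if no thread is ready
        t = threading.Thread(target=worker, args=(inst,))
        t.start()
        threads.append(t)
        time.sleep(min_interval)     # no more than one dispatch per interval

    for t in threads:
        t.join()
    return results
```

Because a slow instance only ties up its own worker, the dispatcher keeps streaming restarts to the other threads at the configured rate.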

This way you have a steady stream of single instances coming online with no 
clumping of restarts, and if any one gets hung up or slow, it doesn't 
significantly impact the speed of the deploy (you can "overprovision" the 
number of threads). You can also retain most of the current deploy semantics 
around failure counts, retry intervals, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
