Maxim Khutornenko created AURORA-350:
----------------------------------------
Summary: Parallelize updates to speed up deploys
Key: AURORA-350
URL: https://issues.apache.org/jira/browse/AURORA-350
Project: Aurora
Issue Type: Story
Components: Client
Reporter: Maxim Khutornenko
Assignee: Maxim Khutornenko
The way aurora deploy works inherently limits deploy speed.
Aurora deploy, like cap/TCU, uses the "batch" model. You have 100 things, and
you loop over them in batches of N at a time. You restart N things all at once,
those N things come back online all at once (cold), you wait for all of them to
become available, and repeat.
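A minimal sketch of the batch model described above (restart_instance is a hypothetical stand-in for the real shutdown/reschedule/wait-for-healthy cycle, not an actual Aurora client function):

```python
from concurrent.futures import ThreadPoolExecutor

def restart_instance(instance_id):
    # Stand-in for the real shutdown/reschedule/wait-for-healthy cycle.
    return instance_id

def batch_deploy(instances, batch_size):
    """Restart instances in batches of batch_size, waiting for each
    whole batch to finish before starting the next."""
    completed = []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            # map() does not return until every member of the batch is
            # done, so each batch moves at the pace of its slowest member.
            completed.extend(pool.map(restart_instance, batch))
    return completed
```

Note that the pool is drained between batches: a single slow instance stalls all of its batch-mates.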
Disadvantages:
- You can proceed no faster than the slowest instance in the batch. If one
instance is "stuck" or slow, the whole deploy slows down.
- The speed at which your deploy proceeds is bounded by your success rate,
which is bounded by the number of instances currently online but serving below
par due to warmup (because, computers). The batch methodology maximizes this
effect because the restarted shards tend to come back online all at the same
time.
Let's say a full cycle of shutdown, reschedule, restart,
wait-for-online-and-good takes 2 minutes, but the "bad time" is only 15
seconds. If we do these 8 at a time, we have a period where 8 boxes are bad for
15 seconds. That's a big hit to the success rate. What if we were able to do 8
of these in parallel, staggered such that only one of them is bad at any given
moment? It's the same speed (all other things being equal) but the impact is
much less. We could leverage that to make the deploy go even faster.
It's easy to see that we could speed deploys up by 2x or more by using an
algorithm which minimizes the number of instances starting at any given time
but still proceeds quickly in parallel.
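A back-of-the-envelope check of the numbers above (using the example's 2-minute cycle and 15-second warmup window; these are illustrative figures, not measured Aurora behavior):

```python
CYCLE = 120   # seconds for shutdown/reschedule/restart/wait-for-healthy
WARMUP = 15   # seconds an instance is online but serving below par
WIDTH = 8     # batch size / number of parallel restarts

# Batch model: all 8 restart together, so all 8 warm up together.
batch_worst_concurrent_bad = WIDTH

# Staggered model: dispatch one restart every WARMUP seconds, so each
# instance exits its warmup window before the next one enters it.
stagger_worst_concurrent_bad = 1

# Throughput is identical either way: 8 instances per 120 s.
batch_rate = WIDTH / CYCLE    # instances per second
stagger_rate = 1 / WARMUP     # one dispatch per 15 s = 8 per 120 s
```

Same throughput, one-eighth the worst-case number of degraded instances, which is the headroom that could be spent on going faster.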
Aurora should be rewritten to use a thread-based deploy model. You have 100
things and N threads. The main thread dispatches restart tasks to the threads
(blocking if no thread is ready) at a user-set rate limit (e.g. no more than
one per 15 seconds), defined by your per-instance warmup time (the time an
instance is listening/serving but slow). Each thread then restarts one
instance, waits for it to come back healthy, and reports done/failure/etc.
Continue until the list is exhausted.
This way you have a steady stream of single instances coming online with no
clumping of restarts, and if any one gets hung up or slow, it doesn't
significantly impact the speed of the deploy (you can "overprovision" the
number of threads). You can also retain most of the current deploy semantics
around failure counts, retry intervals, etc.
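The thread-based model above could be sketched roughly as follows (a sketch only: restart_fn, num_threads, and min_interval are hypothetical names standing in for the real restart logic and the user-set knobs, not existing Aurora client API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def rate_limited_deploy(instances, restart_fn, num_threads=8,
                        min_interval=15.0):
    """Dispatch one restart at most every min_interval seconds (the
    per-instance warmup time) to a pool of num_threads workers. Each
    worker restarts one instance, waits for it to come back healthy
    (inside restart_fn), and reports its result."""
    results = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = []
        last_dispatch = float("-inf")
        for instance in instances:
            # Rate limiter: space dispatches by min_interval so that
            # restarts never clump, regardless of how fast threads free up.
            wait = last_dispatch + min_interval - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            last_dispatch = time.monotonic()
            futures.append(pool.submit(restart_fn, instance))
        # Collect per-instance done/failure results in dispatch order.
        for future in futures:
            results.append(future.result())
    return results
```

Overprovisioning num_threads relative to CYCLE / min_interval means one hung instance occupies a thread without stalling the stream of dispatches, and per-instance failure counts and retry intervals can live inside restart_fn much as they do today.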
--
This message was sent by Atlassian JIRA
(v6.2#6252)