Hi folks,

we (Christian Bickel and I) just opened a pull-request for comments on 
introducing a notion of sharding instead of sharing the state between our 
controllers (loadbalancers). It also addresses a few deficiencies of the old 
loadbalancer to remove any kinds of bottlenecks there and make it as fast as 
possible.

Commit message for posterity:

The current ContainerPoolBalancer suffers a couple of problems and bottlenecks:

1. Inconsistent state: The data-structures keeping the state for that 
loadbalancer are not thread-safely handled, meaning there can be queuing to 
some invokers even though there is free capacity on other invokers.
2. Asynchronously shared state: Sharing the state is needed for a 
high-available deployment of multiple controllers and for horizontal scale in 
those. Said state-sharing makes point 1 even worse and isn't anywhere fast 
enough to be able to efficiently schedule quick bursts.
3. Bottlenecks: Getting the state from the outside (like for the 
ActivationThrottle) is a very costly operation (at least in the shared state 
case) and actually bottlenecks the whole invocation path. Getting the current 
state of the invokers is a second bottleneck, where one request is made to the 
corresponding actor for each invocation.
This new implementation aims to solve the problems mentioned above as follows:

1. All state is local: There is no shared state. Resources are managed through 
horizontal sharding. Horizontal sharding means: The invokers' slots are evenly 
divided between the loadbalancers in existence. If we deploy 2 loadbalancers 
and each invoker has 16 slots, each of the loadbalancers will have access to 8 
slots on each invoker.
2. Slots are given away atomically: When scheduling an activation, the slot is 
immediately assigned to that activation (implemented through Semaphores). That 
means: Even in concurrent schedules, there will not be an overload on an 
invoker as long as there is capacity left on that invoker.
3. Asynchronous updates of slow data: Slowly changing data, like a change in 
the invoker's state, is asynchronously handled and updated to a local version 
of the state. Querying the state is as cheap as it can be.

A few words on the implementation details:

We chose to use horizontal sharding (evenly dividing the capacity of each 
invoker) vs. vertical sharding (evenly dividing the invokers as a whole) for 
the sake of staging these changes mainly. Once we divide vertically, we'll need 
a good loadbalancing strategy in front of our controllers themselves, to keep 
unnecessary cold-starts to a minimum and maximize container reuse. By dividing 
horizontally, we maintain the same reuse policies as today and can even keep 
the same loadbalancing strategies intact. Horizontal sharding of course only 
scales so far (maybe to 4 controllers, assuming 16 slots on each invoker) but 
it will give us time to figure out good strategies for vertical sharding and 
learn along the way. For vertical sharding to work, it will also be crucial to 
have the centralized overflow queue to be able to offload work between shards 
through workstealing. All in all: Vertical sharding is a much bigger change 
than horizontal sharding.

We tried to implement everything in a single actor first, but that seemed to 
impose a bottleneck again. Note that this is very frequented code, it needs to 
be very tight. That might not match the actor model too well.

Subsuming everything: This keeps all proposed changes intact (most notably 
Tyson's parallel executions and overloading queue).

A note on the gains made by this: Our non-blocking invoke performance is now 
quite close to the raw Kafka produce performance that we have in the system 
(not that it's good in itself, but that's the next theoretical bottleneck). 
Before the changes, this was roughly bottlenecked to 4k requests per second on 
a sufficiently powerful machine. Blocking invoke performance was roughly 
doubled.

Any comments, thoughts?

Cheers,
Markus

Reply via email to