One related question on this is: how is throttling handled? 

Looking at the code it looks like each controller instance will maintain its 
own stats (totalActivations, activationsPerNamespace) independently. 
Am I missing it?

Thanks
Tyson

> On Feb 1, 2018, at 7:17 AM, Stephen Fink <[email protected]> wrote:
> 
> Perfect.  Thanks.
> 
> SJF
> 
>> On Feb 1, 2018, at 10:12 AM, Markus Thoemmes <[email protected]> 
>> wrote:
>> 
>> Hi Steve,
>> 
>> fair point and that's exactly what we're trying to exploit here: While the 
>> loadbalancer only sees 8 slots for example on that invoker, the invoker's 
>> logic remains unchanged. It always sees the full-picture of its state and 
>> doesn't care where the load is coming from. So OpenWhisk would (given the 
>> load is slow enough) only provision one container for this action. Both 
>> loadbalancers would operate using the same hashes and and stepSizes and thus 
>> choose the very same invoker (given no other load in the system).
>> 
>> Cheers.                       
>> 
>> -----Stephen Fink <[email protected]> wrote: -----
>> To: [email protected]
>> From: Stephen Fink <[email protected]>
>> Date: 02/01/2018 03:57PM
>> Subject: Re: Introducing sharding as an alternative for state sharing
>> 
>> Hi Markus,
>> 
>> Suppose I have a client which fires one action repeatedly in a loop.   So 
>> the server handles one invocation of this action at a time.
>> 
>> Under the current system,  OW will provision one container for this action, 
>> and it will get reused for every invocation.
>> 
>> Under horizontal sharding with factor N,  it seems OW will provision N 
>> containers — which is sub-optimal for several reasons (cold starts, memory 
>> footprint, consumption of stem cells).
>> 
>> Do you agree?
>> 
>> If so — I wonder if there’s a hybrid strategy — suppose the load balancer 
>> operates via horizontal sharding, but the invoke-side logic remains 
>> unchanged.   Then an invoker can continue to use all slots available to 
>> maximize container reuse, while the load balancer avoids shared state.
>> 
>> Make sense?
>> 
>> SJF
>> 
>> 
>>> On Feb 1, 2018, at 9:25 AM, Markus Thoemmes <[email protected]> 
>>> wrote:
>>> 
>>> Hi folks,
>>> 
>>> we (Christian Bickel and I) just opened a pull-request for comments on 
>>> introducing a notion of sharding instead of sharing the state between our 
>>> controllers (loadbalancers). It also addresses a few deficiencies of the 
>>> old loadbalancer to remove any kinds of bottlenecks there and make it as 
>>> fast as possible.
>>> 
>>> Commit message for posterity:
>>> 
>>> The current ContainerPoolBalancer suffers a couple of problems and 
>>> bottlenecks:
>>> 
>>> 1. Inconsistent state: The data-structures keeping the state for that 
>>> loadbalancer are not thread-safely handled, meaning there can be queuing to 
>>> some invokers even though there is free capacity on other invokers.
>>> 2. Asynchronously shared state: Sharing the state is needed for a 
>>> high-available deployment of multiple controllers and for horizontal scale 
>>> in those. Said state-sharing makes point 1 even worse and isn't anywhere 
>>> fast enough to be able to efficiently schedule quick bursts.
>>> 3. Bottlenecks: Getting the state from the outside (like for the 
>>> ActivationThrottle) is a very costly operation (at least in the shared 
>>> state case) and actually bottlenecks the whole invocation path. Getting the 
>>> current state of the invokers is a second bottleneck, where one request is 
>>> made to the corresponding actor for each invocation.
>>> This new implementation aims to solve the problems mentioned above as 
>>> follows:
>>> 
>>> 1. All state is local: There is no shared state. Resources are managed 
>>> through horizontal sharding. Horizontal sharding means: The invokers' slots 
>>> are evenly divided between the loadbalancers in existence. If we deploy 2 
>>> loadbalancers and each invoker has 16 slots, each of the loadbalancers will 
>>> have access to 8 slots on each invoker.
>>> 2. Slots are given away atomically: When scheduling an activation, the slot 
>>> is immediately assigned to that activation (implemented through 
>>> Semaphores). That means: Even in concurrent schedules, there will not be an 
>>> overload on an invoker as long as there is capacity left on that invoker.
>>> 3. Asynchronous updates of slow data: Slowly changing data, like a change 
>>> in the invoker's state, is asynchronously handled and updated to a local 
>>> version of the state. Querying the state is as cheap as it can be.
>>> 
>>> A few words on the implementation details:
>>> 
>>> We chose to use horizontal sharding (evenly dividing the capacity of each 
>>> invoker) vs. vertical sharding (evenly dividing the invokers as a whole) 
>>> for the sake of staging these changes mainly. Once we divide vertically, 
>>> we'll need a good loadbalancing strategy in front of our controllers 
>>> themselves, to keep unnecessary cold-starts to a minimum and maximize 
>>> container reuse. By dividing horizontally, we maintain the same reuse 
>>> policies as today and can even keep the same loadbalancing strategies 
>>> intact. Horizontal sharding of course only scales so far (maybe to 4 
>>> controllers, assuming 16 slots on each invoker) but it will give us time to 
>>> figure out good strategies for vertical sharding and learn along the way. 
>>> For vertical sharding to work, it will also be crucial to have the 
>>> centralized overflow queue to be able to offload work between shards 
>>> through workstealing. All in all: Vertical sharding is a much bigger change 
>>> than horizontal sharding.
>>> 
>>> We tried to implement everything in a single actor first, but that seemed 
>>> to impose a bottleneck again. Note that this is very frequented code, it 
>>> needs to be very tight. That might not match the actor model too well.
>>> 
>>> Subsuming everything: This keeps all proposed changes intact (most notably 
>>> Tyson's parallel executions and overloading queue).
>>> 
>>> A note on the gains made by this: Our non-blocking invoke performance is 
>>> now quite close to the raw Kafka produce performance that we have in the 
>>> system (not that it's good in itself, but that's the next theoretical 
>>> bottleneck). Before the changes, this was roughly bottlenecked to 4k 
>>> requests per second on a sufficiently powerful machine. Blocking invoke 
>>> performance was roughly doubled.
>>> 
>>> Any comments, thoughts?
>>> 
>>> Cheers,
>>> Markus
>>> 
>> 
>> 
> 

Reply via email to