Hi Markus,

Per our discussion on Slack, I'm documenting below the concerns we discussed. (And thanks for fixing my math bug.)
The approach of being more introspective to detect overload is a good improvement over the ad hoc value set today, and thanks for bringing it up and for the review and discussion on Slack. I do, however, have a concern about tying system overload (and queuing depth) to active acks, because active acks also affect other components. Please allow me to explain so we can see if there's a real concern.

1. The execution of sequences (and conductor actions) waits for an activation's result before dispatching the next action. If you're willing to tolerate a longer active ack, should the composition wait just as long?

2. We have also had issues in the past where improper accounting of a user's outstanding requests, due to delayed or missing active acks, would penalize and throttle a subject.

3. There is a backup mechanism for detecting completed activations from the activation store; longer active-ack timeouts mean longer polls and more load on the database. The need for this mechanism suggests that the health protocol, which uses pings alone, is not sufficient and needs this secondary mechanism. As noted above, the active acks now have a few intertwined dependencies.

4. Since the definition of "overloaded" here is tied to active acks timing out, I also think we would be changing the behavior of the system overall: requests that would have been accepted and queued in the past would be rejected much more eagerly. This makes the issue related to re-architecting the system with the overflow queue, as previously discussed on the dev list, because there are requests for which _waiting_ is acceptable (e.g., batch invocations and triggers) versus blocking requests (e.g., web actions) for which waiting too long is not acceptable.

5. Of course this is related to the capacity in the system and assumes static capacity. Shameless plug for The Serverless Contract: https://medium.com/openwhisk/the-serverless-contract-44329fab10fb.
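To make the first concern concrete, here is a rough back-of-the-envelope sketch (the function names and the timeout shape are my assumptions for illustration, not the actual implementation): if each action's active ack can take up to L * C + epsilon before it is declared missing, then a sequence that waits on each component's active ack before dispatching the next one can, in the worst case, stall for roughly the sum of those timeouts.

```python
# Hypothetical illustration of concern 1: a sequence waits on each
# component's active ack, so a longer ack timeout stretches the whole
# composition. Names are illustrative, not from the OpenWhisk code base.

def ack_timeout(max_duration_sec, fudge_factor, epsilon_sec=60.0):
    """Per-activation active-ack timeout: L * C + epsilon."""
    return max_duration_sec * fudge_factor + epsilon_sec

def worst_case_sequence_wait(component_durations, fudge_factor):
    """Upper bound on a sequence's stall time if every component's
    active ack arrives just before its timeout fires."""
    return sum(ack_timeout(d, fudge_factor) for d in component_durations)

# A sequence of three 60-second actions with C = 2 and a 60s epsilon
# (the "2 * max runtime + 1 minute" default mentioned below):
print(worst_case_sequence_wait([60, 60, 60], 2))  # 540.0 seconds
```

The point is only that C is not a free parameter: raising it to tolerate more queuing also raises how long compositions can hang.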
If you detect overload and add capacity, that's a different discussion (not rejecting requests subject to a maximum elasticity vs. rejecting requests for a fixed capacity).

Say an active ack for an activation i times out after time T(i) = L(i) x C + epsilon, where L(i) is the action's maximum duration for activation i and C is a constant fudge factor (which indirectly bounds the wait time in the queue for this activation). Let an invoker have S slots, all of which are occupied with activations whose maximum duration L(j) >= L(i) for every j in the container pool; that is, all the slots in the assigned pool are busy and the hold time on each slot will be at least L(i). Since the active-ack timeout T(i) is oblivious to the requests ahead of it in the queue, it takes more than C x S requests ahead of activation i for the request to time out (the S slots drain about S requests per interval of length L(i), so a wait of C x L(i) clears about C x S requests). I think without loss of generality we can ignore epsilon (C x S + 1 requests ahead, for example, would cover it), and we can ignore the actual execution time of activation i.

The system is therefore overloaded once it has accepted (K x S) + (K x (S x C + 1)) activations, where K is the number of invokers, S is the number of slots per invoker, and C (>= 0) is the queuing factor, assuming all actions have the same expected hold time.

Some numbers: with K = 1 invoker, S = 16 slots per invoker, and C = 2, the system will overload (and reject requests) after 49 activations are accepted. With K = 10 that becomes 490, and with K = 100 it is 4900.

If we increase C to tolerate more queuing, we indirectly also affect the execution of compositions and the quota accounting. I agree we should, as you suggest, have a mechanism for detecting overload correctly, so this is a better approach given where we are. We should caution that deployments with a disproportionately high overload setting in their configuration will need to be aware of this change.
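For the record, here is the arithmetic above written out as a small sketch (the function names are mine, chosen for illustration; only the formula comes from the derivation):

```python
# Sketch of the overload threshold derived above:
# K * S activations running plus K * (S * C + 1) queued before
# every invoker's active acks start timing out.

def overload_threshold(invokers, slots_per_invoker, queuing_factor):
    """Activations accepted before the system rejects new requests:
    (K x S) + (K x (S x C + 1))."""
    running = invokers * slots_per_invoker
    queued = invokers * (slots_per_invoker * queuing_factor + 1)
    return running + queued

# The examples from the text:
print(overload_threshold(1, 16, 2))    # 49
print(overload_threshold(10, 16, 2))   # 490
print(overload_threshold(100, 16, 2))  # 4900
```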
-r

> On Thu, Jul 12, 2018 at 11:38 AM, Markus Thoemmes
> <[email protected]> wrote:
>
> Hi OpenWhiskers,
>
> Today, we have an arbitrary system-wide limit of maximum concurrent
> connections in the system. In general that is fine, but it doesn't have a
> direct correlation to what's actually happening in the system.
>
> I propose to add a new state to each monitored invoker: Overloaded. An
> invoker will go into overloaded state if active-acks start to time out.
> Eventually, if the system is really overloaded, all invokers will be in
> overloaded state, which will cause the loadbalancer to return a failure.
> This failure now results in a `503 - System overloaded` message back to the
> user. The system-wide concurrency limit would be removed.
>
> The organic system limit will be adjustable by a timeout factor, which is
> made adjustable in https://github.com/apache/incubator-openwhisk/pull/3767.
> The default is 2 * maximumActionRuntime + 1 minute. For the vast majority
> of use-cases, this means that there are 3x more activations in the system
> than it can handle, or put differently: activations need to wait for
> minutes until they are executed. I think it's safe to say that the system
> is overloaded if this is true for all invokers in your system.
>
> Note: We used to handle active-ack timeouts as system errors and take
> invokers into unhealthy state. With the old non-consistent loadbalancer,
> that caused a lot of "flappy" states in the invokers. With the new
> consistent implementation, active-ack timeouts should only occur in
> problematic situations (either the invoker itself is having problems, or
> queueing). Taking the invoker out of the loadbalancer if there are
> active-acks missing on that invoker is generally helpful, because missing
> active-acks also means inconsistent state in the loadbalancer (it updates
> its state as if the active-ack arrived correctly).
>
> A first stab at the implementation can be found here:
> https://github.com/apache/incubator-openwhisk/pull/3875.
>
> Any concerns with that approach to place an upper bound on the system?
>
> Cheers,
> Markus
