Hi OpenWhiskers,

Today, we have an arbitrary system-wide limit on the maximum number of
concurrent connections in the system. In general that is fine, but the limit
has no direct correlation to what's actually happening in the system.

I propose adding a new state to each monitored invoker: Overloaded. An invoker
will go into the overloaded state if its active-acks start to time out.
Eventually, if the system is really overloaded, all invokers will be in the
overloaded state, which will cause the loadbalancer to return a failure. With
this change, that failure results in a `503 - System overloaded` message back
to the user. The system-wide concurrency limit would be removed.
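
To make the proposed behaviour concrete, here is a minimal sketch in Python. It
is not the code from the PR: the names `InvokerState`, `SystemOverloaded`,
`on_active_ack_timeout` and `choose_invoker` are made up for illustration, only
two states are modelled, and I assume a single active-ack timeout is enough to
flip an invoker to overloaded (the real threshold may differ).

```python
from enum import Enum


class InvokerState(Enum):
    # Only the two states relevant to this proposal; other states are omitted.
    HEALTHY = "healthy"
    OVERLOADED = "overloaded"  # proposed: active-acks are timing out


class SystemOverloaded(Exception):
    """Hypothetical failure the loadbalancer returns; mapped to HTTP 503."""
    status_code = 503


def on_active_ack_timeout(states, index):
    # Assumption: one timed-out active-ack marks the invoker as overloaded.
    states[index] = InvokerState.OVERLOADED


def choose_invoker(states):
    """Return the index of the first healthy invoker.

    Raises SystemOverloaded when every invoker is overloaded, which replaces
    the fixed system-wide concurrency limit as the upper bound.
    """
    for index, state in enumerate(states):
        if state is InvokerState.HEALTHY:
            return index
    raise SystemOverloaded("503 - System overloaded")
```

With two invokers, scheduling keeps working while at least one of them is
healthy; once both have seen an active-ack timeout, `choose_invoker` raises and
the user gets the 503.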

The organic system limit will be adjustable through a timeout factor, which is
made configurable in https://github.com/apache/incubator-openwhisk/pull/3767.
The default timeout is 2 * maximumActionRuntime + 1 minute. For the vast
majority of use-cases, hitting that timeout means that there are 3x more
activations in the system than it can handle or, put differently, activations
need to wait for minutes until they are executed. I think it's safe to say that
the system is overloaded if this is true for all invokers in your system.
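
As a worked example of the default formula, here is a small Python sketch. The
function name `active_ack_timeout` and its parameters are hypothetical, I am
assuming the adjustable factor is the multiplier 2 in the formula, and the
5-minute maximum action runtime is just an example value.

```python
from datetime import timedelta


def active_ack_timeout(max_action_runtime, factor=2, slack=timedelta(minutes=1)):
    """Timeout after which an active-ack counts as missing.

    Default follows the formula above: 2 * maximumActionRuntime + 1 minute.
    `factor` is assumed to be the adjustable timeout factor.
    """
    return factor * max_action_runtime + slack


# Example value only: a maximum action runtime of 5 minutes.
default_timeout = active_ack_timeout(timedelta(minutes=5))  # 11 minutes
tighter_timeout = active_ack_timeout(timedelta(minutes=5), factor=1)  # 6 minutes
```

A smaller factor makes invokers go into the overloaded state sooner, so the
system starts returning 503s at a lower level of queueing.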

Note: We used to handle active-ack timeouts as system errors and take invokers
into the unhealthy state. With the old non-consistent loadbalancer, that caused
a lot of "flappy" states in the invokers. With the new consistent
implementation, active-ack timeouts should only occur in problematic situations
(either the invoker itself is having problems, or activations are queueing).
Taking an invoker out of the loadbalancer when active-acks are missing on it is
generally helpful, because missing active-acks also mean inconsistent state in
the loadbalancer (it updates its state as if the active-ack had arrived
correctly).

A first stab at the implementation can be found here: 
https://github.com/apache/incubator-openwhisk/pull/3875.

Any concerns with this approach to placing an upper bound on the system?

Cheers,
Markus
