Recently, we observed a considerable number of forced completion acks on 
our systems. While forced completion acks make sense in some scenarios, 
they cause trouble in ours (see below for details). As a band-aid, we 
want to make the forced completion ack timeout more configurable.

The completion ack timeout is the time within which a completion ack must 
be received for an activation. It is calculated based on the action time 
limit. The current formula is:

  (actionTimeLimit.max(TimeLimit.STD_DURATION) * lbConfig.timeoutFactor) + 1.minute

(for implementation details please follow the link under [1])

The default timeout factor is 2, which is based on the invoker behavior 
that a cold invocation's init duration may be as long as its run 
duration. Based on this formula, the calculated completion ack timeout 
for an action with a timeout limit of 60 seconds is 180 seconds.
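
To illustrate the calculation, here is a minimal standalone sketch of the 
formula. It is not the actual CommonLoadBalancer code: the object name and 
the function name are made up for this example, and it assumes that 
TimeLimit.STD_DURATION is 1 minute, which is consistent with the 180 
second result above.

```scala
import scala.concurrent.duration._

object CompletionAckTimeoutSketch {
  // Assumption: stands in for TimeLimit.STD_DURATION, taken to be 1 minute here.
  val StdDuration: FiniteDuration = 1.minute

  // Default timeout factor as described in this mail.
  val DefaultTimeoutFactor: Int = 2

  // Hypothetical helper mirroring the formula:
  // (actionTimeLimit.max(STD_DURATION) * timeoutFactor) + 1.minute
  def completionAckTimeout(actionTimeLimit: FiniteDuration,
                           timeoutFactor: Int = DefaultTimeoutFactor): FiniteDuration =
    (actionTimeLimit.max(StdDuration) * timeoutFactor.toLong) + 1.minute
}
```

With these assumptions, an action time limit of 60 seconds yields 
(60s * 2) + 60s = 180s. Limits below the assumed standard duration are 
raised to it first, so a 10 second limit also yields 180s.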

The motivation behind the completion ack timeout, and behind discarding 
activations that do not complete within that time, is to not wait 
"forever" for activations that get lost. This can happen if the message 
feed has already read and committed activations from the Kafka topic, but 
their processing is still in flight when the invoker is restarted for 
whatever reason.

While invoker restarts will remain the exception, we often see image 
pulls for cold black box invocations take a long time and exceed the 
calculated completion ack timeout for these invocations in our 
environment. By discarding activations that are still being processed by 
an invoker, the controller's bookkeeping is invalidated step by step: the 
controller assumes that each discarded invocation frees up one invoker 
slot, while in fact it does not. As a consequence, the controller makes 
wrong decisions. Even worse, its out-of-sync bookkeeping does not repair 
itself but remains in this state as long as the workload remains high. 
Activations have to wait for processing on the chosen invoker because no 
free slots are available, and hence will potentially exceed their 
completion ack timeout and in the end be discarded by the controller.
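
The drift described above can be pictured with a deliberately simplified 
toy model. This is not how the controller or invoker is implemented; all 
names (State, schedule, forcedAck, regularAck, drift) are hypothetical and 
only track how many slots each side believes are busy.

```scala
object BookkeepingDriftSketch {
  // Toy model: slots the controller believes are busy vs. slots the
  // invoker is actually using. Both names are illustrative only.
  final case class State(controllerBusy: Int, invokerBusy: Int)

  // Scheduling an activation occupies a slot in both views.
  def schedule(s: State): State =
    State(s.controllerBusy + 1, s.invokerBusy + 1)

  // A regular completion ack frees the slot in both views.
  def regularAck(s: State): State =
    State(s.controllerBusy - 1, s.invokerBusy - 1)

  // A forced completion ack frees the slot only in the controller's view;
  // in this model the invoker is still processing the activation.
  def forcedAck(s: State): State =
    s.copy(controllerBusy = s.controllerBusy - 1)

  // Number of slots the controller wrongly considers free.
  def drift(s: State): Int = s.invokerBusy - s.controllerBusy
}
```

In this model, every forced ack for an activation that is still running 
increases the drift by one, and regular acks do not reduce it.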

To make a long story short, we would like to make the constant duration 
of 1 minute configurable. By increasing this duration to an appropriate 
value, and with it the calculated completion ack timeout, we think we can 
avoid the forced completion of activations in our system in many of the 
situations we observed in the past.
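
As a rough sketch of what we have in mind (not a patch): the config class 
LbTimeoutConfig and its field timeoutAddon are hypothetical names chosen 
for this example, and the standard duration is again assumed to be 1 
minute. The default keeps today's behavior.

```scala
import scala.concurrent.duration._

object ConfigurableAckTimeoutSketch {
  // Assumption: stands in for TimeLimit.STD_DURATION, taken to be 1 minute here.
  val StdDuration: FiniteDuration = 1.minute

  // Hypothetical config holder. `timeoutAddon` replaces the hard-coded
  // 1.minute; its default preserves the current result.
  final case class LbTimeoutConfig(timeoutFactor: Int = 2,
                                   timeoutAddon: FiniteDuration = 1.minute)

  // Same formula as today, with the constant taken from configuration.
  def completionAckTimeout(actionTimeLimit: FiniteDuration,
                           config: LbTimeoutConfig): FiniteDuration =
    (actionTimeLimit.max(StdDuration) * config.timeoutFactor.toLong) + config.timeoutAddon
}
```

For a 60 second action limit, the default config still gives 180 seconds, 
while an addon of 5 minutes would give 7 minutes.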

Please let me know what you think.


[1] 
https://github.com/apache/openwhisk/blob/81ac503f7efc8ee99ea1a37ef9ec3d6163d96c85/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/CommonLoadBalancer.scala#L86-L104


Kind regards
Steffen Rost
------------------------------------------------------------------------------------------------------------------------------------------
IBM Cloud Functions Development
Phone +49-7031-16-4841 (Fax: -3545)
E-Mail: sr...@de.ibm.com
------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Matthias Hartmann -- Management: 
Dirk Wittkopp
Registered office: Böblingen / Registration court: Amtsgericht Stuttgart, 
HRB 243294
