Recently, we observed a considerable number of forced completion acks on our systems. While forced completion acks make sense in some scenarios, they cause trouble in ours; please see below for details. As a band-aid, we would like to make the forced completion ack timeout more configurable.
The completion ack timeout is the time within which a completion ack must be received for an activation. It is calculated from the action time limit. The current formula is (for implementation details, please follow the link under [1]):
    (actionTimeLimit.max(TimeLimit.STD_DURATION) * lbConfig.timeoutFactor) + 1.minute
The default timeout factor is 2, which is based on the invoker behavior that a cold invocation's init duration may be as long as its run duration. With this formula, the calculated completion ack timeout for an action with a time limit of 60 seconds is 180 seconds.
The motivation behind the completion ack timeout, and behind discarding activations that do not complete within that time, is to avoid waiting "forever" for activations that get lost. This can happen if activations were already read and committed from the Kafka topic by the message feed, but their processing is still in flight when the invoker is restarted for whatever reason.
While invoker restarts remain the exception, in our environment image pulls for cold black box invocations often take a long time and exceed the calculated completion ack timeout for these invocations. When the controller discards activations that are still being processed by an invoker, its bookkeeping is invalidated step by step: for each discarded activation, the controller assumes that one invoker slot has been freed when in fact it has not. As a consequence, the controller makes wrong decisions and, even worse, its out-of-sync bookkeeping does not repair itself but stays in this state as long as the workload remains high. Activations then have to wait for processing on the chosen invoker because no free slots are available, so they may exceed their completion ack timeout in turn and end up being discarded by the controller as well.
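As a rough illustration of the formula above, here is a minimal Python sketch. It is not the Scala code linked under [1]; the function name and the `addend` parameter are hypothetical, and STD_DURATION is assumed to be 1 minute, which is consistent with the 60 seconds to 180 seconds example. The addend is exposed as a parameter only to show how a different constant would shift the result.

```python
from datetime import timedelta

# Assumption: TimeLimit.STD_DURATION is 1 minute (matches the 60 s -> 180 s example).
STD_DURATION = timedelta(minutes=1)


def completion_ack_timeout(action_time_limit: timedelta,
                           timeout_factor: int = 2,
                           addend: timedelta = timedelta(minutes=1)) -> timedelta:
    """Hypothetical mirror of
    (actionTimeLimit.max(STD_DURATION) * timeoutFactor) + addend.
    """
    # Short limits are raised to STD_DURATION before scaling.
    effective_limit = max(action_time_limit, STD_DURATION)
    return effective_limit * timeout_factor + addend


if __name__ == "__main__":
    # Today's behavior: fixed 1-minute addend.
    print(completion_ack_timeout(timedelta(seconds=60)))
    # Same action with a larger, hypothetical 5-minute addend.
    print(completion_ack_timeout(timedelta(seconds=60),
                                 addend=timedelta(minutes=5)))
```

Under these assumptions, an action with a 60-second limit gets a 180-second timeout with the 1-minute addend and a 420-second timeout with a 5-minute addend, so slow image pulls would have correspondingly more headroom before the controller forces completion.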
To make a long story short: we would like the constant duration of 1 minute to be configurable. By increasing it to an appropriate value, and with it the calculated completion ack timeout, we think we can avoid the forced completion of activations in our system in many of the situations we have observed.
Please let me know what you think.
[1] https://github.com/apache/openwhisk/blob/81ac503f7efc8ee99ea1a37ef9ec3d6163d96c85/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/CommonLoadBalancer.scala#L86-L104
Kind regards
Steffen Rost
IBM Cloud Functions Development