Hello Dominic, thanks for your detailed response.
I guess your understanding is right, with one small correction:

> So the main issue here is there are too many "Rescheduling Run" messages in invokers?

Seeing these log entries in the invoker is not the main issue in itself. They are just an indication that something is going wrong in the invoker: more activations are waiting to be processed than the ContainerPool can currently serve.

Actually, there are different reasons why "Rescheduling Run message" log entries can show up in the invoker:

1. Controllers send too many activations to an invoker.
2. In the invoker, the container pool sends a Run message to a container proxy, but the container proxy fails to process it properly and hands it back to the container pool. Examples: a Run message arrives while the proxy is already removing the container; if concurrency > 1, the proxy buffers Run messages and returns them in failure situations.

Although I'm not 100% sure, I see more indications for reason 1 in our logs than for reason 2.

Regarding the hypothesis "#controllers * getInvokerSlot(invoker user memory size) > invoker user memory size": I can rule this out in our environments. We have "#controllers * getInvokerSlot(invoker user memory size) = invoker user memory size". I provided PR [1] to be sure about that.

Regarding the hypothesis "invoker simply pulls too many Run messages from MessageFeed": I think the part you described is perfectly right. The question remains why controllers send too many Run messages, or a Run message with an activation that is larger than the free memory capacity currently available in the pool. The load balancer keeps memory book-keeping for each of its invoker shards (memory size determined by getInvokerSlot()), so it is supposed to schedule an activation to an invoker only if the required memory does not exceed the controller's shard of that invoker. Even if the resulting Run messages arrive on the invoker in a changed order, the free memory of the invoker's shard should be sufficient.
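To illustrate the book-keeping argument, here is a minimal Python sketch. It is not the OpenWhisk implementation: shard_size_mb() is a hypothetical stand-in for getInvokerSlot() that assumes an equal, rounded-down split, and ShardBookkeeping, the memory sizes, and the controller count are made up for the example.

```python
def shard_size_mb(invoker_user_memory_mb: int, num_controllers: int) -> int:
    """Hypothetical stand-in for getInvokerSlot(): equal split, rounded down."""
    return invoker_user_memory_mb // num_controllers


class ShardBookkeeping:
    """One controller's view of its memory shard on a single invoker (assumed model)."""

    def __init__(self, shard_mb: int) -> None:
        self.free_mb = shard_mb

    def try_schedule(self, required_mb: int) -> bool:
        # Schedule only if the activation fits into this controller's shard.
        if required_mb > self.free_mb:
            return False
        self.free_mb -= required_mb
        return True

    def release(self, required_mb: int) -> None:
        # Called when the activation has completed.
        self.free_mb += required_mb


invoker_mb = 2048   # assumed invoker user memory
controllers = 2     # assumed number of controllers
shards = [ShardBookkeeping(shard_size_mb(invoker_mb, controllers))
          for _ in range(controllers)]

# Every controller tries to schedule more 256 MB activations than fit.
accepted_mb = 0
for shard in shards:
    for _ in range(10):
        if shard.try_schedule(256):
            accepted_mb += 256

# The sum of all accepted activations never exceeds the invoker's memory.
print(accepted_mb <= invoker_mb)
```

Under these assumptions, no combination of controllers can overcommit the invoker, whatever order the Run messages arrive in. That is why a mismatch between shard sizes and invoker memory would have been a plausible cause, and why it is worth ruling out.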
Do you see a considerable number of "Rescheduling Run message" log entries in your environments?

[1] https://github.com/apache/incubator-openwhisk/pull/4520

Regards,
Sven Lange-Last
Senior Software Engineer
IBM Cloud Functions
Apache OpenWhisk
E-mail: sven.lange-l...@de.ibm.com