Hello Dominic,

Thanks for your detailed response.

I think your understanding is right, with one small correction:

> So the main issue here is there are too many "Rescheduling Run" messages
> in invokers?

Seeing these log entries in the invoker is not the main issue in itself. 
They only indicate that something is going wrong in the invoker: more 
activations are waiting to be processed than the ContainerPool can 
currently serve.

Actually, there are different reasons why "Rescheduling Run message" log 
entries can show up in the invoker:

1. Controllers send too many activations to an invoker.

2. In the invoker, the container pool sends a Run message to a container 
proxy but the container proxy fails to process it properly and hands it 
back to the container pool. Examples: a Run message arrives while the 
proxy is already removing the container; if concurrency>1, the proxy 
buffers Run messages and returns them in failure situations.
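
To make the two reasons concrete, here is a minimal Python sketch. It is a 
toy model only, not the actual ContainerPool or ContainerProxy code; the 
class name, method names and memory figures are hypothetical and chosen 
just for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyContainerPool:
    """Toy model of an invoker's pool; NOT OpenWhisk's ContainerPool actor."""
    user_memory_mb: int
    busy_mb: int = 0
    run_buffer: deque = field(default_factory=deque)
    log: list = field(default_factory=list)

    def _reschedule(self, activation_id: str, memory_mb: int) -> None:
        # Record the log entry and keep the message for a later retry.
        self.log.append(f"Rescheduling Run message: {activation_id}")
        self.run_buffer.append((activation_id, memory_mb))

    def on_run(self, activation_id: str, memory_mb: int) -> bool:
        """Start the activation if enough memory is free, else reschedule it."""
        if self.busy_mb + memory_mb <= self.user_memory_mb:
            self.busy_mb += memory_mb
            return True
        # Reason 1: more work arrived than the pool can currently serve.
        self._reschedule(activation_id, memory_mb)
        return False

    def on_proxy_rejected(self, activation_id: str, memory_mb: int) -> None:
        """Reason 2: a proxy hands an accepted Run message back to the pool."""
        self.busy_mb -= memory_mb
        self._reschedule(activation_id, memory_mb)


pool = ToyContainerPool(user_memory_mb=512)
pool.on_run("a1", 256)             # accepted
pool.on_run("a2", 256)             # accepted, pool is now full
pool.on_run("a3", 256)             # reason 1: rescheduled
pool.on_proxy_rejected("a2", 256)  # reason 2: handed back and rescheduled
print("\n".join(pool.log))
```

In this model both reasons produce the same log entry, which is why the 
log line alone does not tell which of the two is happening.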

Although I'm not 100% sure, I see more indications for reason 1 in our 
logs than for reason 2.

Regarding the hypothesis "#controllers * getInvokerSlot(invoker user memory 
size) > invoker user memory size": I can rule this out for our 
environments. We have "#controllers * getInvokerSlot(invoker user memory 
size) = invoker user memory size". I opened PR [1] to make sure of 
that.
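
For illustration, the check behind this hypothesis can be sketched in 
Python as follows. This is a simplified model under assumptions, not the 
real getInvokerSlot(): I assume each controller gets an equal share of the 
invoker's user memory and that the share is floored at a minimum slot size 
(min_slot_mb, 128 MB here, is a hypothetical value). Under that assumption 
the shards only add up to more than the invoker memory when the equal 
share falls below the floor.

```python
def get_invoker_slot_mb(invoker_memory_mb: int, controllers: int,
                        min_slot_mb: int = 128) -> int:
    """Hypothetical model: equal share per controller, floored at min_slot_mb."""
    return max(invoker_memory_mb // controllers, min_slot_mb)


def is_oversubscribed(invoker_memory_mb: int, controllers: int,
                      min_slot_mb: int = 128) -> bool:
    """True if the controllers' shards add up to more than the invoker has."""
    slot = get_invoker_slot_mb(invoker_memory_mb, controllers, min_slot_mb)
    return controllers * slot > invoker_memory_mb


# 4 controllers sharing 2048 MB: 4 * 512 MB equals the invoker memory.
print(is_oversubscribed(2048, 4))
# 4 controllers sharing 256 MB: the share (64 MB) is below the assumed floor,
# so 4 * 128 MB exceeds the invoker memory.
print(is_oversubscribed(256, 4))
```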

Regarding the hypothesis "invoker simply pulls too many Run messages from 
MessageFeed": I think the part you described is perfectly right. The 
question remains why controllers send too many Run messages, or a Run 
message with an activation that is larger than the free memory capacity 
currently available in the pool.

The load balancer keeps memory book-keeping for all of its invoker shards 
(memory size determined by getInvokerSlot()), so it is supposed to 
schedule an activation to an invoker only if the required memory does not 
exceed the free memory in the controller's shard of that invoker. Even if 
the resulting Run messages arrive on the invoker in a different order, the 
free memory of the controller's shard on the invoker should be sufficient.
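
A minimal Python sketch of this book-keeping, as I understand it, is shown 
below. The class and method names are hypothetical and the model ignores 
everything except memory; it is not the load balancer's actual code. It 
only illustrates why, if the shards add up to exactly the invoker's user 
memory, the activations in flight should never exceed that memory, 
whatever order the Run messages arrive in.

```python
class ShardBookkeeping:
    """Toy model of one controller's memory book-keeping for one invoker."""

    def __init__(self, shard_mb: int) -> None:
        self.free_mb = shard_mb

    def try_schedule(self, memory_mb: int) -> bool:
        """Reserve memory for an activation only if the shard can hold it."""
        if memory_mb > self.free_mb:
            return False
        self.free_mb -= memory_mb
        return True

    def release(self, memory_mb: int) -> None:
        """Give memory back once the activation has completed."""
        self.free_mb += memory_mb


invoker_memory_mb = 1024
# Two controllers, each with an equal shard of the invoker's user memory.
shards = [ShardBookkeeping(invoker_memory_mb // 2) for _ in range(2)]

in_flight_mb = 0
for shard in shards:
    for memory_mb in (256, 256, 256, 128, 512):
        if shard.try_schedule(memory_mb):
            in_flight_mb += memory_mb

# Every controller stays within its own shard, so the sum stays within
# the invoker's user memory.
print(in_flight_mb <= invoker_memory_mb)
```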

Do you see a considerable number of "Rescheduling Run message" log entries 
in your environments?

[1] https://github.com/apache/incubator-openwhisk/pull/4520


Kind regards,

Sven Lange-Last
Senior Software Engineer
IBM Cloud Functions
Apache OpenWhisk


E-mail: sven.lange-l...@de.ibm.com


Schoenaicher Str. 220
Boeblingen, 71032
Germany




IBM Deutschland Research & Development GmbH
Chairwoman of the Supervisory Board: Martina Koederitz
Management: Dirk Wittkopp
Registered office: Böblingen / Registration court: Amtsgericht Stuttgart, 
HRB 243294

