Related to the "Rescheduling Run message" log entries: one problem we have 
encountered in these cases is that the invoker becomes unstable due (I think) 
to a tight message loop. The message that couldn't run is immediately resent 
to the pool to be run, which fails again, and so on. We saw the CPU getting 
pegged, and the invoker would eventually crash.
I have a PR related to cluster managed resources where, among other things, 
this message looping is removed:
https://github.com/apache/incubator-openwhisk/pull/4326/files#diff-726b36b3ab8c7cff0b93dead84311839L198

Instead of resending the message to the pool immediately, it just waits in the 
runbuffer, and the runbuffer is processed in reaction to any potential change 
in resources: NeedWork, ContainerRemoved, etc. This may add delay to any 
buffered message(s), but seems to avoid the catastrophic crash in our systems. 
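
To illustrate the difference, here is a minimal Python sketch of the buffering idea. It is a toy model, not the code from the PR: the class name, method names and memory numbers are all made up for illustration, and the FIFO handling is an assumption. The point it shows is that a buffered message is only retried when a resource event arrives, so the number of scheduling attempts stays bounded while the pool is full.

```python
from collections import deque


class ContainerPoolSketch:
    """Toy model: buffered Run messages wait until resources may have changed."""

    def __init__(self, free_memory_mb):
        self.free_memory_mb = free_memory_mb
        self.run_buffer = deque()  # Run messages that could not be scheduled yet
        self.started = []          # names of activations that got a container
        self.attempts = 0          # number of scheduling attempts made so far

    def _try_start(self, name, memory_mb):
        self.attempts += 1
        if memory_mb > self.free_memory_mb:
            return False
        self.free_memory_mb -= memory_mb
        self.started.append(name)
        return True

    def on_run(self, name, memory_mb):
        # Assumption: keep FIFO order, so new messages queue behind buffered ones.
        if self.run_buffer or not self._try_start(name, memory_mb):
            self.run_buffer.append((name, memory_mb))

    def on_resources_changed(self, freed_memory_mb=0):
        # Stands in for events such as NeedWork or ContainerRemoved.
        self.free_memory_mb += freed_memory_mb
        while self.run_buffer:
            name, memory_mb = self.run_buffer[0]
            if not self._try_start(name, memory_mb):
                break  # still no room: stop and wait for the next event
            self.run_buffer.popleft()


pool = ContainerPoolSketch(free_memory_mb=256)
pool.on_run("a", 256)  # fits, starts immediately
pool.on_run("b", 256)  # no room, waits in the run buffer
pool.on_run("c", 128)  # queued behind "b" without a scheduling attempt
attempts_while_full = pool.attempts
pool.on_resources_changed(freed_memory_mb=256)  # e.g. the container of "a" is removed
```

With immediate resending, "b" would be retried continuously while the pool is full; here it is retried once per resource event.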

Thanks
Tyson

On 7/5/19, 1:16 AM, "Sven Lange-Last" <sven.lange-l...@de.ibm.com> wrote:

    Hello Dominic,
    
    thanks for your detailed response.
    
    I guess your understanding is right - just this small correction:
    
    > So the main issue here is there are too many "Rescheduling Run" messages 
    in invokers?
    
    It's not the main issue to see these log entries in the invoker. This is 
    just the indication that something is going wrong in the invoker - more 
    activations are waiting to be processed than the ContainerPool can 
    currently serve.
    
    Actually, there are different reasons why "Rescheduling Run message" log 
    entries can show up in the invoker:
    
    1. Controllers send too many activations to an invoker.
    
    2. In the invoker, the container pool sends a Run message to a container 
    proxy but the container proxy fails to process it properly and hands it 
    back to the container pool. Examples: a Run message arrives while the 
    proxy is already removing the container; if concurrency>1, the proxy 
    buffers Run messages and returns them in failure situations.
    
    Although I'm not 100% sure, I see more indications for reason 1 in our 
    logs than for reason 2.
    
    Regarding hypothesis "#controllers * getInvokerSlot(invoker user memory 
    size) > invoker user memory size": I can rule out this hypothesis in our 
    environments. We have "#controllers * getInvokerSlot(invoker user memory 
    size) = invoker user memory size". I provided PR [1] to be sure about 
    that.
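
A small Python sketch of the check behind this hypothesis. It assumes a simplified model in which getInvokerSlot() splits the invoker user memory evenly across controllers and clamps the result to a minimum slot size; the function names, parameters and numbers below are hypothetical and are not taken from the OpenWhisk code base.

```python
def get_invoker_slot(invoker_user_memory_mb, num_controllers, min_slot_mb):
    """Hypothetical model: even split per controller, clamped to a minimum."""
    return max(invoker_user_memory_mb // num_controllers, min_slot_mb)


def is_overcommitted(invoker_user_memory_mb, num_controllers, min_slot_mb):
    """True if the controllers' shards together exceed the invoker user memory."""
    slot = get_invoker_slot(invoker_user_memory_mb, num_controllers, min_slot_mb)
    return num_controllers * slot > invoker_user_memory_mb


# 4 controllers, 2048 MB: each shard is 512 MB and the shards sum to 2048 MB.
balanced = is_overcommitted(2048, 4, 128)

# 8 controllers, 512 MB: the even split (64 MB) is clamped to 128 MB,
# so the shards sum to 1024 MB, more than the invoker actually has.
overcommitted = is_overcommitted(512, 8, 128)
```

Under this model the hypothesis can only hold when the clamp applies, i.e. when the even split falls below the minimum slot size.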
    
    Regarding hypothesis "invoker simply pulls too many Run messages from 
    MessageFeed": I think the part you described is perfectly right. The 
    question remains why controllers send too many Run messages or a Run 
    message with an activation that is larger than the free memory capacity 
    currently available in the pool.
    
    The load balancer keeps memory book-keeping for all of its invoker shards 
    (memory size determined by getInvokerSlot()), so the load balancer is 
    supposed to only schedule an activation to an invoker if the required 
    memory does not exceed the controller's shard of the invoker. Even if the 
    resulting Run messages arrive on the invoker in a changed order, the 
    free memory in the invoker's shard should be sufficient.
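
The book-keeping described above can be sketched in Python as follows. This is a toy model under stated assumptions, not the load balancer's actual implementation: the class, its methods and the invoker name are hypothetical. It shows memory being reserved from the controller's shard before an activation is sent, and released when the activation completes.

```python
class ShardBookkeeping:
    """Toy model of one controller's memory book-keeping per invoker shard."""

    def __init__(self, shard_memory_mb):
        # invoker name -> free memory (MB) in this controller's shard
        self.free_mb = dict(shard_memory_mb)

    def try_schedule(self, invoker, required_mb):
        """Reserve memory for an activation; refuse if the shard is too full."""
        if required_mb > self.free_mb[invoker]:
            return False
        self.free_mb[invoker] -= required_mb
        return True

    def complete(self, invoker, released_mb):
        """Give the memory back once the activation has finished."""
        self.free_mb[invoker] += released_mb


book = ShardBookkeeping({"invoker0": 512})
first = book.try_schedule("invoker0", 256)   # fits: 256 MB remain
second = book.try_schedule("invoker0", 512)  # refused: only 256 MB are free
book.complete("invoker0", 256)               # first activation finishes
third = book.try_schedule("invoker0", 512)   # fits again
```

Because every reservation is made before the Run message is sent, the sum of outstanding reservations never exceeds the shard, regardless of the order in which the messages arrive.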
    
    Do you see a considerable number of "Rescheduling Run message" log entries 
    in your environments?
    
    [1] https://github.com/apache/incubator-openwhisk/pull/4520
    
    
    Regards,
    
    Sven Lange-Last
    Senior Software Engineer
    IBM Cloud Functions
    Apache OpenWhisk
    
    
    E-mail: sven.lange-l...@de.ibm.com
    
    Schoenaicher Str. 220
    Boeblingen, 71032
    Germany
    
    
    
    
    IBM Deutschland Research & Development GmbH
    Chairwoman of the Supervisory Board: Martina Koederitz
    Management: Dirk Wittkopp
    Registered office: Böblingen / Registration court: Amtsgericht Stuttgart, 
    HRB 243294
    
    
    
