Hi Sven (and everyone!),

At long last I have worked through some tests for separating out this feature (which changes how rescheduled Run messages are handled) into an isolated PR. Please see https://github.com/apache/openwhisk/pull/4593
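In short: instead of immediately resending a Run message that could not be placed (which is what caused the tight loop), the pool parks it in the runBuffer and only retries when resources may have changed (NeedWork, ContainerRemoved, etc.). A simplified sketch of that idea - the names and numbers here are illustrative, not the exact code in the PR:

    import scala.collection.immutable.Queue
    import akka.actor.Actor

    // Illustrative stand-ins for the real pool messages.
    case class Run(action: String, requiredMemoryMB: Int)
    case object NeedWork
    case class ContainerRemoved(freedMemoryMB: Int)

    class ContainerPoolSketch extends Actor {
      // Runs that cannot be scheduled right now wait here instead of being
      // resent to the pool immediately (the old behavior that pegged the CPU).
      private var runBuffer: Queue[Run] = Queue.empty
      private var freeMemoryMB: Int = 2048

      def receive: Receive = {
        case r: Run =>
          // Park the message if it does not fit; do NOT send it back to self.
          if (!trySchedule(r)) runBuffer = runBuffer.enqueue(r)

        // Re-examine the buffer only when resources may have changed.
        case NeedWork => drainBuffer()
        case ContainerRemoved(freed) =>
          freeMemoryMB += freed
          drainBuffer()
      }

      private def trySchedule(r: Run): Boolean =
        if (r.requiredMemoryMB <= freeMemoryMB) {
          freeMemoryMB -= r.requiredMemoryMB
          // ...start or reuse a container proxy for r here...
          true
        } else false

      // Schedule buffered Runs in order until one no longer fits.
      private def drainBuffer(): Unit =
        while (runBuffer.headOption.exists(trySchedule)) {
          runBuffer = runBuffer.tail
        }
    }

The real pool logic is of course more involved; the point is just that a Run that cannot be placed is parked rather than bounced back in a tight loop.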
I plan to do some more load testing on this, but so far no problems.

Thanks
Tyson

On 7/19/19, 7:49 AM, "Sven Lange-Last" <sven.lange-l...@de.ibm.com> wrote:

Hello Tyson,

regarding your feedback:

> Related to the "Rescheduling Run message", one problem we have
> encountered in these cases is that the invoker becomes unstable due
> (I think) to a tight message loop, since the message that couldn't
> run is immediately resent to the pool to be run, which fails again,
> etc. We saw CPU getting pegged, and the invoker would eventually crash.
> I have a PR related to cluster-managed resources where, among other
> things, this message looping is removed:
> https://github.com/apache/incubator-openwhisk/pull/4326/files#diff-726b36b3ab8c7cff0b93dead84311839L198
>
> Instead of resending the message to the pool immediately, it just
> waits in the runBuffer, and the runBuffer is processed in reaction
> to any potential change in resources: NeedWork, ContainerRemoved,
> etc. This may add delay to any buffered message(s), but seems to
> avoid the catastrophic crash in our systems.

From my point of view, your proposal on changing processing of
rescheduled Run messages makes sense.

The PR you referenced above does not only improve this particular area
but also includes a lot of other changes - in particular, it adds a
different way of managing containers. Due to the PR's size and
complexity, it is very hard to understand and review. Would you be able
to split this PR up into smaller changes?

Mit freundlichen Grüßen / Regards,

Sven Lange-Last
Senior Software Engineer
IBM Cloud Functions
Apache OpenWhisk
E-mail: sven.lange-l...@de.ibm.com

Tyson Norris <tnor...@adobe.com.INVALID> wrote on 2019/07/08 17:52:14:

> From: Tyson Norris <tnor...@adobe.com.INVALID>
> To: "dev@openwhisk.apache.org" <dev@openwhisk.apache.org>
> Date: 2019/07/08 18:01
> Subject: [EXTERNAL] Re: Re: OpenWhisk invoker overloads -
> "Rescheduling Run message"
>
> Related to the "Rescheduling Run message", one problem we have
> encountered in these cases is that the invoker becomes unstable due
> (I think) to a tight message loop, since the message that couldn't
> run is immediately resent to the pool to be run, which fails again,
> etc. We saw CPU getting pegged, and the invoker would eventually crash.
> I have a PR related to cluster-managed resources where, among other
> things, this message looping is removed:
> https://github.com/apache/incubator-openwhisk/pull/4326/files#diff-726b36b3ab8c7cff0b93dead84311839L198
>
> Instead of resending the message to the pool immediately, it just
> waits in the runBuffer, and the runBuffer is processed in reaction
> to any potential change in resources: NeedWork, ContainerRemoved,
> etc. This may add delay to any buffered message(s), but seems to
> avoid the catastrophic crash in our systems.
>
> Thanks
> Tyson
>
> On 7/5/19, 1:16 AM, "Sven Lange-Last" <sven.lange-l...@de.ibm.com> wrote:
>
> Hello Dominic,
>
> thanks for your detailed response.
>
> I guess your understanding is right - just this small correction:
>
> > So the main issue here is there are too many "Rescheduling Run"
> > messages in invokers?
>
> It's not the main issue to see these log entries in the invoker. They
> are just an indication that something is going wrong in the invoker -
> more activations are waiting to be processed than the ContainerPool can
> currently serve.
>
> Actually, there are different reasons why "Rescheduling Run message"
> log entries can show up in the invoker:
>
> 1. Controllers send too many activations to an invoker.
>
> 2. In the invoker, the container pool sends a Run message to a
> container proxy, but the container proxy fails to process it properly
> and hands it back to the container pool. Examples: a Run message
> arrives while the proxy is already removing the container; if
> concurrency > 1, the proxy buffers Run messages and returns them in
> failure situations.
>
> Although I'm not 100% sure, I see more indications for reason 1 in our
> logs than for reason 2.
>
> Regarding hypothesis "#controllers * getInvokerSlot(invoker user memory
> size) > invoker user memory size": I can rule out this hypothesis in
> our environments. We have "#controllers * getInvokerSlot(invoker user
> memory size) = invoker user memory size". I provided PR [1] to be sure
> about that.
>
> Regarding hypothesis "invoker simply pulls too many Run messages from
> MessageFeed": I think the part you described is perfectly right. The
> question remains why controllers send too many Run messages, or a Run
> message with an activation that is larger than the free memory capacity
> currently available in the pool.
>
> The load balancer has memory book-keeping for all of its invoker shards
> (memory size determined by getInvokerSlot()), so the load balancer is
> supposed to only schedule an activation to an invoker if the required
> memory does not exceed the controller's shard of the invoker. Even if
> the resulting Run messages arrive on the invoker in a changed order,
> the invoker shard's free memory should be sufficient.
>
> Do you see a considerable number of "Rescheduling Run message" log
> entries in your environments?
>
> [1] https://github.com/apache/incubator-openwhisk/pull/4520
>
> Mit freundlichen Grüßen / Regards,
>
> Sven Lange-Last
> Senior Software Engineer
> IBM Cloud Functions
> Apache OpenWhisk
> E-mail: sven.lange-l...@de.ibm.com
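For context on the shard sizing and book-keeping Sven describes above: each controller only manages its share of an invoker's user memory, and it keeps local book-keeping so that it schedules an activation to an invoker only while the activation still fits into that share. A rough sketch of the idea - the names, signatures and numbers below are illustrative, not the actual OpenWhisk load balancer code:

    // Simplified sketch of per-controller invoker shard sizing and memory
    // book-keeping. All names and values are illustrative.
    final case class MemoryMB(mb: Long) {
      def -(other: MemoryMB): MemoryMB = MemoryMB(mb - other.mb)
      def >=(other: MemoryMB): Boolean = mb >= other.mb
    }

    object ShardingSketch {

      // Each controller gets an equal share of the invoker's user memory, so
      // #controllers * getInvokerSlot(invokerUserMemory) = invokerUserMemory
      // (up to rounding when the memory does not divide evenly).
      def getInvokerSlot(invokerUserMemory: MemoryMB, controllers: Int): MemoryMB =
        MemoryMB(invokerUserMemory.mb / controllers)

      def main(args: Array[String]): Unit = {
        val invokerUserMemory = MemoryMB(8192)
        val controllers = 4

        // This controller's shard of the invoker: 2048 MB.
        var freeInShard = getInvokerSlot(invokerUserMemory, controllers)

        // Book-keeping: only send a Run message for an activation to this
        // invoker while the activation's memory still fits into the shard.
        def tryScheduleHere(activationMemory: MemoryMB): Boolean =
          if (freeInShard >= activationMemory) {
            freeInShard = freeInShard - activationMemory
            true
          } else {
            false // try another invoker instead of overloading this one
          }

        println(tryScheduleHere(MemoryMB(256)))  // true  - fits into the shard
        println(tryScheduleHere(MemoryMB(4096))) // false - exceeds the shard
      }
    }

If every controller respects its own shard like this, the Run messages arriving at an invoker should never need more memory than the invoker has free, which is why persistent "Rescheduling Run message" entries suggest something else is going wrong.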