[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

nathanmarz Wed, 18 Feb 2015 10:09:48 -0800

Github user nathanmarz commented on the pull request:

    https://github.com/apache/storm/pull/429#issuecomment-74914806
  
    Nimbus only knows a worker is having trouble when it stops sending 
heartbeats. If a worker gets into a bad state, the worst thing to do is have it 
continue trying to limp along in that bad state. It should instead suicide as 
quickly as possible. It seems counterintuitive, but this aggressive suiciding 
behavior actually makes things more robust as it prevents processes from 
getting into weird, potentially undefined states. This has been a crucial 
design principle in Storm from the beginning. One consequence of it is that any 
crucial system thread that receives an unrecoverable exception must suicide the 
process rather than die quietly. 
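
    As a rough illustration of that last point, here is a minimal Java sketch of a thread whose uncaught exceptions terminate the whole process. It is not Storm's actual code: `FailFastThreads`, `newCriticalThread`, and `HALT_PROCESS` are hypothetical names, and the choice of `Runtime.halt` (which skips shutdown hooks) is an assumption. The exit action is passed in so the behavior can be exercised without killing the JVM.

```java
/**
 * Hypothetical helper (not part of Storm): builds threads that terminate the
 * process when their body throws, instead of letting the thread die quietly.
 */
final class FailFastThreads {
    private FailFastThreads() {}

    /**
     * Assumed production exit action: halt immediately with a non-zero status.
     * halt() skips shutdown hooks, so a wedged hook cannot keep the process alive.
     */
    static final Runnable HALT_PROCESS = () -> Runtime.getRuntime().halt(1);

    /**
     * Creates (but does not start) a thread that runs exitAction whenever
     * body throws an exception or error that nothing else catches.
     */
    static Thread newCriticalThread(String name, Runnable body, Runnable exitAction) {
        Thread t = new Thread(body, name);
        t.setUncaughtExceptionHandler((thread, error) -> {
            // Log first so the cause is visible before the process goes away.
            System.err.println("Fatal error in " + thread.getName() + ": " + error);
            exitAction.run();
        });
        return t;
    }
}
```

    A worker would start such a thread with `FailFastThreads.newCriticalThread("heartbeat", body, FailFastThreads.HALT_PROCESS).start()`; the supervisor then sees the process disappear and can relaunch it from a known state.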
    
    For the connection retry problem, the situation is tricky because a worker 
may fail to connect simply because the other worker is still getting set up. So 
the retry policy should be tied to the launch timeouts for worker processes 
specified in the configuration. Not being able to connect after the launch 
timeout + a certain number of attempts + a buffer period would certainly 
qualify as a weird state, so the process should suicide in that case. 
*Suiciding and restarting gets the worker back to a known state*. 
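
    One way to read the rule above is as a time budget. The Java sketch below makes that reading concrete; it is only an interpretation, not the patch's implementation. `ConnectRetryBudget` and all of its parameter names are hypothetical (they are not Storm configuration keys), and treating "a certain number of attempts" as attempts multiplied by a fixed retry interval is an assumption.

```java
import java.time.Duration;

/**
 * Hypothetical sketch: how long a worker keeps retrying a connection before
 * it treats the failure as a "weird state" and terminates itself.
 */
final class ConnectRetryBudget {
    private final Duration launchTimeout;  // time the peer worker is allowed to take to start
    private final int extraAttempts;       // retries granted after the launch timeout
    private final Duration retryInterval;  // assumed fixed wait between those retries
    private final Duration buffer;         // safety margin on top

    ConnectRetryBudget(Duration launchTimeout, int extraAttempts,
                       Duration retryInterval, Duration buffer) {
        this.launchTimeout = launchTimeout;
        this.extraAttempts = extraAttempts;
        this.retryInterval = retryInterval;
        this.buffer = buffer;
    }

    /** Total budget: launch timeout + (attempts x interval) + buffer. */
    Duration total() {
        return launchTimeout
                .plus(retryInterval.multipliedBy(extraAttempts))
                .plus(buffer);
    }

    /** True once the time spent retrying is strictly greater than the budget. */
    boolean exhausted(Duration elapsedSinceFirstAttempt) {
        return elapsedSinceFirstAttempt.compareTo(total()) > 0;
    }
}
```

    A connection loop would check `exhausted(...)` after each failed attempt and, once it returns true, kill the process rather than keep retrying.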
    
    So in this case, I am heavily in favor of Option 2. I don't care about 
killing the other tasks in the worker because this is a rare situation. It is 
infinitely more important to get the worker back to a known, robust state than 
to risk leaving it in a weird state permanently.
    
    I would like to see these issues addressed as part of this patch.
    
    @miguno Thanks for the explanation on this patch's relation to backpressure 
- we'll handle that in a future patch.



