Github user nathanmarz commented on the pull request:
https://github.com/apache/storm/pull/429#issuecomment-74914806
Nimbus only knows a worker is having trouble when it stops sending
heartbeats. If a worker gets into a bad state, the worst thing to do is have it
continue trying to limp along in that bad state. It should instead suicide as
quickly as possible. It seems counterintuitive, but this aggressive suiciding
behavior actually makes things more robust as it prevents processes from
getting into weird, potentially undefined states. This has been a crucial
design principle in Storm from the beginning. One consequence of it is that any
crucial system thread that receives an unrecoverable exception must suicide the
process rather than die quietly.
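
A minimal sketch of this fail-fast rule in plain Java, not Storm's actual code. The names `FailFastThreads`, `failFast`, and `startCritical` are hypothetical, and the exit action is passed in as a callback so the example can run without killing the JVM. A real worker would presumably pass something like `Runtime.getRuntime()::halt`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

// Hypothetical illustration; none of these names come from Storm.
final class FailFastThreads {

    /** Builds a handler that requests process exit when a thread dies from an uncaught throwable. */
    public static Thread.UncaughtExceptionHandler failFast(IntConsumer exit) {
        return (thread, error) -> {
            System.err.println("Fatal error in thread " + thread.getName() + ": " + error);
            exit.accept(1); // take the whole process down instead of dying quietly
        };
    }

    /** Starts a crucial thread whose unrecoverable failure triggers the exit action. */
    public static Thread startCritical(String name, Runnable body, IntConsumer exit) {
        Thread t = new Thread(body, name);
        t.setUncaughtExceptionHandler(failFast(exit));
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // Record the requested exit code instead of halting, for demonstration only.
        AtomicInteger requestedExit = new AtomicInteger(-1);
        Thread t = startCritical(
                "heartbeat",
                () -> { throw new IllegalStateException("unrecoverable"); },
                requestedExit::set);
        t.join();
        System.out.println("exit requested with code " + requestedExit.get());
    }
}
```

The handler runs on the dying thread before it terminates, so the process goes down as soon as the failure is observed rather than continuing with one crucial thread missing.
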
The connection retry problem is a tricky situation, since a worker may not be
able to connect simply because the other worker is still getting set up. So the
retry policy should be tied to the launch timeouts for worker processes
specified in the configuration. Not being able to connect after the launch
timeout + a certain number of attempts + a buffer period would certainly
qualify as a weird state, so the process should suicide in that case.
*Suiciding and restarting gets the worker back to a known state*.
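
The retry budget described above (launch timeout + a number of attempts + a buffer period) could be sketched as follows. This is an illustration under assumptions, not the patch's implementation: `ConnectRetryPolicy` and its fields are hypothetical names, the launch timeout is passed in as a plain value because the comment does not say which configuration key it should come from, and "a certain number of attempts" is interpreted as that many retry intervals.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical illustration; none of these names come from Storm.
final class ConnectRetryPolicy {
    private final Duration launchTimeout;  // assumed to come from the worker launch timeout config
    private final int extraAttempts;       // attempts allowed beyond the launch timeout
    private final Duration retryInterval;  // pause between connection attempts
    private final Duration buffer;         // extra slack before giving up

    public ConnectRetryPolicy(Duration launchTimeout, int extraAttempts,
                              Duration retryInterval, Duration buffer) {
        this.launchTimeout = launchTimeout;
        this.extraAttempts = extraAttempts;
        this.retryInterval = retryInterval;
        this.buffer = buffer;
    }

    /** Total time to keep retrying: launch timeout + extra attempts + buffer. */
    public Duration giveUpAfter() {
        return launchTimeout.plus(retryInterval.multipliedBy(extraAttempts)).plus(buffer);
    }

    /** Retries until connected (returns true) or until the budget is spent (returns false). */
    public boolean connectOrGiveUp(BooleanSupplier tryConnect) throws InterruptedException {
        long deadline = System.nanoTime() + giveUpAfter().toNanos();
        while (true) {
            if (tryConnect.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // weird state: the caller should terminate the process
            }
            Thread.sleep(retryInterval.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConnectRetryPolicy policy = new ConnectRetryPolicy(
                Duration.ofMillis(200), 3, Duration.ofMillis(50), Duration.ofMillis(100));
        boolean connected = policy.connectOrGiveUp(() -> false); // peer never comes up
        if (!connected) {
            System.err.println("Could not connect within " + policy.giveUpAfter()
                    + "; a real worker would halt here");
            // Runtime.getRuntime().halt(1);
        }
    }
}
```

Tying the budget to the launch timeout means a peer that is merely slow to start is tolerated, while a peer that is still unreachable after that window causes the worker to restart into a known state.
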
So in this case, I am heavily in favor of Option 2. I don't care about
killing the other tasks in the worker because this is a rare situation. It is
infinitely more important to get the worker back to a known, robust state than
to risk leaving it in a weird state permanently.
I would like to see these issues addressed as part of this patch.
@miguno Thanks for the explanation on this patch's relation to backpressure -
we'll handle that in a future patch.