[
https://issues.apache.org/jira/browse/STORM-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326283#comment-14326283
]
ASF GitHub Bot commented on STORM-329:
--------------------------------------
Github user nathanmarz commented on the pull request:
https://github.com/apache/storm/pull/429#issuecomment-74914806
Nimbus only knows a worker is having trouble when it stops sending
heartbeats. If a worker gets into a bad state, the worst thing to do is have it
continue trying to limp along in that bad state. It should instead suicide as
quickly as possible. It seems counterintuitive, but this aggressive suiciding
behavior actually makes things more robust as it prevents processes from
getting into weird, potentially undefined states. This has been a crucial
design principle in Storm from the beginning. One consequence of it is that any
crucial system thread that receives an unrecoverable exception must suicide the
process rather than die quietly.
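As a concrete illustration of that principle, here is a minimal sketch (plain Java, not Storm's actual handler) of a default uncaught-exception handler that halts the whole process as soon as a critical thread hits an unrecoverable error:
{code:java}
// Minimal sketch of the fail-fast ("suicide") principle described above.
// This is not Storm's actual handler; it only illustrates killing the whole
// process instead of letting a worker limp along in an undefined state.
public final class FailFastSketch {

    public static void installHandler() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Unrecoverable error on thread "
                    + thread.getName() + ": " + error);
            // halt() skips shutdown hooks, so the worker cannot get stuck
            // half-dead; the supervisor notices the missing heartbeats and
            // restarts the worker in a known state.
            Runtime.getRuntime().halt(1);
        });
    }

    public static void main(String[] args) throws Exception {
        installHandler();
        Thread critical = new Thread(() -> {
            throw new IllegalStateException("simulated unrecoverable failure");
        }, "critical-system-thread");
        critical.start();
        critical.join();
    }
}
{code}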
The connection retry problem is trickier, since a worker may fail to connect
simply because the other worker is still starting up. So the retry policy
should be tied to the launch timeouts for worker processes specified in the
configuration. Not being able to connect after the launch timeout + a certain
number of attempts + a buffer period would certainly qualify as a weird state,
so the process should suicide in that case.
*Suiciding and restarting gets the worker back to a known state*.
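To make that budget concrete, here is a hedged sketch of how a reconnect budget could be derived from the launch timeout; the parameter names are illustrative only, not the actual Storm config keys used by the patch:
{code:java}
// Hedged sketch: tie the reconnect budget to the worker launch timeout, a
// bounded number of attempts, and a buffer period, then suicide once the
// budget is exhausted. Parameter names are illustrative only.
public final class ReconnectBudgetSketch {

    private final long deadlineMillis;

    public ReconnectBudgetSketch(long workerLaunchTimeoutMs,
                                 int maxRetries,
                                 long retryIntervalMs,
                                 long bufferMs) {
        // Not connected after launch timeout + retries + buffer => weird state.
        this.deadlineMillis = System.currentTimeMillis()
                + workerLaunchTimeoutMs
                + (long) maxRetries * retryIntervalMs
                + bufferMs;
    }

    /** True while it is still reasonable to keep retrying the connection. */
    public boolean withinBudget() {
        return System.currentTimeMillis() < deadlineMillis;
    }

    /** Called by the reconnect loop once the budget is exhausted. */
    public void suicide(String remoteWorker) {
        System.err.println("Cannot connect to " + remoteWorker
                + " within the budget; halting so the worker restarts in a known state");
        Runtime.getRuntime().halt(1);
    }
}
{code}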
So in this case, I am heavily in favor of Option 2. I don't care about
killing the other tasks in the worker because this is a rare situation. It is
infinitely more important to get the worker back to a known, robust state than
risk leaving it in a weird state permanently.
I would like to see these issues addressed as part of this patch.
@miguno Thanks for the explanation on this patch's relation to backpressure
– we'll handle that in a future patch.
> Fix cascading Storm failure by improving reconnection strategy and buffering
> messages
> -------------------------------------------------------------------------------------
>
> Key: STORM-329
> URL: https://issues.apache.org/jira/browse/STORM-329
> Project: Apache Storm
> Issue Type: Improvement
> Affects Versions: 0.9.2-incubating, 0.9.3
> Reporter: Sean Zhong
> Assignee: Michael Noll
> Labels: Netty
> Fix For: 0.10.0
>
> Attachments: storm-329.patch, worker-kill-recover3.jpg
>
>
> _Note: The original title of this ticket was: "Add Option to Config Message
> handling strategy when connection timeout"._
> This is to address a [concern brought
> up|https://github.com/apache/incubator-storm/pull/103#issuecomment-43632986]
> during the work at STORM-297:
> {quote}
> [~revans2] wrote: Your logic makes sense to me on why these calls are
> blocking. My biggest concern around the blocking is the case of a worker
> crashing. If a single worker crashes, this can block the entire topology from
> executing until that worker comes back up. In some cases I can see that being
> something you would want. In other cases I can see speed being the primary
> concern, and some users would like to get partial data fast rather than
> accurate data later.
> Could we make it configurable in a follow-up JIRA, with a max limit on the
> buffering that is allowed before we block or throw data away (which is what
> zeromq does)?
> {quote}
> If a worker crashes suddenly, how should we handle the messages that were
> supposed to be delivered to it?
> 1. Should we buffer all messages indefinitely?
> 2. Should we block message sending until the connection is restored?
> 3. Should we configure a buffer limit, buffer messages up to that limit, and
> then block once it is reached?
> 4. Should we neither block nor buffer too much, but instead drop the messages
> and rely on Storm's built-in failover mechanism?
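For illustration, a minimal sketch of how options 3 and 4 above differ, using a bounded queue that either blocks the sender when full or drops the message and leaves recovery to Storm's ack/replay failover (not part of the patch; class and method names are hypothetical):
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch contrasting option 3 (bounded buffer, then block) with
// option 4 (never block, drop and rely on Storm's ack/fail replay).
// Class and method names are hypothetical.
public final class PendingMessagesSketch {

    private final BlockingQueue<byte[]> pending;

    public PendingMessagesSketch(int bufferLimit) {
        this.pending = new ArrayBlockingQueue<>(bufferLimit);
    }

    /** Option 3: buffer up to the limit, then block the caller until space frees up. */
    public void bufferThenBlock(byte[] message) throws InterruptedException {
        pending.put(message); // blocks when the queue is full
    }

    /** Option 4: never block; drop when full and let Storm's failover replay the tuple. */
    public boolean bufferOrDrop(byte[] message) {
        return pending.offer(message); // returns false when the message is dropped
    }
}
{code}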