[
https://issues.apache.org/jira/browse/STORM-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327642#comment-14327642
]
Nathan Marz commented on STORM-677:
-----------------------------------
Option 2 doesn't have to be a long-term solution, as it should be easy to
implement. I do not view the options as very similar, since I think Option 2
will be significantly more robust: getting out of a weird state as fast as
possible is really important.
"If that itself can cause other workers to give up on a connection it could
result in the topology never reaching a stable state." -> This is exactly why
the amount of time spent attempting to make a connection must be related to the
start timeout for a worker.
> Maximum retries strategy may cause data loss
> --------------------------------------------
>
> Key: STORM-677
> URL: https://issues.apache.org/jira/browse/STORM-677
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.3, 0.10.0
> Reporter: Michael Noll
> Priority: Minor
> Labels: Netty
>
> h3. Background
> Storm currently supports the configuration setting
> storm.messaging.netty.max_retries. This setting is supposed to limit the
> number of reconnection attempts a Netty client will perform in case of a
> connection loss.
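>
> For reference, a minimal sketch of supplying this key via the Java Config map
> (the setting normally lives in the cluster-wide storm.yaml, but the key is the
> same either way; the value 10 and the class name below are purely
> illustrative):
> {code}
> import backtype.storm.Config;
>
> public class NettyRetryConfigExample {
>     public static void main(String[] args) {
>         Config conf = new Config();
>         // Cap on the reconnection attempts a Netty client will make.
>         conf.put("storm.messaging.netty.max_retries", 10);
>         System.out.println(conf);
>     }
> }
> {code}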
> Unfortunately users have run into situations where this behavior will result
> in data loss:
> {quote}
> https://github.com/apache/storm/pull/429/files#r24681006
> This could be a separate JIRA, but we ran into a situation where we hit the
> maximum number of reconnection attempts, and the exception was eaten because
> it was thrown from a background thread and it just killed the background
> thread. This code appears to do the same thing.
> {quote}
> The problem can be summarized by the following example: Once a Netty client
> hits the maximum number of connection retries, it will stop trying to
> reconnect (as intended) but will also continue to run forever without being
> able to send any messages to its designated remote targets. At this point
> data will be lost because any messages that the Netty client is supposed to
> send will be dropped (by design). And since the Netty client is still alive
> and thus considered "functional", Storm cannot do anything about this data
> loss.
> For a more detailed description please take a look at the discussion in
> https://github.com/apache/storm/pull/429/files#r24742354.
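>
> To make the failure mode concrete, here is a simplified, hypothetical sketch
> of what the send path effectively degenerates into once the retry limit is
> exhausted. This is not the actual Client.java code; all names are
> illustrative, and it assumes the Netty 3 API used by these Storm versions:
> {code}
> import java.util.concurrent.atomic.AtomicLong;
> import org.jboss.netty.channel.Channel;
>
> // Illustrative stand-in for the Netty client's send path after max retries.
> class DiscardingSendPath {
>     private volatile Channel channel;          // stays null once the connection is lost
>     private final AtomicLong dropped = new AtomicLong();
>
>     void send(byte[] message) {
>         if (channel == null || !channel.isConnected()) {
>             // No reconnect will ever be scheduled again, so from now on
>             // every message ends up here and is silently lost.
>             dropped.incrementAndGet();
>             return;
>         }
>         channel.write(message);
>     }
> }
> {code}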
> h3. Possible solutions
> (Most of this section is copy-pasted from an [earlier discussion on this
> problem|https://github.com/apache/storm/pull/429/files#r24742354].)
> There are at least three approaches we may consider:
> # Let the Netty client die if max retries is reached, so that the Storm task
> has the chance to re-create a client and thus break out of the client's
> discard-messages-forever state.
> # Let the "parent" Storm task die if (one of its possibly many) Netty clients
> dies, so that by restarting the task we'll also get a new Netty client.
> # Remove the max retries semantics as well as the corresponding setting from
> Storm's configuration. Here, a Netty client will continue to reconnect to a
> remote destination forever. The possible negative impacts of these reconnects
> (e.g. the number of TCP connection attempts in a cluster) are kept in check by
> our exponential backoff policy for such connection retries.
> My personal opinion on these three approaches:
> - I do not like (1) because I feel it introduces potentially confusing
> semantics: we would keep the max retries setting, but it would no longer be a
> hard limit. It would rather become a "max retries until we recreate a Netty
> client", and recreating the client would also reset any exponential backoff
> strategy of the "previous" Netty client instance (cf.
> StormBoundedExponentialBackoffRetry). If we do want such resets (but I don't
> think we do at this point), then a cleaner approach would be to implement such
> resetting inside the retry policy itself (again, cf.
> StormBoundedExponentialBackoffRetry); see the sketch after this list.
> - I do not like (2) because a single "bad" Netty client would be able to take
> down a Storm task, which among other things would also impact any other,
> working Netty clients of the Storm task.
> - Option (3) seems a reasonable approach, although it breaks backwards
> compatibility with regard to Storm's configuration (because we'd now ignore
> storm.messaging.netty.max_retries).
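>
> As noted in the first bullet above, if resetting the backoff were ever wanted,
> a cleaner place for it would be the retry policy itself. A purely illustrative
> sketch of such a resettable policy (this is not part of
> StormBoundedExponentialBackoffRetry's actual API):
> {code}
> // Hypothetical, resettable variant of a bounded exponential backoff policy.
> class ResettableBoundedBackoff {
>     private final long baseSleepMs;
>     private final long maxSleepMs;
>     private int retryCount = 0;
>
>     ResettableBoundedBackoff(long baseSleepMs, long maxSleepMs) {
>         this.baseSleepMs = baseSleepMs;
>         this.maxSleepMs = maxSleepMs;
>     }
>
>     synchronized long nextSleepMs() {
>         long sleep = Math.min(maxSleepMs, baseSleepMs << Math.min(retryCount, 20));
>         retryCount++;
>         return sleep;
>     }
>
>     synchronized void reset() {
>         // Same effect as recreating the client, but contained in the policy.
>         retryCount = 0;
>     }
> }
> {code}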
> Here's initial feedback from other developers:
> {quote}
> https://github.com/apache/storm/pull/429/files#r24824540
> revans2: I personally prefer option 3, no maximum number of reconnection
> attempts. Having the client decide that it is done before nimbus does feels
> like it is asking for trouble.
> {quote}
> {quote}
> https://github.com/ptgoetz
> ptgoetz: I'm in favor of option 3 as well. I'm not that concerned about
> storm.messaging.netty.max_retries being ignored. We could probably just log a
> warning that that configuration option is deprecated and will be ignored if
> the value is set.
> {quote}
> {quote}
> https://github.com/apache/storm/pull/429#issuecomment-74914806
> nathanmarz: Nimbus only knows a worker is having trouble when it stops
> sending heartbeats. If a worker gets into a bad state, the worst thing to do
> is have it continue trying to limp along in that bad state. It should instead
> suicide as quickly as possible. It seems counterintuitive, but this
> aggressive suiciding behavior actually makes things more robust as it
> prevents processes from getting into weird, potentially undefined states.
> This has been a crucial design principle in Storm from the beginning. One
> consequence of it is that any crucial system thread that receives an
> unrecoverable exception must suicide the process rather than die quietly.
> For the connection retry problem, it's a tricky situation since it may not be
> able to connect because the other worker is still getting set up. So the
> retry policy should be somehow related to the launch timeouts for worker
> processes specified in the configuration. Not being able to connect after the
> launch timeout + a certain number of attempts + a buffer period would
> certainly qualify as a weird state, so the process should suicide in that
> case. Suiciding and restarting gets the worker back to a known state.
> So in this case, I am heavily in favor of Option 2. I don't care about
> killing the other tasks in the worker because this is a rare situation. It is
> infinitely more important to get the worker back to a known, robust state
> than risk leaving it in a weird state permanently.
> {quote}
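>
> Below is a rough sketch of the check nathanmarz describes above: connection
> failures persisting past the launch timeout plus a buffer, after a minimum
> number of attempts, trigger a worker suicide. The class, the timeout source,
> and the use of Runtime.halt are all illustrative, not existing Storm code:
> {code}
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Illustrative "give up and suicide the worker" check for a stuck connection.
> class ConnectionWatchdog {
>     private final long launchTimeoutMs;  // derived from the worker launch timeout config
>     private final long bufferMs;         // extra grace period
>     private final int minAttempts;       // require at least this many failed attempts
>     private final AtomicInteger attempts = new AtomicInteger();
>     private final long firstFailureMs = System.currentTimeMillis();
>
>     ConnectionWatchdog(long launchTimeoutMs, long bufferMs, int minAttempts) {
>         this.launchTimeoutMs = launchTimeoutMs;
>         this.bufferMs = bufferMs;
>         this.minAttempts = minAttempts;
>     }
>
>     void onFailedAttempt() {
>         attempts.incrementAndGet();
>         long elapsed = System.currentTimeMillis() - firstFailureMs;
>         if (elapsed > launchTimeoutMs + bufferMs && attempts.get() >= minAttempts) {
>             // Back to a known state: kill the whole worker process so the
>             // supervisor restarts it, instead of limping along unable to connect.
>             Runtime.getRuntime().halt(1);
>         }
>     }
> }
> {code}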
> If we decide to go with option 3, then the essence of the fix is the
> following modification of Client.java:
> {code}
> private boolean reconnectingAllowed() {
>     // BEFORE:
>     // return !closing && connectionAttempts.get() <= (maxReconnectionAttempts + 1);
>     return !closing;
> }
> {code}
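>
> With that change, the reconnect loop keeps retrying until the client is
> closed, with the delay between attempts still bounded by the exponential
> backoff policy. A simplified sketch of the resulting behavior, reusing the
> hypothetical backoff helper sketched earlier (the loop structure and helper
> calls are illustrative, not the actual Client.java code):
> {code}
> // Illustrative reconnect loop under option 3: only "closing" stops it.
> private void reconnectLoop() throws InterruptedException {
>     while (reconnectingAllowed()) {            // now simply "!closing"
>         Thread.sleep(backoff.nextSleepMs());   // bounded exponential backoff
>         if (tryConnect()) {                    // hypothetical helper; true on success
>             return;                            // connected again; queued messages can flow
>         }
>     }
> }
> {code}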