[ https://issues.apache.org/jira/browse/STORM-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327642#comment-14327642 ]

Nathan Marz commented on STORM-677:
-----------------------------------

Option 2 doesn't have to be long term, as it should be easy to implement. I do 
not view the options as very similar: I think Option 2 will be significantly 
more robust, because getting out of a weird state as fast as possible is 
really important.

"If that itself can cause other workers to give up on a connection it could 
result in the topology never reaching a stable state." -> This is exactly why 
the amount of time spent attempting to make a connection must be related to the 
start timeout for a worker.
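
To make that concrete, here is a minimal, hypothetical sketch of tying the 
reconnection window to the worker start timeout and suiciding once it is exceeded 
(Option 2). The config key supervisor.worker.start.timeout.secs, the class name, and 
the 30-second buffer are illustrative assumptions, not the actual Client.java logic:
{code}
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; not the actual Client.java implementation.
public class ReconnectDeadline {
    private final long deadlineMillis;

    public ReconnectDeadline(Map<String, Object> conf) {
        // Assumed config key; the 120s default is also an assumption.
        long startTimeoutSecs =
                ((Number) conf.getOrDefault("supervisor.worker.start.timeout.secs", 120)).longValue();
        long bufferSecs = 30; // arbitrary grace period for this sketch
        this.deadlineMillis = System.currentTimeMillis()
                + TimeUnit.SECONDS.toMillis(startTimeoutSecs + bufferSecs);
    }

    /** True while reconnection attempts are still within the allowed window. */
    public boolean reconnectingAllowed() {
        return System.currentTimeMillis() < deadlineMillis;
    }

    /** Option 2 behavior: once the window has passed, kill the worker process. */
    public void suicideIfExpired() {
        if (!reconnectingAllowed()) {
            Runtime.getRuntime().halt(1); // the supervisor will relaunch the worker
        }
    }
}
{code}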

> Maximum retries strategy may cause data loss
> --------------------------------------------
>
>                 Key: STORM-677
>                 URL: https://issues.apache.org/jira/browse/STORM-677
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3, 0.10.0
>            Reporter: Michael Noll
>            Priority: Minor
>              Labels: Netty
>
> h3. Background
> Storm currently supports the configuration setting 
> storm.messaging.netty.max_retries.  This setting is supposed to limit the 
> number of reconnection attempts a Netty client will perform in case of a 
> connection loss.
> Unfortunately users have run into situations where this behavior will result 
> in data loss:
> {quote}
> https://github.com/apache/storm/pull/429/files#r24681006
> This could be a separate JIRA, but we ran into a situation where we hit the 
> maximum number of reconnection attempts, and the exception was eaten because 
> it was thrown from a background thread and it just killed the background 
> thread. This code appears to do the same thing.
> {quote}
> The problem can be summarized by the following example:  Once a Netty client 
> hits the maximum number of connection retries, it will stop trying to 
> reconnect (as intended) but will also continue to run forever without being 
> able to send any messages to its designated remote targets.  At this point 
> data will be lost because any messages that the Netty client is supposed to 
> send will be dropped (by design).  And since the Netty client is still alive 
> and thus considered "functional", Storm is not able to do anything about 
> this data loss situation.
> For a more detailed description please take a look at the discussion in 
> https://github.com/apache/storm/pull/429/files#r24742354.
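> To make the failure mode concrete, the following hypothetical sketch shows a send 
> path that behaves like the client described above once its retries are exhausted. 
> The names are illustrative; this is not the actual Client.java code:
> {code}
> import java.util.concurrent.atomic.AtomicInteger;
> 
> // Simplified illustration of the discard-messages-forever state (not Storm code).
> class NettyClientSketch {
>     private final int maxReconnectionAttempts;       // storm.messaging.netty.max_retries
>     private final AtomicInteger connectionAttempts = new AtomicInteger(0);
>     private volatile boolean connected = false;
> 
>     NettyClientSketch(int maxReconnectionAttempts) {
>         this.maxReconnectionAttempts = maxReconnectionAttempts;
>     }
> 
>     void send(Object message) {
>         if (connected) {
>             // ...write the message to the Netty channel...
>         } else if (connectionAttempts.get() > maxReconnectionAttempts) {
>             // Retries exhausted: the client stays alive, but every message that
>             // reaches this branch is silently dropped. This is the data-loss state.
>         } else {
>             // Still within the retry budget; a background thread keeps reconnecting.
>         }
>     }
> }
> {code}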
> h3. Possible solutions
> (Most of this section is copy-pasted from an [earlier discussion on this 
> problem|https://github.com/apache/storm/pull/429/files#r24742354].)
> There are at least three approaches we may consider:
> # Let the Netty client die if max retries is reached, so that the Storm task 
> has the chance to re-create a client and thus break out of the client's 
> discard-messages-forever state.
> # Let the "parent" Storm task die if (one of its possibly many) Netty clients 
> dies, so that by restarting the task we'll also get a new Netty client.
> # Remove the max retries semantics as well as the corresponding setting from 
> Storm's configuration. Here, a Netty client will continue to reconnect to a 
> remote destination forever. The possible negative impact of these reconnects 
> (e.g. number of TCP connection attempts in a cluster) are kept in check by 
> our exponential backoff policy for such connection retries (a minimal backoff 
> sketch follows this list).
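> As a hedged illustration of the bounded exponential backoff mentioned in option 3 
> (in the spirit of StormBoundedExponentialBackoffRetry, whose real API may differ):
> {code}
> // Sketch of a bounded exponential backoff policy; not the actual
> // StormBoundedExponentialBackoffRetry implementation.
> class BoundedExponentialBackoffSketch {
>     private final long baseSleepMs;  // e.g. storm.messaging.netty.min_wait_ms
>     private final long maxSleepMs;   // e.g. storm.messaging.netty.max_wait_ms
> 
>     BoundedExponentialBackoffSketch(long baseSleepMs, long maxSleepMs) {
>         this.baseSleepMs = baseSleepMs;
>         this.maxSleepMs = maxSleepMs;
>     }
> 
>     /** Sleep time before a given retry: doubles each attempt, capped at maxSleepMs. */
>     long sleepTimeMs(int retryCount) {
>         long uncapped = baseSleepMs * (1L << Math.min(retryCount, 30)); // overflow guard
>         return Math.min(uncapped, maxSleepMs);
>     }
> }
> {code}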
> My personal opinion on these three approaches:
> - I do not like (1) because I feel it introduces potentially confusing 
> semantics: We keep having a max retries setting, but it is not really a hard 
> limit anymore. It rather becomes a "max retries until we recreate a Netty 
> client", and would also reset any exponential backoff strategy of the 
> "previous" Netty client instance (cf. StormBoundedExponentialBackoffRetry). 
> If we do want such resets (but I don't think we do at this point), then a 
> cleaner approach would be to implement such resetting inside the retry policy 
> (again, cf. StormBoundedExponentialBackoffRetry).
> - I do not like (2) because a single "bad" Netty client would be able to take 
> down a Storm task, which among other things would also impact any other, 
> working Netty clients of the Storm task.
> - Option (3) seems a reasonable approach, although it breaks backwards 
> compatibility with regard to Storm's configuration (because we'd now ignore 
> storm.messaging.netty.max_retries).
> Here's initial feedback from other developers:
> {quote}
> https://github.com/apache/storm/pull/429/files#r24824540
> revans2: I personally prefer option 3, no maximum number of reconnection 
> attempts. Having the client decide that it is done, before nimbus does feels 
> like it is asking for trouble.
> {quote}
> {quote}
> https://github.com/ptgoetz
> ptgoetz: I'm in favor of option 3 as well. I'm not that concerned about 
> storm.messaging.netty.max_retries being ignored. We could probably just log a 
> warning that that configuration option is deprecated and will be ignored if 
> the value is set.
> {quote}
> {quote}
> https://github.com/apache/storm/pull/429#issuecomment-74914806
> nathanmarz: Nimbus only knows a worker is having trouble when it stops 
> sending heartbeats. If a worker gets into a bad state, the worst thing to do 
> is have it continue trying to limp along in that bad state. It should instead 
> suicide as quickly as possible. It seems counterintuitive, but this 
> aggressive suiciding behavior actually makes things more robust as it 
> prevents processes from getting into weird, potentially undefined states. 
> This has been a crucial design principle in Storm from the beginning. One 
> consequence of it is that any crucial system thread that receives an 
> unrecoverable exception must suicide the process rather than die quietly.
> For the connection retry problem, it's a tricky situation since it may not be 
> able to connect because the other worker is still getting set up. So the 
> retry policy should be somehow related to the launch timeouts for worker 
> processes specified in the configuration. Not being able to connect after the 
> launch timeout + a certain number of attempts + a buffer period would 
> certainly qualify as a weird state, so the process should suicide in that 
> case. Suiciding and restarting gets the worker back to a known state.
> So in this case, I am heavily in favor of Option 2. I don't care about 
> killing the other tasks in the worker because this is a rare situation. It is 
> infinitely more important to get the worker back to a known, robust state 
> than risk leaving it in a weird state permanently.
> {quote}
> If we decide to go with option 3, then the essence of the fix is the 
> following modification of Client.java:
> {code}
>     private boolean reconnectingAllowed() {
>         // BEFORE:
>         // return !closing && connectionAttempts.get() <= (maxReconnectionAttempts + 1);
>         return !closing;
>     }
> {code}
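> Following ptgoetz's suggestion above, we could also log a warning when the 
> now-ignored setting is present. The sketch below is hypothetical; the class, method, 
> and logging setup are illustrative and not the actual Storm code:
> {code}
> import java.util.Map;
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
> 
> // Hypothetical deprecation warning, not the actual Client.java code.
> class MaxRetriesDeprecationCheck {
>     private static final Logger LOG = LoggerFactory.getLogger(MaxRetriesDeprecationCheck.class);
> 
>     static void warnIfConfigured(Map<String, Object> stormConf) {
>         if (stormConf.containsKey("storm.messaging.netty.max_retries")) {
>             LOG.warn("storm.messaging.netty.max_retries is deprecated and will be ignored; "
>                     + "the Netty client now reconnects indefinitely with exponential backoff.");
>         }
>     }
> }
> {code}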



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
