[
https://issues.apache.org/jira/browse/STORM-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187905#comment-14187905
]
ASF GitHub Bot commented on STORM-329:
--------------------------------------
Github user clockfly commented on the pull request:
https://github.com/apache/storm/pull/268#issuecomment-60863620
Hi Ted,
```
This will still cause another worker crash.
Now two things happen in parallel: first, nimbus informs worker A that
worker B is not alive via the ZooKeeper timeout; second, worker A connects to
worker B via Netty. When worker B dies for some reason, we can't guarantee
which happens first. If the connection from worker A to worker B breaks first,
the client's status will become closing, but at that point nimbus hasn't yet
informed worker A that worker B has died, so worker A will still send messages
to worker B, the exception will be thrown, and what STORM-404 describes will
happen.
```
Suppose worker A sends messages to worker B.
If B dies, A will make multiple reconnection attempts. A will get
notified that B is dead and will set the closing flag of the Client (to B),
which aborts the reconnection process; no RuntimeException will be thrown.
If there is a network partitioning issue, suppose worker A is isolated from
the rest of the cluster. A will not get notified that B is not alive. Then A
will keep retrying the connection to B, eventually throw a RuntimeException,
and exit. In this case A cannot recover, so throwing is the best option.
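To make the two outcomes concrete, here is a minimal Java sketch of that kind of
reconnection logic. It is not Storm's actual Client implementation; the class,
the PeerConnector interface, and the parameter names are hypothetical. It only
illustrates how a closing flag set from cluster-membership updates aborts the
retry loop, while an unreachable peer on an isolated worker ends in a
RuntimeException.
```
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only, not Storm's actual Client: shows how a "closing" flag set by
// cluster-membership updates aborts the reconnection loop, while an
// unreachable peer (e.g. a partitioned worker) ends in a RuntimeException.
public class ReconnectingClientSketch {

    // Hypothetical stand-in for the underlying Netty channel.
    public interface PeerConnector {
        boolean tryConnect();
    }

    private final AtomicBoolean closing = new AtomicBoolean(false);
    private final int maxRetries;
    private final long retryIntervalMs;

    public ReconnectingClientSketch(int maxRetries, long retryIntervalMs) {
        this.maxRetries = maxRetries;
        this.retryIntervalMs = retryIntervalMs;
    }

    // Called when nimbus / ZooKeeper tells this worker the peer is dead.
    public void close() {
        closing.set(true);
    }

    public void reconnect(PeerConnector connector) throws InterruptedException {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (closing.get()) {
                return;                // peer declared dead: abort quietly
            }
            if (connector.tryConnect()) {
                return;                // connection re-established
            }
            Thread.sleep(retryIntervalMs);
        }
        // Nobody told us the peer died, yet it stays unreachable: most likely
        // this worker is partitioned. Crashing lets the supervisor restart it.
        throw new RuntimeException(
                "Remote peer not reachable after " + maxRetries + " retries");
    }
}
```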
> Add Option to Config Message handling strategy when connection timeout
> ----------------------------------------------------------------------
>
> Key: STORM-329
> URL: https://issues.apache.org/jira/browse/STORM-329
> Project: Apache Storm
> Issue Type: Improvement
> Affects Versions: 0.9.2-incubating
> Reporter: Sean Zhong
> Priority: Minor
> Labels: Netty
> Fix For: 0.9.2-incubating
>
> Attachments: storm-329.patch
>
>
> This is to address a [concern brought
> up|https://github.com/apache/incubator-storm/pull/103#issuecomment-43632986]
> during the work on STORM-297:
> {quote}
> [~revans2] wrote: Your logic makes sense to me on why these calls are
> blocking. My biggest concern around the blocking is the case of a worker
> crashing. If a single worker crashes, this can block the entire topology from
> executing until that worker comes back up. In some cases I can see that being
> something that you would want. In other cases I can see speed being the
> primary concern, and some users would like to get partial data fast rather
> than accurate data later.
> Could we make it configurable in a follow-up JIRA where we can have a max
> limit to the buffering that is allowed before we block, or throw data away
> (which is what zeromq does)?
> {quote}
> If a worker crashes suddenly, how should we handle the messages that were
> supposed to be delivered to that worker?
> 1. Should we buffer all messages indefinitely?
> 2. Should we block message sending until the connection is restored?
> 3. Should we configure a buffer limit, try to buffer the messages first, and
> block once the limit is reached?
> 4. Should we neither block nor buffer too much, but instead drop the
> messages and rely on Storm's built-in failover mechanism?
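As a rough illustration of options 3 and 4 above, the following sketch (not
Storm code; the class and policy names are made up) shows a bounded buffer
whose full-buffer behavior is configurable: block the sender, or drop the
message and rely on tuple replay.
```
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch only, not Storm code: a bounded buffer for messages addressed to a
// disconnected worker. When the buffer fills up, the sender either blocks
// (option 3) or drops the message and relies on tuple replay (option 4).
public class PendingMessageBuffer<T> {

    public enum FullPolicy { BLOCK, DROP }

    private final BlockingQueue<T> pending;
    private final FullPolicy policy;

    public PendingMessageBuffer(int capacity, FullPolicy policy) {
        this.pending = new ArrayBlockingQueue<>(capacity);
        this.policy = policy;
    }

    // Returns false only when the message was dropped under the DROP policy.
    public boolean enqueue(T message) throws InterruptedException {
        if (policy == FullPolicy.BLOCK) {
            pending.put(message);       // buffer first, block once the limit is hit
            return true;
        }
        return pending.offer(message);  // drop immediately when the buffer is full
    }

    // Drained by the connection once it is re-established.
    public T nextMessage(long timeoutMs) throws InterruptedException {
        return pending.poll(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```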