[ 
https://issues.apache.org/jira/browse/STORM-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188473#comment-14188473
 ] 

ASF GitHub Bot commented on STORM-537:
--------------------------------------

GitHub user Sergeant007 opened a pull request:

    https://github.com/apache/storm/pull/304

    [STORM-537] A worker reconnects infinitely to another dead worker

    A fix for [STORM-537](https://issues.apache.org/jira/browse/STORM-537). The 
bug is that a worker keeps reconnecting indefinitely to another, dead worker 
when it tries to send a batch of messages. Each message in the batch triggers 
a new reconnect. More details are in the JIRA issue.
    
    The pull request contains a simple fix and tests. 
"test-reconnect-to-permanently-failed-server" covers exactly this bug. 
"test-reconnect-to-temporarily-failed-server" was written as an extra, because 
this functionality is not covered by other tests.
    
    Note that Storm with the fix applied works well and resolved the issue in 
our staging environment.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sergeant007/storm storm-537-infinite-reconnection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/304.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #304
    
----
commit 1aacccf286829e9289d86a6ed10b23cb2b21bc47
Author: Sergey Tryuber <[email protected]>
Date:   2014-10-29T15:27:56Z

    [STORM-537] A worker reconnects infinitely to another dead worker

----


> A worker reconnects infinitely to another dead worker
> -----------------------------------------------------
>
>                 Key: STORM-537
>                 URL: https://issues.apache.org/jira/browse/STORM-537
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3
>            Reporter: Sergey Tryuber
>
> We're using 0.9.3-rc1. This wrong behavior was most probably introduced as a 
> side effect of STORM-409. When I kill a worker, another worker starts to 
> print messages like:
> {noformat}
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> ..... and so on
> {noformat}
> Then it reaches the default max_retries of 300 and starts the cycle again:
> {noformat}
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-<HOST>:4706, [id: 0xec088412, /<HOST>:39795 :> <HOST>:4706]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> {noformat}
> And so on, indefinitely. 
> The issue is most probably in the 
> backtype.storm.messaging.netty.Client#connect method, in the following place, 
> which determines whether we give up on reconnection:
> {code}
> if (null != channel) {
>     LOG.info("connection established to a remote host " + name() + ", " + channel.toString());
>     channelRef.set(channel);
> } else {
>     close();
>     throw new RuntimeException("Remote address is not reachable. We will close this client " + name());
> }
> {code}
> I guess (not tried yet) that the _channel_ object is not _null_ if this is a 
> real reconnection. So the method ends up with a non-null _channel_ object and 
> then reconnection starts again and again.
> This might be fixed by explicitly adding *current = null;* to the following 
> code block of the same method:
> {code}
> if (!future.isSuccess()) {
>     if (null != current) {
>         current.close();
>     }
> }
> {code}
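
The hypothesis in the issue description can be illustrated with a small, self-contained toy model. This is only a sketch: FakeChannel, ReconnectSketch, and the clearStaleReference flag are hypothetical names invented for illustration, they are not Storm or Netty APIs, and the sketch is not necessarily the change made in the pull request. It assumes, as the reporter guesses, that a channel reference left over from an earlier connection stays non-null after every retry has failed, so the final null check reports success instead of giving up.

```java
// Hypothetical stand-in for a network channel; NOT the Netty Channel class.
final class FakeChannel {
    private boolean open = true;
    void close() { open = false; }
    boolean isOpen() { return open; }
}

// Toy model of the retry loop discussed in the issue; NOT Storm's Client code.
final class ReconnectSketch {
    // The issue mentions a default of 300 retries; a small value keeps the demo short.
    static final int MAX_RETRIES = 3;

    /**
     * Tries to connect up to MAX_RETRIES times.
     * previous            - reference left over from an earlier connection (may be null)
     * serverAlive         - whether a connection attempt succeeds in this model
     * clearStaleReference - hypothetical switch: drop the old reference on failure
     * Returns the reference that the final null check would inspect.
     */
    static FakeChannel connect(FakeChannel previous, boolean serverAlive,
                               boolean clearStaleReference) {
        FakeChannel channel = previous;
        for (int tried = 0; tried < MAX_RETRIES; tried++) {
            if (serverAlive) {
                channel = new FakeChannel(); // successful attempt: fresh channel
                break;
            }
            if (channel != null) {
                channel.close();             // failed attempt: close what we hold
                if (clearStaleReference) {
                    channel = null;          // and forget it, so the null check can fail
                }
            }
        }
        return channel;
    }

    // Mirrors the quoted check: non-null means "connection established".
    static boolean reportsEstablished(FakeChannel channel) {
        return channel != null;
    }

    public static void main(String[] args) {
        FakeChannel stale = new FakeChannel();
        stale.close(); // the remote worker has died

        FakeChannel kept = connect(stale, false, false);
        System.out.println("stale reference kept, reports established: "
                + reportsEstablished(kept));

        FakeChannel cleared = connect(stale, false, true);
        System.out.println("stale reference cleared, reports established: "
                + reportsEstablished(cleared));
    }
}
```

In this model, keeping the stale reference makes a permanently dead server look like a successful reconnect, which would restart the retry cycle on the next send. Clearing it lets the caller reach the "give up" branch instead.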



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
