[ 
https://issues.apache.org/jira/browse/STORM-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208012#comment-14208012
 ] 

ASF GitHub Bot commented on STORM-537:
--------------------------------------

Github user Sergeant007 commented on the pull request:

    https://github.com/apache/storm/pull/304#issuecomment-62714754
  
    Guys, is there any update on this pull request? In my view, the bug it fixes 
is rather critical, and the fix is simple enough (no refactoring, etc.) to 
be merged quickly.


> A worker reconnects infinitely to another dead worker
> -----------------------------------------------------
>
>                 Key: STORM-537
>                 URL: https://issues.apache.org/jira/browse/STORM-537
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3
>            Reporter: Sergey Tryuber
>
> We're using 0.9.3-rc1. Most probably this wrong behavior was introduced as a 
> side effect of STORM-409. When I kill a worker, another worker starts to 
> print messages like:
> {noformat}
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-<HOST>:4706... [2]
> ..... so on
> {noformat}
> Then it reaches the default max_retries of 300 and starts the cycle again:
> {noformat}
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] connection established to a remote 
> host Netty-Client-<HOST>:4706, [id: 
> 0xec088412, /<HOST>:39795 :> <HOST>:4706]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-<HOST>:4706... [2]
> {noformat}
> And so on, indefinitely.
> The problem is most probably in the backtype.storm.messaging.netty.Client#connect 
> method, in the following place, which decides whether we give up on reconnecting:
> {code}
> if (null != channel) {
>     LOG.info("connection established to a remote host " + name() + ", "
>             + channel.toString());
>     channelRef.set(channel);
> } else {
>     close();
>     throw new RuntimeException("Remote address is not reachable. We will "
>             + "close this client " + name());
> }
> {code}
> I guess (not tried yet) that the _channel_ object is not _null_ if this is a 
> real reconnection. So the method returns a _channel_ object, and then 
> reconnection starts again and again.
> This might be fixed by explicitly adding *current = null;* to the following code 
> block of the same method:
> {code}
> if (!future.isSuccess()) {
>     if (null != current) {
>         current.close();
>     }
> }
> {code}
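
The suspected failure mode can be reproduced without Netty. The following is a minimal sketch, not the actual Storm code: ReconnectSketch, FakeChannel, the attempt callback and the dropStaleChannel flag are all hypothetical names introduced for illustration. It assumes, as the report guesses, that a non-null but dead channel is still referenced when the retries run out, so the final null check takes the "connection established" branch. Dropping the stale reference before retrying is only one possible remedy and is not necessarily the change made in the pull request.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

public class ReconnectSketch {

    /** Hypothetical stand-in for a network channel; not the Netty Channel type. */
    static final class FakeChannel {
        final boolean connected;

        FakeChannel(boolean connected) {
            this.connected = connected;
        }
    }

    /**
     * Simplified model of a connect-with-retries method.
     *
     * @return true if the method ends in the "connection established" branch,
     *         false if it ends in the "give up" branch
     */
    static boolean connect(AtomicReference<FakeChannel> channelRef,
                           int maxRetries,
                           BooleanSupplier attempt,
                           boolean dropStaleChannel) {
        FakeChannel channel = channelRef.get();
        if (channel != null && channel.connected) {
            return true; // already connected, nothing to do
        }
        if (dropStaleChannel) {
            channel = null; // assumption: forget the dead channel before retrying
        }
        for (int tried = 0; tried <= maxRetries; tried++) {
            if (attempt.getAsBoolean()) {
                channel = new FakeChannel(true); // a fresh, live channel
                break;
            }
        }
        // Same shape as the null check quoted in the issue.
        if (channel != null) {
            channelRef.set(channel);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        BooleanSupplier deadPeer = () -> false; // every connection attempt fails

        AtomicReference<FakeChannel> ref =
                new AtomicReference<>(new FakeChannel(false));
        System.out.println("stale kept:    " + connect(ref, 3, deadPeer, false));

        ref = new AtomicReference<>(new FakeChannel(false));
        System.out.println("stale dropped: " + connect(ref, 3, deadPeer, true));
    }
}
```

With a peer that never answers, the first call returns true even though no attempt succeeded, which matches the misleading "connection established" log line followed by a new retry cycle. The second call returns false, so the caller would reach the give-up branch.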



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
