Sergey Tryuber created STORM-537:
------------------------------------

             Summary: A worker reconnects infinitely to another dead worker
                 Key: STORM-537
                 URL: https://issues.apache.org/jira/browse/STORM-537
             Project: Apache Storm
          Issue Type: Bug
    Affects Versions: 0.9.3
            Reporter: Sergey Tryuber


We're using 0.9.3-rc1. Most probably this incorrect behavior was introduced as a 
side effect of STORM-409. When I kill a worker, another worker starts to 
print messages like:
{noformat}
2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-<HOST>:4706... [0]
2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-<HOST>:4706... [1]
2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-<HOST>:4706... [2]
..... so on
{noformat}
Then it reaches the default max_retries of 300 and starts the cycle again:
{noformat}
2014-10-20 11:54:38 b.s.m.n.Client [INFO] connection established to a remote 
host Netty-Client-<HOST>:4706, [id: 
0xec088412, /<HOST>:39795 :> <HOST>:4706]
2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-<HOST>:4706... [0]
2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-<HOST>:4706... [1]
2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-<HOST>:4706... [2]
{noformat}
And so on infinitely... 

The issue is most probably in the backtype.storm.messaging.netty.Client#connect 
method, in the following place, which decides whether we give up on reconnection:
{code}
if (null != channel) {
    LOG.info("connection established to a remote host " + name() + ", "
            + channel.toString());
    channelRef.set(channel);
} else {
    close();
    throw new RuntimeException("Remote address is not reachable. We will close this client "
            + name());
}
{code}
I guess (not tried yet) that the _channel_ object is not _null_ if this is a real 
reconnection. So the method returns a _channel_ object, and then reconnection 
starts again and again.

This might be fixed by explicitly adding *current = null;* to the following code 
block of the same method:
{code}
if (!future.isSuccess()) {
    if (null != current) {
        current.close();
    }
}
{code}
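
To make the suspected mechanism concrete, here is a minimal standalone sketch. It is not Storm's actual code. It assumes that a failed connect attempt still hands back a non-null channel object and that this reference survives the retry loop. FakeChannel, FailedAttempt and the clearOnFailure flag are hypothetical names used only for illustration.

```java
class ReconnectSketch {
    /** Hypothetical stand-in for a Netty channel; not Storm's or Netty's class. */
    static final class FakeChannel {
        private boolean open = true;
        void close() { open = false; }
        boolean isOpen() { return open; }
    }

    /**
     * Hypothetical connect attempt against a dead worker: it always fails,
     * yet still returns a non-null channel object (an assumption).
     */
    static final class FailedAttempt {
        private final FakeChannel channel = new FakeChannel();
        boolean isSuccess() { return false; }
        FakeChannel getChannel() { return channel; }
    }

    /** Simplified retry loop; clearOnFailure toggles the proposed extra line. */
    static FakeChannel connect(int maxRetries, boolean clearOnFailure) {
        FakeChannel current = null;
        for (int tried = 0; tried <= maxRetries; tried++) {
            FailedAttempt future = new FailedAttempt();
            current = future.getChannel();
            if (!future.isSuccess()) {
                if (null != current) {
                    current.close();
                    if (clearOnFailure) {
                        current = null; // the proposed fix
                    }
                }
            } else {
                break; // connected, stop retrying
            }
        }
        if (null != current) {
            // Mirrors the "connection established" branch quoted above.
            return current;
        }
        // Mirrors the give-up branch quoted above.
        throw new RuntimeException("Remote address is not reachable");
    }

    public static void main(String[] args) {
        // Without the fix, a closed channel is reported as a connection.
        FakeChannel leaked = connect(3, false);
        System.out.println("without fix: channel returned, open = " + leaked.isOpen());

        // With the fix, the retries are exhausted and the client gives up.
        try {
            connect(3, true);
        } catch (RuntimeException e) {
            System.out.println("with fix: " + e.getMessage());
        }
    }
}
```

In this sketch the first call returns an already closed channel, so a caller would treat it as connected and then immediately start reconnecting again. The second call reaches the give-up branch instead. Whether the real Client#connect behaves this way still needs to be checked against the source.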





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
