[ https://issues.apache.org/jira/browse/STORM-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208015#comment-14208015 ]
Sean Zhong commented on STORM-537:
----------------------------------

Merged, thanks for your contribution.

> A worker reconnects infinitely to another dead worker
> -----------------------------------------------------
>
>                 Key: STORM-537
>                 URL: https://issues.apache.org/jira/browse/STORM-537
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3
>            Reporter: Sergey Tryuber
>            Assignee: Sergey Tryuber
>
> We're using 0.9.3-rc1. Most probably this wrong behavior was introduced as a
> side effect of STORM-409. When I kill a worker, another worker starts to
> print messages like:
> {noformat}
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> ..... so on
> {noformat}
> Then it reaches the default 300 max_retries and starts the cycle again:
> {noformat}
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-<HOST>:4706, [id: 0xec088412, /<HOST>:39795 :> <HOST>:4706]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> {noformat}
> And so on infinitely...
>
> The issue is most probably in the backtype.storm.messaging.netty.Client#connect
> method, in the following place, which determines whether we give up on
> reconnection:
> {code}
> if (null != channel) {
>     LOG.info("connection established to a remote host " + name() + ", " + channel.toString());
>     channelRef.set(channel);
> } else {
>     close();
>     throw new RuntimeException("Remote address is not reachable. We will close this client " + name());
> }
> {code}
> I guess (not tried yet) that the _channel_ object is not _null_ if this is a
> real reconnection. So the method returns a _channel_ object, and then
> reconnection starts again and again.
>
> This might be fixed by explicitly adding *current = null;* to the following
> code block of the same method:
> {code}
> if (!future.isSuccess()) {
>     if (null != current) {
>         current.close();
>     }
> }
> {code}
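
The reporter's guess can be illustrated with a small standalone simulation. This is only a sketch and not Storm's actual code: FakeChannel, FakeFuture, and the resetOnFailure flag are hypothetical stand-ins, and the retry loop is assumed to keep the last attempted channel in a local variable named current, as the quoted snippets suggest.

```java
public class ReconnectSketch {
    /** Hypothetical stand-in for a Netty channel; it only tracks open/closed state. */
    public static final class FakeChannel {
        private boolean open = true;
        public void close() { open = false; }
        public boolean isOpen() { return open; }
    }

    /** Hypothetical stand-in for the result of one connect attempt. */
    public static final class FakeFuture {
        public final FakeChannel channel = new FakeChannel();
        private final boolean success;
        public FakeFuture(boolean success) { this.success = success; }
        public boolean isSuccess() { return success; }
    }

    /**
     * Tries to connect up to maxRetries times and returns what the caller's
     * "null != channel" check would see afterwards.
     * remoteAlive simulates whether the remote worker accepts connections.
     * resetOnFailure toggles the proposed "current = null;" line.
     */
    public static FakeChannel connect(int maxRetries, boolean remoteAlive, boolean resetOnFailure) {
        FakeChannel current = null;
        for (int tried = 0; tried < maxRetries; tried++) {
            FakeFuture future = new FakeFuture(remoteAlive);
            current = future.channel;
            if (future.isSuccess()) {
                return current; // live connection
            }
            current.close();
            if (resetOnFailure) {
                current = null; // proposed fix: forget the failed channel
            }
        }
        // Without the reset this is a closed but non-null channel.
        return current;
    }

    public static void main(String[] args) {
        FakeChannel withoutReset = connect(3, false, false);
        FakeChannel withReset = connect(3, false, true);
        System.out.println("without reset, channel is null: " + (withoutReset == null));
        System.out.println("with reset, channel is null: " + (withReset == null));
    }
}
```

In this model, a dead remote without the reset leaves the caller holding a closed, non-null channel, so a "null != channel" check would take the "connection established" branch instead of giving up. With the reset, the caller sees null and can close the client.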