[
https://issues.apache.org/jira/browse/STORM-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197658#comment-14197658
]
ASF GitHub Bot commented on STORM-537:
--------------------------------------
Github user Sergeant007 commented on the pull request:
https://github.com/apache/storm/pull/304#issuecomment-61764038
Thanks for the review, @clockfly
I have added the necessary comment and removed the tests. Sorry, I wasn't able
to simplify them: in a simple synchronous mode, the tests would hang forever
when something is wrong instead of failing. So I implemented them in a rather
complicated (I'm new to Clojure) but robust way. Another issue was the
difficulty of reproducing the bug: 1. you must already be connected, and 2. it
reproduces only if you send several messages at once. Anyway, I have removed
the tests as you asked. Please review.
> A worker reconnects infinitely to another dead worker
> -----------------------------------------------------
>
> Key: STORM-537
> URL: https://issues.apache.org/jira/browse/STORM-537
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.3
> Reporter: Sergey Tryuber
>
> We're using 0.9.3-rc1. Most probably this wrong behavior was introduced as a
> side effect of STORM-409. When I kill a worker, another worker starts to
> print messages like:
> {noformat}
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> ..... so on
> {noformat}
> Then it reaches the default of 300 max_retries and starts the cycle again:
> {noformat}
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-<HOST>:4706, [id: 0xec088412, /<HOST>:39795 :> <HOST>:4706]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> {noformat}
> And so on infinitely...
> The issue is most probably in the backtype.storm.messaging.netty.Client#connect
> method, in the following place, which determines whether we give up on
> reconnecting:
> {code}
> if (null != channel) {
>     LOG.info("connection established to a remote host " + name() + ", " + channel.toString());
>     channelRef.set(channel);
> } else {
>     close();
>     throw new RuntimeException("Remote address is not reachable. We will close this client " + name());
> }
> {code}
> I guess (not tried yet) that the _channel_ object is not _null_ if this is a
> real reconnection. So the method returns a _channel_ object and then
> reconnection starts again and again.
> This might be fixed by explicitly adding *current = null;* to the following
> code block of the same method:
> {code}
> if (!future.isSuccess()) {
>     if (null != current) {
>         current.close();
>     }
> }
> {code}
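The hypothesis in the description can be illustrated with a small standalone
model. This is only a sketch under stated assumptions, not Storm's or Netty's
actual code: FakeChannel, ReconnectSketch, connect, outcome and the
resetOnFailure flag are hypothetical names invented for the illustration. It
assumes, as the reporter guesses, that a channel object exists even when a
connection attempt fails, so the later null check cannot detect the failure
unless the reference is cleared.

```java
// Hypothetical stand-in for a network channel; not a Netty or Storm class.
final class FakeChannel {
    private boolean open = true;

    void close() { open = false; }

    boolean isOpen() { return open; }
}

class ReconnectSketch {
    /**
     * Models a retry loop. attemptSucceeds[i] says whether attempt i connects.
     * resetOnFailure toggles the "current = null;" change proposed in the issue.
     */
    static FakeChannel connect(boolean[] attemptSucceeds, boolean resetOnFailure) {
        FakeChannel current = null;
        for (boolean success : attemptSucceeds) {
            // Assumption: a channel object is created even if the attempt fails.
            current = new FakeChannel();
            if (success) {
                return current;
            }
            current.close();
            if (resetOnFailure) {
                current = null; // proposed fix: forget the closed channel
            }
        }
        return current; // without the reset, a closed but non-null channel leaks out
    }

    /** Mirrors the null check quoted in the issue description. */
    static String outcome(FakeChannel channel) {
        return channel != null ? "connection established" : "gave up";
    }

    public static void main(String[] args) {
        boolean[] allFail = {false, false, false};
        System.out.println(outcome(connect(allFail, false))); // prints "connection established"
        System.out.println(outcome(connect(allFail, true)));  // prints "gave up"
    }
}
```

In this model, without the reset the caller sees a non-null (but closed)
channel after every retry has failed and reports success, which would match
the repeating "connection established" / "Reconnect started" cycle in the logs
above. With the reset, the null check takes the give-up branch instead.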
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)