[
https://issues.apache.org/jira/browse/STORM-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178611#comment-14178611
]
Sergey Tryuber commented on STORM-537:
--------------------------------------
I was able to reproduce the issue with the following code snippet (it should be
added to
[netty_unit_test.clj|https://github.com/apache/storm/blob/master/storm-core/test/clj/backtype/storm/messaging/netty_unit_test.clj]):
{code}
;; `port` and `task` are not defined here; they are presumably the vars
;; already available in netty_unit_test.clj.
(deftest test-server-failed-permanently
  (let [req_msg (String. "0123456789abcdefghijklmnopqrstuvwxyz")
        storm-conf {STORM-MESSAGING-TRANSPORT "backtype.storm.messaging.netty.Context"
                    STORM-MESSAGING-NETTY-BUFFER-SIZE 1024
                    STORM-MESSAGING-NETTY-MAX-RETRIES 10
                    STORM-MESSAGING-NETTY-MIN-SLEEP-MS 1000
                    STORM-MESSAGING-NETTY-MAX-SLEEP-MS 5000
                    STORM-MESSAGING-NETTY-SERVER-WORKER-THREADS 1
                    STORM-MESSAGING-NETTY-CLIENT-WORKER-THREADS 1}
        context (TransportFactory/makeContext storm-conf)
        client (.connect context nil "localhost" port)
        ;; The server thread receives a single message, checks it, then closes.
        server (Thread.
                 (fn []
                   (let [server (.bind context nil port)
                         iter (.recv server 0 0)
                         resp (.next iter)]
                     (is (= task (.task resp)))
                     (is (= req_msg (String. (.message resp))))
                     (.close server))))
        _ (.start server)
        _ (println "Let the client connect to the server initially")
        _ (.send client task (.getBytes req_msg))
        _ (Thread/sleep 5000)
        _ (println "Permanently stopping the server")
        _ (.stop server)
        _ (Thread/sleep 5000)
        _ (println "Sending a message to the server")
        _ (.send client task (.getBytes req_msg))
        _ (println "We would expect to see RuntimeException(\"connection failed \" + name(), e) here")
        _ (.send client task (.getBytes req_msg))
        _ (println "But it wasn't raised. Indeed, we're trying to reconnect on every consecutive message")
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))]
    (.close client)
    (.join server)
    (.term context)))
{code}
Note, this is not a complete test yet.
The reconnect does not actually happen infinitely, but only
STORM-NETTY-MESSAGE-BATCH-SIZE times (which takes quite a long time). Then it
finally fails with ClosedChannelException when it tries to write to the closed
channel...
> A worker reconnects infinitely to another dead worker
> -----------------------------------------------------
>
> Key: STORM-537
> URL: https://issues.apache.org/jira/browse/STORM-537
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.3
> Reporter: Sergey Tryuber
>
> We're using 0.9.3-rc1. Most probably this wrong behavior was introduced as a
> side effect of STORM-409. When I kill a worker, another worker starts to
> print messages like:
> {noformat}
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> ..... and so on
> {noformat}
> Then it reaches the default max_retries of 300 and starts the cycle again:
> {noformat}
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-<HOST>:4706, [id: 0xec088412, /<HOST>:39795 :> <HOST>:4706]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> {noformat}
> And so on infinitely...
> The issue is most probably in the backtype.storm.messaging.netty.Client#connect
> method, in the following place, which determines whether we give up on reconnection:
> {code}
> if (null != channel) {
>     LOG.info("connection established to a remote host " + name() + ", " + channel.toString());
>     channelRef.set(channel);
> } else {
>     close();
>     throw new RuntimeException("Remote address is not reachable. We will close this client " + name());
> }
> {code}
> I guess (I have not tried it yet) that the _channel_ object is not _null_ if
> this is a real reconnection. So the method returns a _channel_ object, and then
> reconnection starts again and again.
> This might be fixed by explicitly adding *current = null;* to the following code
> block of the same method:
> {code}
> if (!future.isSuccess()) {
>     if (null != current) {
>         current.close();
>     }
> }
> {code}
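To make the hypothesis above concrete, here is a toy Java model of it. It is not the Storm Client code: ReconnectSketch, FakeChannel, connect(...) and its boolean flags are hypothetical names invented for illustration, and the remote server is simply assumed to stay dead. The sketch shows that if the retry loop keeps a reference to the last (closed) channel, the final null check passes and the caller believes it is connected, whereas resetting the reference on failure lets the method give up.

```java
// Toy model of the hypothesis from the issue description.
// NOT the Storm Client code; all names here are hypothetical.
class ReconnectSketch {

    /** Hypothetical stand-in for a Netty channel. */
    static final class FakeChannel {
        private boolean open = true;
        void close() { open = false; }
        boolean isOpen() { return open; }
    }

    /**
     * Simplified retry loop.
     *
     * @param maxRetries     number of retries after the first attempt
     * @param serverAlive    whether a connection attempt succeeds
     * @param resetOnFailure true applies the proposed "current = null" fix
     * @return the channel the caller would store as "connected"
     */
    static FakeChannel connect(int maxRetries, boolean serverAlive,
                               boolean resetOnFailure) {
        FakeChannel current = null;
        for (int tried = 0; tried <= maxRetries; tried++) {
            current = new FakeChannel();   // channel object of this attempt
            if (serverAlive) {
                break;                     // attempt succeeded, keep the channel
            }
            current.close();               // failed attempt: close the channel
            if (resetOnFailure) {
                current = null;            // proposed fix: drop the stale reference
            }
        }
        if (current != null) {
            // Without the fix, a dead server still leads here,
            // with a channel that is already closed.
            return current;
        }
        throw new RuntimeException("Remote address is not reachable");
    }

    public static void main(String[] args) {
        FakeChannel stale = connect(3, false, false);
        System.out.println("without fix: channel open = " + stale.isOpen());
        try {
            connect(3, false, true);
        } catch (RuntimeException e) {
            System.out.println("with fix: " + e.getMessage());
        }
    }
}
```

Whether the real connect method keeps a stale reference in this way still has to be checked against the Storm source; the description itself marks this as an untested guess.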
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)