[ 
https://issues.apache.org/jira/browse/STORM-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178611#comment-14178611
 ] 

Sergey Tryuber commented on STORM-537:
--------------------------------------

I was able to reproduce the issue with the following code snippet (it should be added to
[netty_unit_test.clj|https://github.com/apache/storm/blob/master/storm-core/test/clj/backtype/storm/messaging/netty_unit_test.clj]):
{code}
(deftest test-server-failed-permanently
  (let [req_msg (String. "0123456789abcdefghijklmnopqrstuvwxyz")
        storm-conf {STORM-MESSAGING-TRANSPORT "backtype.storm.messaging.netty.Context"
                    STORM-MESSAGING-NETTY-BUFFER-SIZE 1024
                    STORM-MESSAGING-NETTY-MAX-RETRIES 10
                    STORM-MESSAGING-NETTY-MIN-SLEEP-MS 1000
                    STORM-MESSAGING-NETTY-MAX-SLEEP-MS 5000
                    STORM-MESSAGING-NETTY-SERVER-WORKER-THREADS 1
                    STORM-MESSAGING-NETTY-CLIENT-WORKER-THREADS 1
                    }
        context (TransportFactory/makeContext storm-conf)
        client (.connect context nil "localhost" port)

        server (Thread.
                 (fn []
                   (let [server (.bind context nil port)
                         iter (.recv server 0 0)
                         resp (.next iter)]
                     (is (= task (.task resp)))
                     (is (= req_msg (String. (.message resp))))
                     (.close server)
                     )))
        _ (.start server)
        _ (println "Let the client connect to the server initially")
        _ (.send client task (.getBytes req_msg))
        _ (Thread/sleep 5000)
        _ (println "Permanently stopping the server")
        _ (.stop server)
        _ (Thread/sleep 5000)
        _ (println "Sending a message to the server")
        _ (.send client task (.getBytes req_msg))
        _ (println "We would expect to see RuntimeException(\"connection failed \" + name(), e) here")
        _ (.send client task (.getBytes req_msg))
        _ (println "But it wasn't raised. Indeed, we're trying to reconnect on every consecutive message")
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))
        _ (.send client task (.getBytes req_msg))
        ]
    (.close client)
    (.join server)
    (.term context)))
{code}
Note, this is not a complete test yet. 

The reconnect does not actually happen infinitely, but only 
STORM-NETTY-MESSAGE-BATCH-SIZE times (which takes quite a long time). Then it 
finally fails with a ClosedChannelException when it tries to write to the closed 
channel...
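
For comparison, the behavior the snippet expects (give up after a bounded number of reconnects, then fail on every later send) could be sketched roughly as below. This is a simplified Java model and not Storm's Client: the class GiveUpClientSketch, its methods, and the BooleanSupplier standing in for a connection attempt are all hypothetical. Only the "connection failed" RuntimeException comes from the test above.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical model of a client that gives up after a bounded number of reconnects. */
class GiveUpClientSketch {
    private final String name;
    private final int maxRetries;
    private final BooleanSupplier connectAttempt; // stand-in for a real connect call
    private boolean connected = false;
    private boolean closed = false;
    private int attempts = 0;

    GiveUpClientSketch(String name, int maxRetries, BooleanSupplier connectAttempt) {
        this.name = name;
        this.maxRetries = maxRetries;
        this.connectAttempt = connectAttempt;
    }

    /** Sends a message, reconnecting first if needed; fails fast once the client is closed. */
    void send(byte[] message) {
        if (closed) {
            throw new RuntimeException("connection failed " + name);
        }
        if (!connected) {
            reconnect();
        }
        // A real client would write the message to its channel here.
    }

    /** Tries to connect at most maxRetries times, then closes the client for good. */
    private void reconnect() {
        for (int tried = 0; tried < maxRetries; tried++) {
            attempts++;
            if (connectAttempt.getAsBoolean()) {
                connected = true;
                return;
            }
        }
        closed = true; // give up permanently instead of retrying on every message
        throw new RuntimeException("connection failed " + name);
    }

    int attempts() { return attempts; }
    boolean isClosed() { return closed; }
}
```

With a connection attempt that always fails, the first send exhausts the retries and throws, and every later send throws immediately without starting a new reconnect cycle. That is the behavior the println messages in the test describe as expected.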

> A worker reconnects infinitely to another dead worker
> -----------------------------------------------------
>
>                 Key: STORM-537
>                 URL: https://issues.apache.org/jira/browse/STORM-537
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.3
>            Reporter: Sergey Tryuber
>
> We're using 0.9.3-rc1. Most probably this wrong behavior was introduced as a 
> side effect of STORM-409. When I kill a worker, another worker starts to 
> print messages like:
> {noformat}
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:45:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> ..... so on
> {noformat}
> Then it reaches the default max_retries of 300 and starts the cycle again:
> {noformat}
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-<HOST>:4706, [id: 0xec088412, /<HOST>:39795 :> <HOST>:4706]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [0]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [1]
> 2014-10-20 11:54:38 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-<HOST>:4706... [2]
> {noformat}
> And so on infinitely... 
> The issue is most probably in the backtype.storm.messaging.netty.Client#connect 
> method, in the following place, which determines whether we give up on reconnection:
> {code}
> if (null != channel) {
>     LOG.info("connection established to a remote host " + name() + ", " + channel.toString());
>     channelRef.set(channel);
> } else {
>     close();
>     throw new RuntimeException("Remote address is not reachable. We will close this client " + name());
> }
> {code}
> I guess (I have not tried it yet) that the _channel_ object is not _null_ when this is a 
> real reconnection. So the method returns a _channel_ object, and then 
> reconnection starts again and again.
> This might be fixed by adding an explicit *current = null;* to the following code 
> block of the same method:
> {code}
> if (!future.isSuccess()) {
>     if (null != current) {
>         current.close();
>     }
> }
> {code}
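
The guess in the description above can be illustrated with a toy model. This is only a sketch of the hypothesis, not a copy of Client#connect: the class ReconnectHypothesisSketch and the FakeChannel type are hypothetical, and the model assumes (as the description guesses, without having verified it) that a failed attempt still yields a non-null object and that the final null check sees the same reference as current.

```java
/** Hypothetical model of the guess in the issue description; not Storm code. */
class ReconnectHypothesisSketch {
    /** Stand-in for a channel object; not a Netty type. */
    static final class FakeChannel {
        boolean closed = false;
        void close() { closed = true; }
    }

    /**
     * Models one failed connection attempt followed by the final null check.
     * Returns true when the "give up" branch (close and throw) would be taken.
     */
    static boolean wouldGiveUp(boolean applyProposedFix) {
        // Assumption: a failed attempt still produces a non-null object.
        FakeChannel current = new FakeChannel();
        boolean success = false; // the remote worker is dead

        if (!success) {
            if (null != current) {
                current.close();
            }
            if (applyProposedFix) {
                current = null; // the explicit reset proposed in the description
            }
        }

        // Assumption: the final check sees the reference held by current.
        FakeChannel channel = current;
        return channel == null;
    }
}
```

Under these assumptions, wouldGiveUp(false) returns false (a closed but non-null object passes the check, so the client never gives up), while wouldGiveUp(true) returns true. Whether the real method behaves this way still has to be checked against the actual code.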



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)