Robert Joseph Evans created STORM-450:
-----------------------------------------

             Summary: netty cascading failure on clean shutdown of worker
                 Key: STORM-450
                 URL: https://issues.apache.org/jira/browse/STORM-450
             Project: Apache Storm (Incubating)
          Issue Type: Bug
    Affects Versions: 0.9.2-incubating, 0.9.0.1, 0.9.3-incubating
            Reporter: Robert Joseph Evans
            Assignee: Robert Joseph Evans


We recently had an issue where a worker process shut down cleanly on 0.9.0.
Why the worker shut down is not the issue here, but the clean shutdown caused a
cascading failure that made a connected worker shut down too. This will be
even more of a problem in newer versions of Storm, where we give the worker
time to shut down cleanly instead of just killing it with kill -9.

Ideally the client should keep trying to reconnect, because the worker may
have exited on its own and will be respawned shortly. If it is rescheduled
elsewhere, the sending worker will eventually detect the new assignment and
reroute its messages accordingly. This is already what happens when the
connection is simply closed. There is no real reason for one side to know when
the other side is shutting down.
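A minimal sketch of the behavior proposed above, assuming a single-threaded client with a simulated connection: when the remote worker goes away, send() buffers messages instead of throwing, and reconnect attempts back off exponentially up to a cap. The class ReconnectingClient, its methods, the drop-oldest buffer policy, and the backoff parameters are all hypothetical; this is not Storm's netty Client API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * HYPOTHETICAL sketch, not Storm's backtype.storm.messaging.netty.Client.
 * Treats the remote worker going away as transient: send() buffers instead
 * of throwing, and reconnects are spaced with capped exponential backoff.
 */
public class ReconnectingClient {
    private final long baseDelayMs;
    private final long maxDelayMs;
    private final int maxBuffered;
    private final Deque<String> pending = new ArrayDeque<>();
    // Stands in for the network: messages that actually reached the remote worker.
    private final List<String> delivered = new ArrayList<>();
    private boolean connected = false; // starts disconnected until tryReconnect(true)
    private int failedAttempts = 0;

    public ReconnectingClient(long baseDelayMs, long maxDelayMs, int maxBuffered) {
        if (maxBuffered < 1) {
            throw new IllegalArgumentException("maxBuffered must be >= 1");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.maxBuffered = maxBuffered;
    }

    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at max. */
    public long backoffMillis(int attempt) {
        long delay = baseDelayMs << Math.min(attempt, 30); // bound the shift for typical base delays
        return Math.min(delay, maxDelayMs);
    }

    /** Never throws because the peer went away; buffers while disconnected. */
    public void send(String msg) {
        if (connected) {
            delivered.add(msg);
            return;
        }
        if (pending.size() == maxBuffered) {
            pending.pollFirst(); // assumed policy: drop the oldest message when full
        }
        pending.addLast(msg);
    }

    /** Called when the remote worker closes the connection, cleanly or not. */
    public void onRemoteClosed() {
        connected = false;
    }

    /**
     * One reconnect attempt; `remoteReachable` simulates whether the connect
     * succeeded. Returns 0 once connected, else how long to wait before retrying.
     */
    public long tryReconnect(boolean remoteReachable) {
        if (remoteReachable) {
            connected = true;
            failedAttempts = 0;
            while (!pending.isEmpty()) {
                delivered.add(pending.pollFirst()); // flush buffered messages in order
            }
            return 0;
        }
        return backoffMillis(failedAttempts++);
    }

    public boolean isConnected() { return connected; }
    public int pendingCount() { return pending.size(); }
    public List<String> delivered() { return new ArrayList<>(delivered); }
}
```

A bounded buffer keeps a long outage from exhausting memory; which messages to drop when it fills (here the oldest) is an assumption, not something this issue specifies.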

{code}
2014-08-11 19:00:17 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
        at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:130) ~[storm-core-0.9.0-wip21.jar:na]
        at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:101) ~[storm-core-0.9.0-wip21.jar:na]
        at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62) ~[storm-core-0.9.0-wip21.jar:na]
        at backtype.storm.disruptor$consume_loop_STAR_$fn__1999.invoke(disruptor.clj:74) ~[storm-core-0.9.0-wip21.jar:na]
        at backtype.storm.util$async_loop$fn__421.invoke(util.clj:400) ~[storm-core-0.9.0-wip21.jar:na]
        at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
        at java.lang.Thread.run(Thread.java:722) [na:1.7.0_17]
Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
        at backtype.storm.messaging.netty.Client.send(Client.java:118) ~[storm-netty-0.9.0-wip21.jar:na]
        at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4922$fn__4923.invoke(worker.clj:342) ~[storm-core-0.9.0-wip21.jar:na]
        at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4922.invoke(worker.clj:331) ~[storm-core-0.9.0-wip21.jar:na]
        at backtype.storm.disruptor$clojure_handler$reify__1986.onEvent(disruptor.clj:43) ~[storm-core-0.9.0-wip21.jar:na]
        at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:127) ~[storm-core-0.9.0-wip21.jar:na]
        ... 6 common frames omitted
2014-08-11 19:00:17 b.s.util [INFO] Halting process: ("Async loop died!")
{code}
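To make the trace above concrete, here is a heavily simplified, hypothetical Java model of the failure chain: the transfer handler calls send() on a client that is being closed, the exception escapes the consume loop (wrapped once, as in the trace), and the loop's error path halts the worker. None of these classes exist in Storm, and halting is modeled as a return value rather than actually exiting the JVM.

```java
import java.util.List;
import java.util.function.Consumer;

/** HYPOTHETICAL model of the failure chain in the trace; not Storm code. */
public class CascadeModel {
    /** Stand-in for a client that rejects all sends once it has been closed. */
    public static class ClosingClient {
        private boolean closing = false;

        public void close() { closing = true; }

        public void send(String msg, List<String> wire) {
            if (closing) {
                throw new RuntimeException(
                    "Client is being closed, and does not take requests any more");
            }
            wire.add(msg);
        }
    }

    /**
     * Stand-in for the worker's transfer loop: any exception escaping the
     * handler kills the loop, and the loop's error path halts the process.
     * Returns false where the real worker would halt the JVM.
     */
    public static boolean runTransferLoop(List<String> batch, Consumer<String> handler,
                                          List<String> log) {
        try {
            for (String msg : batch) {
                try {
                    handler.accept(msg);
                } catch (RuntimeException e) {
                    // Mirrors the outer RuntimeException wrapping seen in the trace.
                    throw new RuntimeException(e);
                }
            }
            return true;
        } catch (RuntimeException e) {
            log.add("Async loop died! " + e.getCause().getMessage());
            log.add("Halting process");
            return false;
        }
    }
}
```

In this model a single rejected send() is enough to take down the whole sending process, which is why treating a remote shutdown as a transient disconnect matters.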



--
This message was sent by Atlassian JIRA
(v6.2#6252)
