[ https://issues.apache.org/jira/browse/STORM-404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098018#comment-14098018 ]

Jungtaek Lim commented on STORM-404:
------------------------------------

I've tried to reproduce your situation with storm-starter, but I could not 
reproduce it.

I modified your scenario to use 3 machines (A, B, C; A also runs Nimbus), each 
with 4 worker ports (12 ports in total), and modified storm-starter to use 6 
workers.
After submitting the jar, 2 workers were created on each machine, leaving 6 
ports free.
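
A minimal sketch of that reproduction setup, assuming the storm-starter 
ExclamationTopology (the topology name and parallelism hints are illustrative; 
only the 6-worker setting matters here):

    // Hypothetical reproduction driver; assumes storm-starter's ExclamationBolt.
    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.testing.TestWordSpout;
    import backtype.storm.topology.TopologyBuilder;
    import storm.starter.ExclamationTopology.ExclamationBolt;

    public class ReproTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("word", new TestWordSpout(), 4);
            builder.setBolt("exclaim", new ExclamationBolt(), 4).shuffleGrouping("word");

            Config conf = new Config();
            conf.setNumWorkers(6);   // 6 workers spread over the 12 available ports
            StormSubmitter.submitTopology("repro", conf, builder.createTopology());
        }
    }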

I shut down B's supervisor (so its workers could not be restarted locally) and 
killed all workers belonging to B.
All remaining workers then kept trying to reconnect to the killed workers, as 
expected.

But Nimbus detected the failed workers and reassigned them to other ports.
The surviving workers picked up the new assignment (cleaning up their 
connections to the killed workers) and continued their tasks.
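
One way to confirm such a reassignment is to query Nimbus for the topology's 
executor placement. A hedged sketch via the Thrift API (the topology id 
argument and the exact getters are assumed from the storm-core 0.9.x bindings):

    // Hypothetical helper: prints component -> host:port for a given topology id.
    import java.util.Map;
    import backtype.storm.generated.ExecutorSummary;
    import backtype.storm.generated.TopologyInfo;
    import backtype.storm.utils.NimbusClient;
    import backtype.storm.utils.Utils;

    public class PrintAssignments {
        public static void main(String[] args) throws Exception {
            Map conf = Utils.readStormConfig();                  // storm.yaml + defaults
            NimbusClient nimbus = NimbusClient.getConfiguredClient(conf);
            TopologyInfo info = nimbus.getClient().getTopologyInfo(args[0]); // topology id
            for (ExecutorSummary e : info.get_executors()) {
                System.out.println(e.get_component_id() + " -> "
                        + e.get_host() + ":" + e.get_port());
            }
        }
    }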

> Worker on one machine crashes due to a failure of another worker on another 
> machine
> -----------------------------------------------------------------------------------
>
>                 Key: STORM-404
>                 URL: https://issues.apache.org/jira/browse/STORM-404
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>            Reporter: Itai Frenkel
>
> I have two workers (one on each machine). The first worker (10.30.206.125) had 
> a problem starting (it could not find the Nimbus host), but the second worker 
> crashed too because it could not connect to the first worker.
> This looks like a cascading failure, which seems like a bug.
> 2014-07-15 17:43:32 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [17]
> 2014-07-15 17:43:33 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [18]
> 2014-07-15 17:43:34 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [19]
> 2014-07-15 17:43:35 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [20]
> 2014-07-15 17:43:36 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [21]
> 2014-07-15 17:43:37 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [22]
> 2014-07-15 17:43:38 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [23]
> 2014-07-15 17:43:39 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [24]
> 2014-07-15 17:43:40 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [25]
> 2014-07-15 17:43:41 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [26]
> 2014-07-15 17:43:42 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [27]
> 2014-07-15 17:43:43 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [28]
> 2014-07-15 17:43:44 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [29]
> 2014-07-15 17:43:45 b.s.m.n.Client [INFO] Reconnect started for 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [30]
> 2014-07-15 17:43:46 b.s.m.n.Client [INFO] Closing Netty Client 
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700
> 2014-07-15 17:43:46 b.s.m.n.Client [INFO] Waiting for pending batchs to be 
> sent with Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700..., 
> timeout: 600000ms, pendings: 0
> 2014-07-15 17:43:46 b.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: Client is being 
> closed, and does not take requests any more
> at 
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at 
> backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at 
> backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at 
> backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
> Caused by: java.lang.RuntimeException: Client is being closed, and does not 
> take requests any more
> at backtype.storm.messaging.netty.Client.send(Client.java:194) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at 
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at 
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at 
> backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58) 
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at 
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
>  ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> ... 6 common frames omitted
> 2014-07-15 17:43:46 b.s.util [INFO] Halting process: ("Async loop died!")
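
For reference, the stack trace above corresponds to the Netty client rejecting 
sends once it has given up reconnecting. A simplified sketch of that failure 
path (not the actual storm-core source; names and structure are assumed for 
illustration):

    // Once close() is entered ("Closing Netty Client ..."), every further send()
    // throws; the exception propagates out of the worker's transfer loop
    // ("Async loop died!") and the worker halts itself.
    import java.util.concurrent.atomic.AtomicBoolean;

    class SketchNettyClient {
        private final AtomicBoolean beingClosed = new AtomicBoolean(false);

        void close() {
            // reached after the reconnect budget is exhausted
            beingClosed.set(true);
            // ... flush pending batches, release the channel ...
        }

        void send(byte[] message) {
            if (beingClosed.get()) {
                // surfaces as the RuntimeException seen in the trace above
                throw new RuntimeException(
                        "Client is being closed, and does not take requests any more");
            }
            // ... write the message to the Netty channel ...
        }
    }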


