[
https://issues.apache.org/jira/browse/STORM-404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120928#comment-14120928
]
xiajun edited comment on STORM-404 at 9/4/14 4:09 AM:
------------------------------------------------------
[~kabhwan] I can reproduce this situation every time with the setup below:
In my case there are 5 machines: 4 run supervisors with 6 ports each and 1 runs
nimbus. My topology uses 24 workers, which means it takes every port on the
supervisors. The trick is that the prepare method of my bolt just throws an
exception and does nothing else, so that worker exits immediately. The other
workers then exit with the log that [~itaifrenkel] mentioned before.
I read the code and found that, in my situation, when the first worker (call it
worker A) exits, the other workers have not yet connected to worker A, so they
retry many times and then close the Client; note that the connect logic is
asynchronous and is started from the Client constructor. Because the connect
and send methods are synchronized, send must wait until connect returns, but by
the time send runs, close has already been called from the connect path and has
set closing to true, so send throws that exception.
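To make the interleaving concrete, here is a minimal sketch of the mechanism
described above, as a simplified stand-alone class; SketchClient, MAX_RETRIES
and tryConnect are illustrative names, not Storm's actual Client code:
{code:java}
// Minimal sketch of the race described above; not the real
// backtype.storm.messaging.netty.Client, just the interleaving.
public class SketchClient {
    private volatile boolean closing = false;
    private static final int MAX_RETRIES = 30; // illustrative value

    public SketchClient() {
        // the connect logic is kicked off asynchronously from the constructor
        new Thread(this::connect).start();
    }

    private synchronized void connect() {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            if (tryConnect()) {
                return; // connected
            }
            // (the real client waits between retries)
        }
        // all retries failed, e.g. the remote worker already exited:
        // the connect path closes the client itself
        close();
    }

    private synchronized void close() {
        closing = true;
    }

    public synchronized void send(Object tuple) {
        // send had to wait for the synchronized connect() to return;
        // by then close() has already set the flag
        if (closing) {
            throw new RuntimeException(
                    "Client is being closed, and does not take requests any more");
        }
        // ... hand the tuple to the channel ...
    }

    private boolean tryConnect() {
        // stand-in: the peer (worker A) is gone, so connecting always fails
        return false;
    }
}
{code}
Any send issued while connect is still retrying blocks on the monitor and then
hits the closing flag, which matches the stack trace in the issue description.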
But I am still not sure whether the following can happen: remove the exception
from Bolt::prepare, let worker A exit for some unknown reason, and let worker
B's connection to worker A fail after the retries. Worker B can still try to
send a tuple to worker A, which ends up calling send again and makes worker B
exit with that exception. You may say that nimbus will notice that worker A
exited and tell worker B to stop sending tuples to worker A, but workers and
nimbus are connected through zookeeper, and a worker only reads nimbus's
assignments periodically; this is done by mk-refresh-connections.
mk-refresh-connections and send share the same RW lock, so when the machine
load is heavy there is a chance that mk-refresh-connections is not called
between the sends, while the connection in the Client has already been closed.
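And a minimal sketch of the lock sharing described above, reusing the
SketchClient class from the previous sketch; SketchWorkerTransfer, sendTuple
and refreshConnections are illustrative names standing in for the transfer
loop and mk-refresh-connections, not the actual worker.clj code:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal sketch of how the transfer path and the periodic refresh could
// share one read-write lock; illustrative only.
public class SketchWorkerTransfer {
    private final ReentrantReadWriteLock endpointSocketLock = new ReentrantReadWriteLock();
    private final Map<String, SketchClient> connections = new ConcurrentHashMap<>();

    // Called from the transfer loop for every outgoing batch.
    public void sendTuple(String target, Object tuple) {
        endpointSocketLock.readLock().lock();
        try {
            SketchClient client = connections.get(target);
            if (client != null) {
                // If the refresh below has not run since the client closed
                // itself, this throws "Client is being closed ..." and the
                // worker dies with the error shown in the issue description.
                client.send(tuple);
            }
        } finally {
            endpointSocketLock.readLock().unlock();
        }
    }

    // Called only periodically, when the worker reads the new assignment
    // from zookeeper (the mk-refresh-connections role).
    public void refreshConnections(Map<String, SketchClient> freshAssignment) {
        endpointSocketLock.writeLock().lock();
        try {
            connections.clear();
            connections.putAll(freshAssignment); // drops connections to dead workers
        } finally {
            endpointSocketLock.writeLock().unlock();
        }
    }
}
{code}
Under heavy load nothing forces refreshConnections to run between two calls to
sendTuple, so the second call can still reach the client that has already
closed itself.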
> Worker on one machine crashes due to a failure of another worker on another
> machine
> -----------------------------------------------------------------------------------
>
> Key: STORM-404
> URL: https://issues.apache.org/jira/browse/STORM-404
> Project: Apache Storm (Incubating)
> Issue Type: Bug
> Affects Versions: 0.9.2-incubating
> Reporter: Itai Frenkel
>
> I have two workers (one on each machine). The first worker (10.30.206.125)
> had a problem starting (it could not find the Nimbus host); however, the
> second worker crashed too since it could not connect to the first worker.
> This looks like a cascading failure, which seems like a bug.
> 2014-07-15 17:43:32 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [17]
> 2014-07-15 17:43:33 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [18]
> 2014-07-15 17:43:34 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [19]
> 2014-07-15 17:43:35 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [20]
> 2014-07-15 17:43:36 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [21]
> 2014-07-15 17:43:37 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [22]
> 2014-07-15 17:43:38 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [23]
> 2014-07-15 17:43:39 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [24]
> 2014-07-15 17:43:40 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [25]
> 2014-07-15 17:43:41 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [26]
> 2014-07-15 17:43:42 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [27]
> 2014-07-15 17:43:43 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [28]
> 2014-07-15 17:43:44 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [29]
> 2014-07-15 17:43:45 b.s.m.n.Client [INFO] Reconnect started for
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700... [30]
> 2014-07-15 17:43:46 b.s.m.n.Client [INFO] Closing Netty Client
> Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700
> 2014-07-15 17:43:46 b.s.m.n.Client [INFO] Waiting for pending batchs to be
> sent with Netty-Client-ip-10-30-206-125.ec2.internal/10.30.206.125:6700...,
> timeout: 600000ms, pendings: 0
> 2014-07-15 17:43:46 b.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: Client is being
> closed, and does not take requests any more
> at
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at
> backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at
> backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at
> backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
> Caused by: java.lang.RuntimeException: Client is being closed, and does not
> take requests any more
> at backtype.storm.messaging.netty.Client.send(Client.java:194)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at
> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at
> backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> at
> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
> ... 6 common frames omitted
> 2014-07-15 17:43:46 b.s.util [INFO] Halting process: ("Async loop died!")