[ https://issues.apache.org/jira/browse/STORM-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303271#comment-14303271 ]
ASF GitHub Bot commented on STORM-329:
--------------------------------------
Github user miguno commented on the pull request:
https://github.com/apache/storm/pull/268#issuecomment-72652704
Some additional feedback regarding Storm's behavior in the face of
failures, with and without this patch. This summary is slightly simplified to
make it a shorter read.
### Without this patch
* Tested Storm versions: 0.9.2, 0.9.3, 0.10.0-SNAPSHOT
* Configuration: default Storm settings (cf. `conf/defaults.yaml`)
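For reference, the Netty-related defaults we ran with can also be expressed programmatically. The sketch below is illustrative only (the class name is made up); the values mirror `conf/defaults.yaml` and the retry parameters that show up in the worker logs further down.
```java
import backtype.storm.Config;

public class DefaultNettySettings {
    // Sketch of the Netty-related defaults used in our tests (cf. conf/defaults.yaml).
    public static Config defaults() {
        Config conf = new Config();
        conf.put(Config.STORM_MESSAGING_TRANSPORT, "backtype.storm.messaging.netty.Context");
        conf.put(Config.STORM_MESSAGING_NETTY_BUFFER_SIZE, 5242880); // 5 MB send buffer
        conf.put(Config.STORM_MESSAGING_NETTY_MAX_RETRIES, 300);     // reconnection attempts
        conf.put(Config.STORM_MESSAGING_NETTY_MIN_SLEEP_MS, 100);    // base reconnect backoff (ms)
        conf.put(Config.STORM_MESSAGING_NETTY_MAX_SLEEP_MS, 1000);   // max reconnect backoff (ms)
        return conf;
    }
}
```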
Here, we can confirm the cascading failure previously described in this
pull request and the JIRA ticket.
Consider a simple topology such as:
```
+-----> bolt1 -----> bolt2
|
spout -----+
|
+-----> bolt3
```
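To make the wiring concrete, this is roughly how such a topology would be declared via Storm's `TopologyBuilder` (a sketch only; the spout/bolt classes are placeholders, not our actual test code):
```java
import backtype.storm.topology.TopologyBuilder;

public class ExampleTopology {
    // Hypothetical wiring of the diagram above; parallelism values are arbitrary.
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout(), 2);                          // placeholder spout
        builder.setBolt("bolt1", new MyBolt1(), 2).shuffleGrouping("spout");  // spout -> bolt1
        builder.setBolt("bolt2", new MyBolt2(), 2).shuffleGrouping("bolt1");  // bolt1 -> bolt2
        builder.setBolt("bolt3", new MyBolt3(), 2).shuffleGrouping("spout");  // spout -> bolt3
        return builder;
    }
}
```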
* If the instances of `bolt2` die (say, because of a runtime exception),
then `bolt1` instances will enter a reconnect-until-success-or-die loop.
* If Storm decides to place the restarted `bolt2` instances on
different workers (read: machine+port pairs), then `bolt1` will eventually die.
* If Storm places the restarted `bolt2` instances on the same workers,
then the `bolt1` instances will not die because one of their reconnection
attempts will succeed, and normal operation will resume.
* If `bolt1` dies, too, then the `spout` instances enter the same
reconnect-until-success-or-die loop. Hence the cascading nature of the
failure.
On top of that, we also noticed the following later phases of this
cascading failure occur in larger clusters, where each phase is less likely
to happen than the previous one:
1. Other spouts/bolts of the same topology -- "friends of friends" and so
on -- may enter such loops. In the example above, `bolt3` may start to die,
too.
2. Eventually the full topology may become dysfunctional, a zombie: not
dead but not alive either.
3. Other topologies in the cluster may then become zombies, too.
4. The full Storm cluster may enter a zombie state. This state may even
turn out to be unrecoverable without a full cluster restart.
> Funny anecdote: Because of a race condition you may even observe that
Storm spouts/bolts begin to talk to the wrong peers, e.g. `spout` talking
directly to `bolt2`, even though this violates the wiring of our topology.
### With this patch
* Tested Storm versions: We primarily tested a patched 0.10.0-SNAPSHOT
version but also tested a patched 0.9.3 briefly.
* Configuration: default Storm settings (cf. `conf/defaults.yaml`)
Here, the behavior is different: we no longer observed cascading failures,
but at the expense of "silent data loss" (see below).
Again, consider this example topology:
```
+-----> bolt1 -----> bolt2
|
spout -----+
|
+-----> bolt3
```
With the patch, when the instances of `bolt2` die, the instances of
`bolt1` will continue to run; i.e. they will no longer enter a
reconnect-until-success-or-die loop (which, particularly the not-dying
part, was the purpose of the patch).
**bolt2 behavior**
We wrote a special-purpose "storm-bolt-of-death" topology that consistently
throws runtime exceptions in `bolt2` (aka the bolt of death) whenever it
receives an input tuple. The following example shows the timeline of `bolt2`
crashing intentionally. We observed that once the `bolt2` instances were
restarted -- and Storm would typically restart the instances on the same
workers (read: machine+port combinations) -- they would not receive any new
input tuples even though their upstream peer `bolt1` was up and running and
constantly emitting output tuples.
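For reproducibility, here is the gist of that bolt as a Java sketch (the actual `BoltOfDeath` in our test topology is written in Scala, so treat this as an approximation rather than the real code):
```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

// Approximation of the "bolt of death": it throws on every input tuple,
// which takes down the hosting worker.
public class BoltOfDeathSketch extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        throw new RuntimeException(
            "Intentionally throwing this exception to trigger bolt failures");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // No output streams: this bolt only crashes.
    }
}
```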
Summary of the `bolt2` log snippet below:
* This `bolt2` instance dies at `12:49:21`, followed by an immediate
restart (here: on the same machine+port).
* The `bolt2` instance is up and running at `12:49:32`, but it would not
process any new input tuples until `52 mins` later.
* In our testing we found that the restarted `bolt2` instances consistently
took `52 mins` (!) to receive their first "new" input tuple from `bolt1`.
```
# New input tuple => let's crash! Now the shutdown procedure begins.
2015-02-03 12:49:21 c.v.s.t.s.b.BoltOfDeath [ERROR] Intentionally throwing
this exception to trigger bolt failures
2015-02-03 12:49:21 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Intentionally
throwing this exception to trigger bolt failures
at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.executor$fn__6773$fn__6786$fn__6837.invoke(executor.clj:798)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
Caused by: java.lang.RuntimeException: Intentionally throwing this
exception to trigger bolt failures
at
com.verisign.storm.tools.sbod.bolts.BoltOfDeath.execute(BoltOfDeath.scala:77)
~[stormjar.jar:0.1.0-SNAPSHOT]
at
backtype.storm.topology.BasicBoltExecutor.execute(BasicBoltExecutor.java:50)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.executor$fn__6773$tuple_action_fn__6775.invoke(executor.clj:660)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.executor$mk_task_receiver$fn__6696.invoke(executor.clj:416)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
... 6 common frames omitted
2015-02-03 12:49:21 b.s.d.executor [ERROR]
java.lang.RuntimeException: java.lang.RuntimeException: Intentionally
throwing this exception to trigger bolt failures
at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.executor$fn__6773$fn__6786$fn__6837.invoke(executor.clj:798)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
Caused by: java.lang.RuntimeException: Intentionally throwing this
exception to trigger bolt failures
at
com.verisign.storm.tools.sbod.bolts.BoltOfDeath.execute(BoltOfDeath.scala:77)
~[stormjar.jar:0.1.0-SNAPSHOT]
at
backtype.storm.topology.BasicBoltExecutor.execute(BasicBoltExecutor.java:50)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.executor$fn__6773$tuple_action_fn__6775.invoke(executor.clj:660)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.executor$mk_task_receiver$fn__6696.invoke(executor.clj:416)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
... 6 common frames omitted
2015-02-03 12:49:21 b.s.util [ERROR] Halting process: ("Worker died")
java.lang.RuntimeException: ("Worker died")
at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:329)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.6.0.jar:na]
at
backtype.storm.daemon.worker$fn__7196$fn__7197.invoke(worker.clj:536)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.executor$mk_executor_data$fn__6606$fn__6607.invoke(executor.clj:246)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.util$async_loop$fn__550.invoke(util.clj:482)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
2015-02-03 12:49:21 b.s.d.worker [INFO] Shutting down worker
bolt-of-death-topology-1-1422964754 783bcc5b-b571-4c7e-94f4-72ff49edc35e 6702
2015-02-03 12:49:21 b.s.m.n.Client [INFO] Closing Netty Client
Netty-Client-supervisor2/10.0.0.102:6702
2015-02-03 12:49:21 b.s.m.n.Client [INFO] Waiting for pending batchs to be
sent with Netty-Client-supervisor2/10.0.0.102:6702..., timeout: 600000ms,
pendings: 0
2015-02-03 12:49:21 b.s.m.n.Client [INFO] Closing Netty Client
Netty-Client-supervisor1/10.0.0.101:6702
2015-02-03 12:49:21 b.s.m.n.Client [INFO] Waiting for pending batchs to be
sent with Netty-Client-supervisor1/10.0.0.101:6702..., timeout: 600000ms,
pendings: 0
2015-02-03 12:49:21 b.s.d.worker [INFO] Shutting down receive thread
2015-02-03 12:49:21 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too
large (300). Pinning to 29
2015-02-03 12:49:21 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The
baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [300]
2015-02-03 12:49:21 b.s.m.n.Client [INFO] New Netty Client, connect to
supervisor3, 6702, config: , buffer_size: 5242880
2015-02-03 12:49:21 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-supervisor3/127.0.1.1:6702... [0]
2015-02-03 12:49:21 b.s.m.n.Client [INFO] connection established to a
remote host Netty-Client-supervisor3/127.0.1.1:6702, [id: 0x8415ae96,
/127.0.1.1:48837 => supervisor3/127.0.1.1:6702]
2015-02-03 12:49:21 b.s.m.loader [INFO] Shutting down receiving-thread:
[bolt-of-death-topology-1-1422964754, 6702]
2015-02-03 12:49:21 b.s.m.n.Client [INFO] Closing Netty Client
Netty-Client-supervisor3/127.0.1.1:6702
2015-02-03 12:49:21 b.s.m.n.Client [INFO] Waiting for pending batchs to be
sent with Netty-Client-supervisor3/127.0.1.1:6702..., timeout: 600000ms,
pendings: 0
2015-02-03 12:49:21 b.s.m.loader [INFO] Waiting for
receiving-thread:[bolt-of-death-topology-1-1422964754, 6702] to die
2015-02-03 12:49:21 b.s.m.loader [INFO] Shutdown receiving-thread:
[bolt-of-death-topology-1-1422964754, 6702]
2015-02-03 12:49:21 b.s.d.worker [INFO] Shut down receive thread
2015-02-03 12:49:21 b.s.d.worker [INFO] Terminating messaging context
2015-02-03 12:49:21 b.s.d.worker [INFO] Shutting down executors
2015-02-03 12:49:21 b.s.d.executor [INFO] Shutting down executor
bolt-of-death-A2:[3 3]
2015-02-03 12:49:21 b.s.util [INFO] Async loop interrupted!
# Now the restart begins, which happened to be on the same machine+port.
2015-02-03 12:49:28 o.a.s.z.ZooKeeper [INFO] Client
environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
2015-02-03 12:49:28 o.a.s.z.ZooKeeper [INFO] Client
environment:host.name=supervisor3
[...]
2015-02-03 12:49:32 b.s.d.executor [INFO] Prepared bolt bolt-of-death-A2:(3)
# Now the bolt instance would stay idle for the next 52 mins.
# Only metrics related log messages were reported.
2015-02-03 12:50:31 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor2/10.0.0.102:6702
2015-02-03 12:50:31 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor1/10.0.0.101:6702
2015-02-03 12:50:31 b.s.m.n.Server [INFO] Getting metrics for server on 6702
2015-02-03 12:51:31 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor2/10.0.0.102:6702
2015-02-03 12:51:31 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor1/10.0.0.101:6702
2015-02-03 12:51:31 b.s.m.n.Server [INFO] Getting metrics for server on 6702
[...]
```
Until then the restarted `bolt2` instances were idle, with their logs showing:
```
# Restart-related messages end with the "I am prepared!" log line below.
2015-02-03 11:58:52 b.s.d.executor [INFO] Prepared bolt bolt2:(3)
# Then, for the next 52 minutes, only log messages relating to metrics were reported.
2015-02-03 11:59:52 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor2/10.0.0.102:6702
2015-02-03 11:59:52 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor1/10.0.0.101:6702
2015-02-03 11:59:52 b.s.m.n.Server [INFO] Getting metrics for server on 6702
2015-02-03 12:00:52 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor2/10.0.0.102:6702
2015-02-03 12:00:52 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor1/10.0.0.101:6702
2015-02-03 12:00:52 b.s.m.n.Server [INFO] Getting metrics for server on 6702
2015-02-03 12:01:52 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor2/10.0.0.102:6702
2015-02-03 12:01:52 b.s.m.n.Client [INFO] Getting metrics for connection to
supervisor1/10.0.0.101:6702
2015-02-03 12:01:52 b.s.m.n.Server [INFO] Getting metrics for server on 6702
# 52 minutes after the bolt restart it finally started to process data again.
2015-02-03 12:49:21 ...
```
**bolt1 behavior**
During this time the upstream peer `bolt1` happily reported an increasing
number of emitted tuples, and there were no errors in the UI or in the logs.
Here is an example log snippet of `bolt1` at the time when `bolt2` died
(`ForwarderBolt` is `bolt1`).
* `bolt1` complains about a failed connection to `bolt2` at `12:52:24`,
which is about `3 mins` after the `bolt2` instance died at `12:49:21`.
* `bolt1` subsequently reports it re-established a connection to `bolt2` at
`12:52:24` (the log timestamp granularity is 1 second).
* `bolt1` reports 9 new output tuples, but -- if my understanding of the
new patch is correct -- the actual sending of those tuples now happens
asynchronously.
* `bolt1` complains about another failed connection to `bolt2` at
`12:52:25` (and another connection failure to a second instance of `bolt2` at
`12:52:26`).
* `bolt1` would then report new output tuples, but those would not reach
the downstream `bolt2` instances until 52 minutes later.
```
2015-02-03 12:52:24 b.s.m.n.StormClientErrorHandler [INFO] Connection
failed Netty-Client-supervisor3/10.0.0.103:6702
java.nio.channels.ClosedChannelException: null
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:725)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:704)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:671)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.messaging.netty.Client.flushRequest(Client.java:398)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.messaging.netty.Client.send(Client.java:279)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086$fn__7087.invoke(worker.clj:351)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086.invoke(worker.clj:349)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_loop_STAR_$fn__884.invoke(disruptor.clj:94)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.6.0.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
2015-02-03 12:52:24 b.s.m.n.Client [INFO] failed to send requests to
supervisor3/10.0.0.103:6702:
java.nio.channels.ClosedChannelException: null
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:725)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:704)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:671)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.messaging.netty.Client.flushRequest(Client.java:398)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.messaging.netty.Client.send(Client.java:279)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086$fn__7087.invoke(worker.clj:351)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086.invoke(worker.clj:349)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_loop_STAR_$fn__884.invoke(disruptor.clj:94)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
2015-02-03 12:52:24 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-supervisor3/10.0.0.103:6702... [0]
2015-02-03 12:52:24 b.s.m.n.Client [INFO] connection established to a
remote host Netty-Client-supervisor3/10.0.0.103:6702, [id: 0xa58c119c,
/10.0.0.102:44392 => supervisor3/10.0.0.103:6702]
2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [golda]
2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [bertels]
2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [mike]
2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [golda]
2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [bertels]
2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [golda]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [nathan]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [golda]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [nathan]
2015-02-03 12:52:25 b.s.m.n.StormClientErrorHandler [INFO] Connection
failed Netty-Client-supervisor3/10.0.0.103:6702
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_75]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[na:1.7.0_75]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[na:1.7.0_75]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_75]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
~[na:1.7.0_75]
at
org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_75]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [jackson]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [nathan]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [jackson]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [mike]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [golda]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [golda]
2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [bertels]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [mike]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [golda]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [mike]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [golda]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [jackson]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [bertels]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [mike]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [golda]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [nathan]
2015-02-03 12:52:26 b.s.m.n.StormClientErrorHandler [INFO] Connection
failed Netty-Client-supervisor4/10.0.0.104:6702
java.nio.channels.ClosedChannelException: null
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:725)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:704)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:671)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.messaging.netty.Client.flushRequest(Client.java:398)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.messaging.netty.Client.send(Client.java:279)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086$fn__7087.invoke(worker.clj:351)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086.invoke(worker.clj:349)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_loop_STAR_$fn__884.invoke(disruptor.clj:94)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.6.0.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
2015-02-03 12:52:26 b.s.m.n.Client [INFO] failed to send requests to
supervisor4/10.0.0.104:6702:
java.nio.channels.ClosedChannelException: null
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:725)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:704)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at org.apache.storm.netty.channel.Channels.write(Channels.java:671)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.messaging.netty.Client.flushRequest(Client.java:398)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.messaging.netty.Client.send(Client.java:279)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086$fn__7087.invoke(worker.clj:351)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086.invoke(worker.clj:349)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
backtype.storm.disruptor$consume_loop_STAR_$fn__884.invoke(disruptor.clj:94)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
2015-02-03 12:52:26 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-supervisor4/10.0.0.104:6702... [0]
2015-02-03 12:52:26 b.s.m.n.Client [INFO] connection established to a
remote host Netty-Client-supervisor4/10.0.0.104:6702, [id: 0xa9bd1839,
/10.0.0.102:37751 => supervisor4/10.0.0.104:6702]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [jackson]
2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [mike]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [golda]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [bertels]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [bertels]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [golda]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [bertels]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [nathan]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [jackson]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [jackson]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [nathan]
2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [bertels]
2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [mike]
2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [golda]
2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [mike]
2015-02-03 12:52:28 b.s.m.n.StormClientErrorHandler [INFO] Connection
failed Netty-Client-supervisor4/10.0.0.104:6702
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_75]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[na:1.7.0_75]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[na:1.7.0_75]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_75]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
~[na:1.7.0_75]
at
org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_75]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_75]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [golda]
2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:10, stream: default, id: {}, [mike]
2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple
source: wordSpout:9, stream: default, id: {}, [jackson]
# From this point on the bolt would continuously report new output tuples
# (cf. the "Forwarding tuple" log messages above) until its downstream peer
# came fully back to life (read: 52 mins after the restart).
```
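For completeness, `bolt1` (`ForwarderBolt`) is essentially a pass-through bolt. A hypothetical Java sketch of such a bolt (the real implementation is in Scala and differs in details) looks like this:
```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Pass-through bolt: logs each input tuple and re-emits its first field.
public class ForwarderBoltSketch extends BaseBasicBolt {
    private static final Logger LOG = LoggerFactory.getLogger(ForwarderBoltSketch.class);

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        LOG.info("Forwarding tuple {}", input);
        collector.emit(new Values(input.getValue(0)));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```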
### Current conclusion
At the moment this patch does not seem to improve the situation. (I
wouldn't rule out that we screwed up merging the patch into the current
0.10.0 master, but we didn't run into any merge conflicts, so I'd say we
applied the patch correctly.) Silent data loss is as bad as, and arguably
worse than, a cascading failure.
PS: We're about to share the storm-bolt-of-death topology, which might help
with reproducing this issue in a deterministic way for the various people
involved in this thread.
> Add Option to Config Message handling strategy when connection timeout
> ----------------------------------------------------------------------
>
> Key: STORM-329
> URL: https://issues.apache.org/jira/browse/STORM-329
> Project: Apache Storm
> Issue Type: Improvement
> Affects Versions: 0.9.2-incubating
> Reporter: Sean Zhong
> Priority: Minor
> Labels: Netty
> Attachments: storm-329.patch, worker-kill-recover3.jpg
>
>
> This is to address a [concern brought
> up|https://github.com/apache/incubator-storm/pull/103#issuecomment-43632986]
> during the work at STORM-297:
> {quote}
> [~revans2] wrote: Your logic makes sense to me on why these calls are
> blocking. My biggest concern around the blocking is in the case of a worker
> crashing. If a single worker crashes this can block the entire topology from
> executing until that worker comes back up. In some cases I can see that being
> something that you would want. In other cases I can see speed being the
> primary concern and some users would like to get partial data fast, rather
> than accurate data later.
> Could we make it configurable on a follow up JIRA where we can have a max
> limit to the buffering that is allowed, before we block, or throw data away
> (which is what zeromq does)?
> {quote}
> If some worker crashes suddenly, how should we handle the messages that were
> supposed to be delivered to that worker?
> 1. Should we buffer all messages indefinitely?
> 2. Should we block message sending until the connection is restored?
> 3. Should we configure a buffer limit, try to buffer the messages first, and
> block once the limit is reached?
> 4. Should we neither block nor buffer too much, but instead drop the messages
> and rely on Storm's built-in failover mechanism?