[ 
https://issues.apache.org/jira/browse/STORM-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303271#comment-14303271
 ] 

ASF GitHub Bot commented on STORM-329:
--------------------------------------

Github user miguno commented on the pull request:

    https://github.com/apache/storm/pull/268#issuecomment-72652704
  
    Some additional feedback regarding Storm's behavior in the face of 
failures, with and without this patch.  This summary is slightly simplified to 
make it a shorter read.
    
    ### Without this patch
    
    * Tested Storm versions: 0.9.2, 0.9.3, 0.10.0-SNAPSHOT
    * Configuration: default Storm settings (cf. `conf/defaults.yaml`)
    
    Here, we can confirm the cascading failure previously described in this 
pull request and the JIRA ticket.
    
    Consider a simple topology such as:
    
    ```
                    +-----> bolt1 -----> bolt2
                    |
         spout -----+
                    |
                    +-----> bolt3
    ```
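    
    For reference, here is a minimal sketch of how such a topology could be wired with Storm's `TopologyBuilder`.  The spout/bolt class names and parallelism values are placeholders for illustration only, not the ones from our actual test topology:
    
    ```java
    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;
    
    public class ExampleTopology {
      public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // WordSpout, Bolt1, Bolt2, Bolt3 are illustrative placeholder classes.
        builder.setSpout("spout", new WordSpout(), 2);
        builder.setBolt("bolt1", new Bolt1(), 2).shuffleGrouping("spout");
        builder.setBolt("bolt2", new Bolt2(), 2).shuffleGrouping("bolt1");
        builder.setBolt("bolt3", new Bolt3(), 2).shuffleGrouping("spout");
    
        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("example-topology", conf, builder.createTopology());
      }
    }
    ```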
    
    * If the instances of `bolt2` die (say, because of a runtime exception), 
then `bolt1` instances will enter a reconnect-until-success-or-die loop.
        * If Storm decides to place the restarted `bolt2` instances on 
different workers (read: machine+port pairs), then `bolt1` will eventually die.
        * If Storm places the restarted `bolt2` instances on the same workers, 
then the `bolt1` instances will not die because one of their reconnection 
attempts will succeed, and normal operation will resume.
    * If `bolt1` dies, too, then the `spout` instances enter the same reconnect-until-success-or-die loop.  Hence the cascading nature of the failure.  (A sketch of the retry settings that bound this reconnect loop follows below.)
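    
    For context, the reconnect loop mentioned above is bounded by the Netty client retry settings from `conf/defaults.yaml`; the defaults in effect during our tests are visible further down in the logs (`baseSleepTimeMs [100]`, `maxSleepTimeMs [1000]`, `maxRetries [300]`).  A minimal sketch, assuming one wanted to override them per topology (illustrative values, not recommendations):
    
    ```java
    import backtype.storm.Config;
    
    public class NettyRetrySettings {
      public static Config sketch() {
        Config conf = new Config();
        // Keys as they appear in conf/defaults.yaml; the values below are
        // illustrative only and shorten the reconnect window.
        conf.put("storm.messaging.netty.max_retries", 30);
        conf.put("storm.messaging.netty.min_wait_ms", 100);
        conf.put("storm.messaging.netty.max_wait_ms", 1000);
        return conf;
      }
    }
    ```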
    
    On top of that, we also observed the following later phases of this cascading failure in larger clusters, where each phase is less likely to occur than the previous one:
    
    1. Other spouts/bolts of the same topology -- "friends of friends" and so 
on -- may enter such loops.  In the example above, `bolt3` may start to die, 
too.
    2. Eventually the full topology may become dysfunctional, a zombie: not dead but not alive either.
    3. Other topologies in the cluster may then become zombies, too.
    4. The full Storm cluster may enter a zombie state.  This state may even turn out to be unrecoverable without a full cluster restart.
    
    > Funny anecdote: Because of a race condition you may even observe Storm spouts/bolts beginning to talk to the wrong peers, e.g. `spout` talking directly to `bolt2`, even though this violates the wiring of our topology.
    
    ### With this patch
    
    * Tested Storm versions: We primarily tested a patched 0.10.0-SNAPSHOT but also briefly tested a patched 0.9.3.
    * Configuration: default Storm settings (cf. `conf/defaults.yaml`)
    
    Here, the behavior is different.  We no longer observed cascading failures, but at the expense of "silent data loss" (see below).
    
    Again, consider this example topology:
    
    ```
                    +-----> bolt1 -----> bolt2
                    |
         spout -----+
                    |
                    +-----> bolt3
    ```
    
    With the patch, when the instances of `bolt2` die, the instances of `bolt1` will continue to run; i.e. they will not enter a reconnect-until-success-or-die loop anymore (which, particularly the not-dying part, was the purpose of the patch).
    
    **bolt2 behavior**
    
    We wrote a special-purpose "storm-bolt-of-death" topology that consistently throws runtime exceptions in `bolt2` (aka the bolt of death) whenever it receives an input tuple.  The following example shows the timeline of `bolt2` crashing intentionally.  We observed that once the `bolt2` instances were restarted -- and Storm would typically restart the instances on the same workers (read: machine+port combinations) -- they would not receive any new input tuples even though their upstream peer `bolt1` was up and running and constantly emitting output tuples.
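    
    For illustration, the gist of the bolt of death is roughly the following.  The real bolt lives in the storm-bolt-of-death code base and is written in Scala; this Java sketch (with a hypothetical class name) only mirrors its behavior:
    
    ```java
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    
    // Sketch of the "bolt of death": it throws a RuntimeException for every
    // input tuple, which takes down the hosting worker.
    public class BoltOfDeathSketch extends BaseBasicBolt {
    
      @Override
      public void execute(Tuple input, BasicOutputCollector collector) {
        throw new RuntimeException(
            "Intentionally throwing this exception to trigger bolt failures");
      }
    
      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // No output fields: this bolt never emits anything.
      }
    }
    ```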
    
    Summary of the `bolt2` log snippet below:
    
    * This `bolt2` instance dies at `12:49:21`, followed by an immediate restart (here: on the same machine+port).
    * The `bolt2` instance is up and running again at `12:49:32`, but it does not process any new input tuples until `52 mins` later.
        * In our testing we found that the restarted `bolt2` instances consistently took `52 mins` (!) to receive their first "new" input tuple from `bolt1`.
    
    ```
    # New input tuple => let's crash!   Now the shutdown procedure begins.
    
    2015-02-03 12:49:21 c.v.s.t.s.b.BoltOfDeath [ERROR] Intentionally throwing 
this exception to trigger bolt failures
    2015-02-03 12:49:21 b.s.util [ERROR] Async loop died!
    java.lang.RuntimeException: java.lang.RuntimeException: Intentionally 
throwing this exception to trigger bolt failures
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.executor$fn__6773$fn__6786$fn__6837.invoke(executor.clj:798)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
    Caused by: java.lang.RuntimeException: Intentionally throwing this 
exception to trigger bolt failures
        at 
com.verisign.storm.tools.sbod.bolts.BoltOfDeath.execute(BoltOfDeath.scala:77) 
~[stormjar.jar:0.1.0-SNAPSHOT]
        at 
backtype.storm.topology.BasicBoltExecutor.execute(BasicBoltExecutor.java:50) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.executor$fn__6773$tuple_action_fn__6775.invoke(executor.clj:660)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.executor$mk_task_receiver$fn__6696.invoke(executor.clj:416)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        ... 6 common frames omitted
    2015-02-03 12:49:21 b.s.d.executor [ERROR]
    java.lang.RuntimeException: java.lang.RuntimeException: Intentionally 
throwing this exception to trigger bolt failures
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.executor$fn__6773$fn__6786$fn__6837.invoke(executor.clj:798)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
    Caused by: java.lang.RuntimeException: Intentionally throwing this 
exception to trigger bolt failures
        at 
com.verisign.storm.tools.sbod.bolts.BoltOfDeath.execute(BoltOfDeath.scala:77) 
~[stormjar.jar:0.1.0-SNAPSHOT]
        at 
backtype.storm.topology.BasicBoltExecutor.execute(BasicBoltExecutor.java:50) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.executor$fn__6773$tuple_action_fn__6775.invoke(executor.clj:660)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.executor$mk_task_receiver$fn__6696.invoke(executor.clj:416)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        ... 6 common frames omitted
    2015-02-03 12:49:21 b.s.util [ERROR] Halting process: ("Worker died")
    java.lang.RuntimeException: ("Worker died")
        at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:329) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.6.0.jar:na]
        at 
backtype.storm.daemon.worker$fn__7196$fn__7197.invoke(worker.clj:536) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.executor$mk_executor_data$fn__6606$fn__6607.invoke(executor.clj:246)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.util$async_loop$fn__550.invoke(util.clj:482) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
    2015-02-03 12:49:21 b.s.d.worker [INFO] Shutting down worker 
bolt-of-death-topology-1-1422964754 783bcc5b-b571-4c7e-94f4-72ff49edc35e 6702
    2015-02-03 12:49:21 b.s.m.n.Client [INFO] Closing Netty Client 
Netty-Client-supervisor2/10.0.0.102:6702
    2015-02-03 12:49:21 b.s.m.n.Client [INFO] Waiting for pending batchs to be 
sent with Netty-Client-supervisor2/10.0.0.102:6702..., timeout: 600000ms, 
pendings: 0
    2015-02-03 12:49:21 b.s.m.n.Client [INFO] Closing Netty Client 
Netty-Client-supervisor1/10.0.0.101:6702
    2015-02-03 12:49:21 b.s.m.n.Client [INFO] Waiting for pending batchs to be 
sent with Netty-Client-supervisor1/10.0.0.101:6702..., timeout: 600000ms, 
pendings: 0
    2015-02-03 12:49:21 b.s.d.worker [INFO] Shutting down receive thread
    2015-02-03 12:49:21 o.a.s.c.r.ExponentialBackoffRetry [WARN] maxRetries too 
large (300). Pinning to 29
    2015-02-03 12:49:21 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The 
baseSleepTimeMs [100] the maxSleepTimeMs [1000] the maxRetries [300]
    2015-02-03 12:49:21 b.s.m.n.Client [INFO] New Netty Client, connect to 
supervisor3, 6702, config: , buffer_size: 5242880
    2015-02-03 12:49:21 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-supervisor3/127.0.1.1:6702... [0]
    2015-02-03 12:49:21 b.s.m.n.Client [INFO] connection established to a 
remote host Netty-Client-supervisor3/127.0.1.1:6702, [id: 0x8415ae96, 
/127.0.1.1:48837 => supervisor3/127.0.1.1:6702]
    2015-02-03 12:49:21 b.s.m.loader [INFO] Shutting down receiving-thread: 
[bolt-of-death-topology-1-1422964754, 6702]
    2015-02-03 12:49:21 b.s.m.n.Client [INFO] Closing Netty Client 
Netty-Client-supervisor3/127.0.1.1:6702
    2015-02-03 12:49:21 b.s.m.n.Client [INFO] Waiting for pending batchs to be 
sent with Netty-Client-supervisor3/127.0.1.1:6702..., timeout: 600000ms, 
pendings: 0
    2015-02-03 12:49:21 b.s.m.loader [INFO] Waiting for 
receiving-thread:[bolt-of-death-topology-1-1422964754, 6702] to die
    2015-02-03 12:49:21 b.s.m.loader [INFO] Shutdown receiving-thread: 
[bolt-of-death-topology-1-1422964754, 6702]
    2015-02-03 12:49:21 b.s.d.worker [INFO] Shut down receive thread
    2015-02-03 12:49:21 b.s.d.worker [INFO] Terminating messaging context
    2015-02-03 12:49:21 b.s.d.worker [INFO] Shutting down executors
    2015-02-03 12:49:21 b.s.d.executor [INFO] Shutting down executor 
bolt-of-death-A2:[3 3]
    2015-02-03 12:49:21 b.s.util [INFO] Async loop interrupted!
    
    # Now the restart begins, which happened to be on the same machine+port.
    
    2015-02-03 12:49:28 o.a.s.z.ZooKeeper [INFO] Client 
environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
    2015-02-03 12:49:28 o.a.s.z.ZooKeeper [INFO] Client 
environment:host.name=supervisor3
    [...]
    2015-02-03 12:49:32 b.s.d.executor [INFO] Prepared bolt bolt-of-death-A2:(3)
    
    # Now the bolt instance would stay idle for the next 52 mins.
    # Only metrics-related log messages were reported.
    
    2015-02-03 12:50:31 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor2/10.0.0.102:6702
    2015-02-03 12:50:31 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor1/10.0.0.101:6702
    2015-02-03 12:50:31 b.s.m.n.Server [INFO] Getting metrics for server on 6702
    2015-02-03 12:51:31 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor2/10.0.0.102:6702
    2015-02-03 12:51:31 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor1/10.0.0.101:6702
    2015-02-03 12:51:31 b.s.m.n.Server [INFO] Getting metrics for server on 6702
    [...]
    ```
    
    Until then the restarted `bolt2` instances were idle, with logs showing:
    
    ```
    # Restart-related messages end with the "I am prepared!" log line below.
    2015-02-03 11:58:52 b.s.d.executor [INFO] Prepared bolt bolt2:(3)
    
    # Then, for the next 52 minutes, only log messages relating to metrics were 
reported.
    2015-02-03 11:59:52 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor2/10.0.0.102:6702
    2015-02-03 11:59:52 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor1/10.0.0.101:6702
    2015-02-03 11:59:52 b.s.m.n.Server [INFO] Getting metrics for server on 6702
    2015-02-03 12:00:52 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor2/10.0.0.102:6702
    2015-02-03 12:00:52 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor1/10.0.0.101:6702
    2015-02-03 12:00:52 b.s.m.n.Server [INFO] Getting metrics for server on 6702
    2015-02-03 12:01:52 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor2/10.0.0.102:6702
    2015-02-03 12:01:52 b.s.m.n.Client [INFO] Getting metrics for connection to 
supervisor1/10.0.0.101:6702
    2015-02-03 12:01:52 b.s.m.n.Server [INFO] Getting metrics for server on 6702
    
    # 52 minutes after the bolt restart it finally started to process data 
again.
    2015-02-03 12:49:21 ...
    ```
    
    **bolt1 behavior**
    
    During this time the upstream peer `bolt1` happily reported an increasing 
number of emitted tuples, and there were no errors in the UI or in the logs.  
Here is an example log snippet of `bolt1` at the time when `bolt2` died 
(`ForwarderBolt` is `bolt1`).
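    
    For context, `ForwarderBolt` simply logs and re-emits every tuple it receives.  A hedged Java sketch of that behavior is shown below; the real bolt is part of the same storm-bolt-of-death code base, and the class name and the single `word` output field here are assumptions for illustration:
    
    ```java
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    
    // Sketch of bolt1 (ForwarderBolt): log each input tuple and pass its
    // (assumed) "word" field on to the downstream bolt.
    public class ForwarderBoltSketch extends BaseBasicBolt {
    
      private static final Logger LOG = LoggerFactory.getLogger(ForwarderBoltSketch.class);
    
      @Override
      public void execute(Tuple input, BasicOutputCollector collector) {
        LOG.info("Forwarding tuple {}", input);
        collector.emit(new Values(input.getString(0)));
      }
    
      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
      }
    }
    ```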
    
    * `bolt1` complains about a failed connection to `bolt2` at `12:52:24`, which is about `3 mins` after the `bolt2` instance died at `12:49:21`.
    * `bolt1` subsequently reports it re-established a connection to `bolt2` at 
`12:52:24` (the log timestamp granularity is 1 second).
        * `bolt1` reports 9 new output tuples but -- if my understanding of the new patch is correct -- the actual sending of those tuples now happens asynchronously.
    * `bolt1` complains about another failed connection to `bolt2` at 
`12:52:25` (and another connection failure to a second instance of `bolt2` at 
`12:52:26`).
    * `bolt1` would then report new output tuples, but those would not reach the downstream `bolt2` instances until 52 minutes later.
    
    ```
    2015-02-03 12:52:24 b.s.m.n.StormClientErrorHandler [INFO] Connection 
failed Netty-Client-supervisor3/10.0.0.103:6702
    java.nio.channels.ClosedChannelException: null
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:725) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:704) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:671) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.messaging.netty.Client.flushRequest(Client.java:398) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.messaging.netty.Client.send(Client.java:279) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086$fn__7087.invoke(worker.clj:351)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086.invoke(worker.clj:349)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_loop_STAR_$fn__884.invoke(disruptor.clj:94) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.6.0.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
    2015-02-03 12:52:24 b.s.m.n.Client [INFO] failed to send requests to 
supervisor3/10.0.0.103:6702:
    java.nio.channels.ClosedChannelException: null
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:725) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:704) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:671) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.messaging.netty.Client.flushRequest(Client.java:398) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.messaging.netty.Client.send(Client.java:279) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086$fn__7087.invoke(worker.clj:351)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086.invoke(worker.clj:349)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_loop_STAR_$fn__884.invoke(disruptor.clj:94) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
    2015-02-03 12:52:24 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-supervisor3/10.0.0.103:6702... [0]
    2015-02-03 12:52:24 b.s.m.n.Client [INFO] connection established to a 
remote host Netty-Client-supervisor3/10.0.0.103:6702, [id: 0xa58c119c, 
/10.0.0.102:44392 => supervisor3/10.0.0.103:6702]
    2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [golda]
    2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [bertels]
    2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [mike]
    2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [golda]
    2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [bertels]
    2015-02-03 12:52:24 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [golda]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [nathan]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [golda]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [nathan]
    2015-02-03 12:52:25 b.s.m.n.StormClientErrorHandler [INFO] Connection 
failed Netty-Client-supervisor3/10.0.0.103:6702
    java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_75]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) 
~[na:1.7.0_75]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) 
~[na:1.7.0_75]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_75]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) 
~[na:1.7.0_75]
        at 
org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
[na:1.7.0_75]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
[na:1.7.0_75]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [jackson]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [nathan]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [jackson]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [mike]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [golda]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [golda]
    2015-02-03 12:52:25 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [bertels]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [mike]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [golda]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [mike]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [golda]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [jackson]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [bertels]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [mike]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [golda]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [nathan]
    2015-02-03 12:52:26 b.s.m.n.StormClientErrorHandler [INFO] Connection 
failed Netty-Client-supervisor4/10.0.0.104:6702
    java.nio.channels.ClosedChannelException: null
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:725) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:704) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:671) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.messaging.netty.Client.flushRequest(Client.java:398) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.messaging.netty.Client.send(Client.java:279) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086$fn__7087.invoke(worker.clj:351)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086.invoke(worker.clj:349)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_loop_STAR_$fn__884.invoke(disruptor.clj:94) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.6.0.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
    2015-02-03 12:52:26 b.s.m.n.Client [INFO] failed to send requests to 
supervisor4/10.0.0.104:6702:
    java.nio.channels.ClosedChannelException: null
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:84)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:725) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
 ~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:704) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at org.apache.storm.netty.channel.Channels.write(Channels.java:671) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.AbstractChannel.write(AbstractChannel.java:248) 
~[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.messaging.netty.Client.flushRequest(Client.java:398) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.messaging.netty.Client.send(Client.java:279) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086$fn__7087.invoke(worker.clj:351)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__7086.invoke(worker.clj:349)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$clojure_handler$reify__871.onEvent(disruptor.clj:58) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
backtype.storm.disruptor$consume_loop_STAR_$fn__884.invoke(disruptor.clj:94) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at backtype.storm.util$async_loop$fn__550.invoke(util.clj:472) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
    2015-02-03 12:52:26 b.s.m.n.Client [INFO] Reconnect started for 
Netty-Client-supervisor4/10.0.0.104:6702... [0]
    2015-02-03 12:52:26 b.s.m.n.Client [INFO] connection established to a 
remote host Netty-Client-supervisor4/10.0.0.104:6702, [id: 0xa9bd1839, 
/10.0.0.102:37751 => supervisor4/10.0.0.104:6702]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [jackson]
    2015-02-03 12:52:26 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [mike]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [golda]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [bertels]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [bertels]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [golda]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [bertels]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [nathan]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [jackson]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [jackson]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [nathan]
    2015-02-03 12:52:27 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [bertels]
    2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [mike]
    2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [golda]
    2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [mike]
    2015-02-03 12:52:28 b.s.m.n.StormClientErrorHandler [INFO] Connection 
failed Netty-Client-supervisor4/10.0.0.104:6702
    java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_75]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) 
~[na:1.7.0_75]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) 
~[na:1.7.0_75]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_75]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) 
~[na:1.7.0_75]
        at 
org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) 
[storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
 [storm-core-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
[na:1.7.0_75]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
[na:1.7.0_75]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
    2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [golda]
    2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:10, stream: default, id: {}, [mike]
    2015-02-03 12:52:28 c.v.s.t.s.b.ForwarderBolt [INFO] Forwarding tuple 
source: wordSpout:9, stream: default, id: {}, [jackson]
    
    
    # From this point on the bolt would continuously report new output tuples
    # (cf. the "forwarding tuple" log messages above) until its downstream peer
    # came fully back to life (read: 52 mins after the restart).
    ```
    
    ### Current conclusion
    
    At the moment this patch does not seem to improve the situation.  (I wouldn't rule out that we screwed up merging the patch into the current 0.10.0 master, but we didn't run into any merge conflicts, so I'd say we applied the patch correctly.)  Silent data loss is as bad as, and arguably worse than, a cascading failure.
    
    PS: We're about to share the storm-bolt-of-death topology, which might help the various people involved in this thread reproduce this issue deterministically.


> Add Option to Config Message handling strategy when connection timeout
> ----------------------------------------------------------------------
>
>                 Key: STORM-329
>                 URL: https://issues.apache.org/jira/browse/STORM-329
>             Project: Apache Storm
>          Issue Type: Improvement
>    Affects Versions: 0.9.2-incubating
>            Reporter: Sean Zhong
>            Priority: Minor
>              Labels: Netty
>         Attachments: storm-329.patch, worker-kill-recover3.jpg
>
>
> This is to address a [concern brought up|https://github.com/apache/incubator-storm/pull/103#issuecomment-43632986] during the work on STORM-297:
> {quote}
> [~revans2] wrote: Your logic makes sense to me on why these calls are blocking. My biggest concern around the blocking is in the case of a worker crashing. If a single worker crashes this can block the entire topology from executing until that worker comes back up. In some cases I can see that being something that you would want. In other cases I can see speed being the primary concern and some users would like to get partial data fast, rather than accurate data later.
> Could we make it configurable in a follow-up JIRA where we can have a max limit on the buffering that is allowed before we block, or throw data away (which is what zeromq does)?
> {quote}
> If a worker crashes suddenly, how should we handle the messages that were supposed to be delivered to that worker?
> 1. Should we buffer all messages indefinitely?
> 2. Should we block message sending until the connection is restored?
> 3. Should we configure a buffer limit, buffer messages up to that limit, and then block once the limit is reached?
> 4. Should we neither block nor buffer too much, but instead drop the messages and rely on Storm's built-in failover mechanism?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
