[ 
https://issues.apache.org/jira/browse/STORM-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073515#comment-14073515
 ] 

Paul Poulosky commented on STORM-406:
-------------------------------------

We were able to reproduce this outside of Trident, using a modified Exclamation 
topology that has two workers and a parallelism of 1 on the spout and bolts.

    Word-Spout (worker1)
         |
         V
    Exclaim 1 (worker1)
         |
         V
    Exclaim 2 (worker2)

If worker2, which hosts the downstream bolt, is killed and relaunched, the 
upstream worker does not recognize that the connection went down and makes no 
attempt to reconnect.
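
The failure mode can be illustrated with a small, self-contained Java model. 
This is only a sketch and not Storm's Netty client code: the classes Cluster, 
WorkerProcess and Client, the port number 6701, and the reconnectOnFailure 
flag are all hypothetical names made up for the illustration. It assumes the 
upstream side caches one connection per downstream port and compares a client 
that never checks that connection with one that looks the port up again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaleConnectionSketch {

    /** One running worker process bound to a slot port (hypothetical model). */
    static class WorkerProcess {
        final List<String> received = new ArrayList<>();
        boolean alive = true;
    }

    /** Maps slot ports to the worker process currently bound to them. */
    static class Cluster {
        private final Map<Integer, WorkerProcess> slots = new HashMap<>();

        WorkerProcess launch(int port) {
            WorkerProcess process = new WorkerProcess();
            slots.put(port, process);
            return process;
        }

        void kill(int port) {
            WorkerProcess process = slots.remove(port);
            if (process != null) {
                process.alive = false;
            }
        }

        WorkerProcess lookup(int port) {
            return slots.get(port);
        }
    }

    /** Upstream side: caches its connection to the downstream port. */
    static class Client {
        private final Cluster cluster;
        private final int port;
        private final boolean reconnectOnFailure;
        private WorkerProcess connection;

        Client(Cluster cluster, int port, boolean reconnectOnFailure) {
            this.cluster = cluster;
            this.port = port;
            this.reconnectOnFailure = reconnectOnFailure;
            this.connection = cluster.lookup(port);
        }

        /** Returns true only if the message reached a live worker. */
        boolean send(String message) {
            if (connection == null || !connection.alive) {
                if (!reconnectOnFailure) {
                    return false; // stale connection: the message is lost
                }
                connection = cluster.lookup(port); // re-resolve the same port
                if (connection == null) {
                    return false; // nothing is listening on the port yet
                }
            }
            connection.received.add(message);
            return true;
        }
    }

    /** Kills and relaunches the downstream worker on the same port, then
     *  counts how many later messages the relaunched worker received. */
    public static int deliveredAfterRestart(boolean reconnectOnFailure) {
        Cluster cluster = new Cluster();
        int port = 6701; // hypothetical slot port
        cluster.launch(port);
        Client upstream = new Client(cluster, port, reconnectOnFailure);
        upstream.send("before-kill");

        cluster.kill(port);
        WorkerProcess relaunched = cluster.launch(port);

        upstream.send("after-restart-1");
        upstream.send("after-restart-2");
        return relaunched.received.size();
    }

    public static void main(String[] args) {
        System.out.println("without reconnect: " + deliveredAfterRestart(false)); // 0
        System.out.println("with reconnect: " + deliveredAfterRestart(true));     // 2
    }
}
```

In this model the relaunched worker receives nothing unless the upstream 
client notices the dead connection and resolves the port again, which mirrors 
the behavior observed with the two-worker topology above.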

We are working on a fix and will submit a patch soon. This should be a 
blocking issue for 0.9.3.

> Trident topologies getting stuck when using Netty transport (reproducible)
> --------------------------------------------------------------------------
>
>                 Key: STORM-406
>                 URL: https://issues.apache.org/jira/browse/STORM-406
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating, 0.9.1-incubating, 0.9.0.1
>         Environment: Linux, OpenJDK 7
>            Reporter: Danijel Schiavuzzi
>            Priority: Critical
>
> When using the new, default Netty transport, Trident topologies sometimes get 
> stuck, while under ZeroMQ everything works fine.
> I can reliably reproduce this issue by killing a Storm worker on a running 
> Trident topology. If the worker gets re-spawned on the same slot (port), the 
> topology stops processing. But if the worker re-spawns on a different port, 
> topology processing continues normally.
> The Storm cluster configuration is pretty standard: there are two Supervisor 
> nodes, one of which also runs Nimbus, UI and DRPC. I have four slots per 
> Supervisor, and run my test topology with setNumWorkers set to 8 so that 
> it occupies all eight slots across the cluster. Killing a worker in this 
> configuration will always re-spawn the worker on the same node and slot 
> (port), thus causing the topology to stop processing. This is 100% 
> reproducible on a few Storm clusters of mine, across multiple Storm versions 
> (0.9.0.1, 0.9.1, 0.9.2).
> I have reproduced this with multiple Trident topologies, the simplest of 
> which is the TridentWordCount topology from storm-starter. I've just modified 
> it a little to add an additional Trident filter to log the tuple throughput: 
> https://github.com/dschiavu/storm-trident-stuck-topology
> Non-transactional Trident topologies just silently stop processing. In 
> transactional topologies, the batches are continuously retried and re-emitted 
> by the spout; however, they never get processed by the next bolts in the 
> chain, so they time out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
