Danijel Schiavuzzi created STORM-406:
----------------------------------------

             Summary: Trident topologies getting stuck when using Netty transport (reproducible)
                 Key: STORM-406
                 URL: https://issues.apache.org/jira/browse/STORM-406
             Project: Apache Storm (Incubating)
          Issue Type: Bug
    Affects Versions: 0.9.2-incubating, 0.9.1-incubating, 0.9.0.1
         Environment: Linux, OpenJDK 7
            Reporter: Danijel Schiavuzzi
            Priority: Critical


With the new default Netty transport, Trident topologies sometimes get stuck, 
while under ZeroMQ everything works fine.

I can reliably reproduce this issue by killing a Storm worker on a running 
Trident topology. If the worker gets re-spawned on the same slot (port), the 
topology stops processing. But if the worker re-spawns on a different port, 
topology processing continues normally.

The Storm cluster configuration is fairly standard: there are two Supervisor 
nodes, one of which also runs Nimbus, the UI and DRPC. Each Supervisor has four 
slots, and I run my test topology with setNumWorkers set to 8 so that it 
occupies all eight slots across the cluster. Killing a worker in this 
configuration always re-spawns it on the same node and slot (port), which 
causes the topology to stop processing. This is 100% reproducible on several 
of my Storm clusters, across multiple Storm versions (0.9.0.1, 0.9.1, 0.9.2).
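
To make the slot arithmetic above explicit, here is a minimal toy model in 
plain Java. It is only a sketch and not Storm's actual scheduler: the class 
name SlotModel, its methods, the host names and the "fill slots in order" 
placement are all assumptions made for illustration. The ports are the default 
supervisor.slots.ports values. It shows that when eight workers occupy all 
eight slots, the only slot a replacement worker can take is the one that was 
just vacated.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only (hypothetical names); this is NOT Storm's scheduler.
public class SlotModel {
    // Default supervisor.slots.ports values: four slots per Supervisor.
    static final int[] PORTS = {6700, 6701, 6702, 6703};
    // Hypothetical host names for the two Supervisor nodes.
    static final String[] HOSTS = {"supervisor1", "supervisor2"};

    // Every slot in the cluster, written as "host:port".
    static List<String> allSlots() {
        List<String> slots = new ArrayList<>();
        for (String host : HOSTS) {
            for (int port : PORTS) {
                slots.add(host + ":" + port);
            }
        }
        return slots;
    }

    // Slots still free after numWorkers workers are placed,
    // assuming (for illustration) that slots are filled in order.
    static List<String> freeSlots(List<String> slots, int numWorkers) {
        int used = Math.min(numWorkers, slots.size());
        return new ArrayList<>(slots.subList(used, slots.size()));
    }

    // Slots a replacement worker could land on after the worker
    // on killedSlot dies: the free slots plus the vacated one.
    static List<String> respawnCandidates(int numWorkers, String killedSlot) {
        List<String> candidates = freeSlots(allSlots(), numWorkers);
        if (!candidates.contains(killedSlot)) {
            candidates.add(killedSlot);
        }
        return candidates;
    }

    public static void main(String[] args) {
        // 8 workers on 8 slots: the vacated slot is the only candidate.
        System.out.println(respawnCandidates(8, "supervisor1:6700"));
        // 6 workers on 8 slots: the worker may come back on another port.
        System.out.println(respawnCandidates(6, "supervisor1:6700"));
    }
}
```

With fewer workers than slots, the candidate list contains more than one 
entry, which matches the case where the worker re-spawns on a different port 
and processing continues.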

I have reproduced this with multiple Trident topologies, the simplest of which 
is the TridentWordCount topology from storm-starter. I've just modified it a 
little to add an additional Trident filter to log the tuple throughput: 
https://github.com/dschiavu/storm-trident-stuck-topology

Non-transactional Trident topologies just silently stop processing. In 
transactional topologies, the batches are continuously retried and re-emitted 
by the spout, but they are never processed by the next bolts in the chain, so 
they time out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
