Danijel Schiavuzzi created STORM-406:
----------------------------------------
Summary: Trident topologies getting stuck when using Netty
transport (reproducible)
Key: STORM-406
URL: https://issues.apache.org/jira/browse/STORM-406
Project: Apache Storm (Incubating)
Issue Type: Bug
Affects Versions: 0.9.2-incubating, 0.9.1-incubating, 0.9.0.1
Environment: Linux, OpenJDK 7
Reporter: Danijel Schiavuzzi
Priority: Critical
When using the new default Netty transport, Trident topologies sometimes get
stuck, while under the ZeroMQ transport everything works fine.
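For reference, the transport is selected via the storm.messaging.transport
setting in storm.yaml. The snippet below is only an illustration of the two
values being compared (the standard transport class names in Storm 0.9.x):

    # storm.yaml -- transport selection (illustrative)
    storm.messaging.transport: "backtype.storm.messaging.netty.Context"   # Netty: gets stuck
    #storm.messaging.transport: "backtype.storm.messaging.zmq"            # ZeroMQ: works fine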
I can reliably reproduce this issue by killing a Storm worker on a running
Trident topology. If the worker gets re-spawned on the same slot (port), the
topology stops processing. But if the worker re-spawns on a different port,
topology processing continues normally.
The Storm cluster configuration is fairly standard: there are two Supervisor
nodes, one of which also runs Nimbus, the UI and DRPC. Each Supervisor has four
slots, and I run my test topology with setNumWorkers set to 8 so that it
occupies all eight slots across the cluster. Killing a worker in this
configuration means it is always re-spawned on the same node and slot (port),
which causes the topology to stop processing. This is 100% reproducible on
several Storm clusters of mine, across multiple Storm versions (0.9.0.1, 0.9.1,
0.9.2).
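For illustration, a minimal sketch of how the test topology is configured and
submitted (the topology name is arbitrary, and buildTopology() is a
hypothetical stand-in for the storm-starter TridentWordCount builder):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.generated.StormTopology;

    public class SubmitTridentWordCount {
        public static void main(String[] args) throws Exception {
            Config conf = new Config();
            // Occupy all eight slots: 4 slots per Supervisor x 2 Supervisors
            conf.setNumWorkers(8);
            // buildTopology() is a hypothetical stand-in for the (slightly
            // modified) TridentWordCount topology builder
            StormTopology topology = buildTopology();
            StormSubmitter.submitTopology("trident-wordcount-test", conf, topology);
        }
    }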
I have reproduced this with multiple Trident topologies, the simplest of which
is the TridentWordCount topology from storm-starter. I've only modified it
slightly, adding a Trident filter that logs the tuple throughput:
https://github.com/dschiavu/storm-trident-stuck-topology
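The added filter is trivial; the sketch below illustrates the kind of logging
filter used (the class name and log interval are illustrative, see the
repository above for the actual code). It is attached to the stream with a
plain .each(...) call and simply passes every tuple through:

    import storm.trident.operation.BaseFilter;
    import storm.trident.tuple.TridentTuple;

    // Illustrative throughput-logging filter: keeps every tuple and
    // periodically prints a running count so stalls are easy to spot.
    public class ThroughputLoggingFilter extends BaseFilter {
        private long count = 0;

        @Override
        public boolean isKeep(TridentTuple tuple) {
            count++;
            if (count % 1000 == 0) {
                System.out.println("Processed " + count + " tuples so far");
            }
            return true; // never filter anything out
        }
    }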
Non-transactional Trident topologies just silently stop processing.
Transactional topologies continuously retry the batches, which are re-emitted
by the spout but never processed by the downstream bolts in the chain, so they
time out.