[
https://issues.apache.org/jira/browse/STORM-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073634#comment-14073634
]
ASF GitHub Bot commented on STORM-406:
--------------------------------------
GitHub user kishorvpatil opened a pull request:
https://github.com/apache/incubator-storm/pull/205
[STORM-406] Fix for reconnect logic in netty client
- Check whether the channel ``isConnected``.
- If it is not, reconnect before the start of each batch (see the sketch below).
- Increase the max retries for the netty client so that the other worker gets
enough time to start/restart and begin accepting new netty connections.
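In case it helps review, here is a minimal sketch of the idea, using illustrative names (``ReconnectingSender``, ``ensureConnected``, ``maxRetries``, ``retryIntervalMs`` are not the actual ``backtype.storm.messaging.netty.Client`` code, just an outline of the reconnect-before-each-batch logic against the Netty 3 API):

```java
// Minimal sketch only: illustrative names, not the actual
// backtype.storm.messaging.netty.Client implementation.
import java.net.InetSocketAddress;

import org.jboss.netty.bootstrap.ClientBootstrap;
import org.jboss.netty.channel.Channel;
import org.jboss.netty.channel.ChannelFuture;

class ReconnectingSender {
    private final ClientBootstrap bootstrap;
    private final InetSocketAddress remoteAddr;
    private final int maxRetries;          // higher default gives restarted workers more time
    private final long retryIntervalMs;
    private volatile Channel channel;      // local channel variable, reset on reconnect

    ReconnectingSender(ClientBootstrap bootstrap, InetSocketAddress remoteAddr,
                       int maxRetries, long retryIntervalMs) {
        this.bootstrap = bootstrap;
        this.remoteAddr = remoteAddr;
        this.maxRetries = maxRetries;
        this.retryIntervalMs = retryIntervalMs;
    }

    // Called at the start of each batch: if the channel is not connected,
    // keep trying to reconnect up to maxRetries before giving up.
    private Channel ensureConnected() throws InterruptedException {
        int tries = 0;
        while ((channel == null || !channel.isConnected()) && tries < maxRetries) {
            ChannelFuture future = bootstrap.connect(remoteAddr);
            future.awaitUninterruptibly();
            if (future.isSuccess()) {
                channel = future.getChannel();   // reset the local channel variable
            } else {
                tries++;
                Thread.sleep(retryIntervalMs);   // let the restarted worker bind its port
            }
        }
        return channel;
    }

    void sendBatch(Object batch) throws InterruptedException {
        Channel ch = ensureConnected();
        if (ch != null && ch.isConnected()) {
            ch.write(batch);                     // hand the batch to Netty only when connected
        }
    }
}
```

The retry ceiling corresponds to the ``storm.messaging.netty.max_retries`` setting; raising its default gives a restarted worker more time to bind its port before the sending client gives up.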
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kishorvpatil/incubator-storm netty-client-fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-storm/pull/205.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #205
----
commit e1e6a602e330d71410b1876ca9fb6bfc29761f35
Author: Kishor Patil <[email protected]>
Date: 2014-07-24T19:34:38Z
Fix netty client reconnect issue
commit 9b3a632340b66f1fe493220c0e6c22c2912e8025
Author: Kishor Patil <[email protected]>
Date: 2014-07-24T19:36:41Z
Increase netty max retries defaults allowing more time for other workers to
come up
commit f1f5ecd92c7ea55d01163c8dbc2360466f34fd3a
Author: Kishor Patil <[email protected]>
Date: 2014-07-24T20:49:22Z
Increase max tries and reset local channel variable
----
> Trident topologies getting stuck when using Netty transport (reproducible)
> --------------------------------------------------------------------------
>
> Key: STORM-406
> URL: https://issues.apache.org/jira/browse/STORM-406
> Project: Apache Storm (Incubating)
> Issue Type: Bug
> Affects Versions: 0.9.2-incubating, 0.9.1-incubating, 0.9.0.1
> Environment: Linux, OpenJDK 7
> Reporter: Danijel Schiavuzzi
> Priority: Critical
> Labels: b
>
> When using the new default Netty transport, Trident topologies sometimes get
> stuck, while under ZeroMQ everything works fine.
> I can reliably reproduce this issue by killing a Storm worker on a running
> Trident topology. If the worker gets re-spawned on the same slot (port), the
> topology stops processing. But if the worker re-spawns on a different port,
> topology processing continues normally.
> The Storm cluster configuration is pretty standard: there are two Supervisor
> nodes, and one of them also runs Nimbus, UI and DRPC. I have four slots per
> Supervisor and run my test topology with setNumWorkers set to 8 so that it
> occupies all eight slots across the cluster. Killing a worker in this
> configuration will always re-spawn the worker on the same node and slot
> (port), thus causing the topology to stop processing. This is 100%
> reproducible on a few Storm clusters of mine, across multiple Storm versions
> (0.9.0.1, 0.9.1, 0.9.2).
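> For reference, the only non-default setting on the topology itself is the
> worker count, roughly:
> {code:java}
> Config conf = new Config();   // backtype.storm.Config
> conf.setNumWorkers(8);        // one worker per slot, filling all eight slots on the two nodes
> {code}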
> I have reproduced this with multiple Trident topologies, the simplest of
> which is the TridentWordCount topology from storm-starter. I've only modified
> it slightly, adding a Trident filter that logs the tuple throughput:
> https://github.com/dschiavu/storm-trident-stuck-topology
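> The added filter is just a pass-through that counts tuples and periodically
> logs the count; roughly the following (the exact class name and logging
> interval here are illustrative; the actual code is in the repository linked
> above):
> {code:java}
> import java.util.Map;
>
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
>
> import storm.trident.operation.BaseFilter;
> import storm.trident.operation.TridentOperationContext;
> import storm.trident.tuple.TridentTuple;
>
> // Pass-through filter that only counts tuples and logs the running count.
> public class ThroughputLoggingFilter extends BaseFilter {
>     private static final Logger LOG = LoggerFactory.getLogger(ThroughputLoggingFilter.class);
>     private transient long count;
>
>     @Override
>     public void prepare(Map conf, TridentOperationContext context) {
>         count = 0;
>     }
>
>     @Override
>     public boolean isKeep(TridentTuple tuple) {
>         count++;
>         if (count % 1000 == 0) {
>             LOG.info("processed {} tuples so far", count);
>         }
>         return true;  // keep every tuple; the filter only observes throughput
>     }
> }
> {code}
> A filter like this can be attached to the word stream with an extra
> .each(new Fields("word"), new ThroughputLoggingFilter()) call; nothing else
> in the topology needs to change.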
> Non-transactional Trident topologies just silently stop processing.
> Transactional topologies continuously retry the batches, which are re-emitted
> by the spout but never processed by the downstream bolts, so they eventually
> time out.
--
This message was sent by Atlassian JIRA
(v6.2#6252)