GitHub user miguno opened a pull request:
https://github.com/apache/storm/pull/428
STORM-329: fix cascading Storm failure by improving reconnection strategy
and buffering messages
This is an improved version of the original pull request discussed at
https://github.com/apache/storm/pull/268. Please refer to the discussion in
the link above.
**Note**: Please give attribution to @tedxia when merging the pull request
as he did a lot (most?) of the work in this pull request.
The changes in this pull request include:
- Most importantly, we fix a bug in Storm that may cause a cascading
failure in a Storm cluster, to the point where the whole cluster becomes
unusable. This is achieved by the work described in the next bullet points.
- We refactor and improve the Netty messaging backend, notably the client.
- During the initial startup of a topology, Storm will now wait until
worker (Netty) connections are ready for operation. See the [original
discussion thread](https://github.com/apache/storm/pull/268) for the detailed
explanation and justification of this change.
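
To illustrate the buffering idea behind these changes, here is a minimal sketch, not the code in this pull request. All names (`BufferingClientSketch`, `Transport`, `maxBuffered`) are hypothetical, and the drop-oldest policy is an assumption made only to keep the buffer bounded. The sketch shows a client whose `send()` never throws when the connection is down: messages are held in a buffer and written out once the connection is usable again. Reconnection itself is left out; in a real client a background thread would re-establish the connection and then call `flush()`.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Illustrative sketch only; this is NOT Storm's actual Netty Client. */
class BufferingClientSketch {

    /** Hypothetical stand-in for a network channel to a remote worker. */
    interface Transport {
        boolean isConnected();
        void write(String message);
    }

    private final Transport transport;
    private final int maxBuffered;              // assumed upper bound on buffered messages
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean closed = false;
    private int dropped = 0;

    BufferingClientSketch(Transport transport, int maxBuffered) {
        this.transport = transport;
        this.maxBuffered = maxBuffered;
    }

    /** Buffers the message and tries to flush; never throws on a lost connection. */
    synchronized void send(String message) {
        if (closed) {
            dropped++;                          // a closed client accepts nothing
            return;
        }
        if (pending.size() >= maxBuffered) {
            pending.poll();                     // assumption: drop the oldest message when full
            dropped++;
        }
        pending.add(message);
        flush();
    }

    /** Writes buffered messages while the connection is up; stops once the client is closed. */
    synchronized void flush() {
        while (!closed && transport.isConnected() && !pending.isEmpty()) {
            transport.write(pending.poll());
        }
    }

    /** Closes the client and discards anything still buffered. */
    synchronized void close() {
        closed = true;
        pending.clear();
    }

    synchronized int pendingCount() { return pending.size(); }

    synchronized int droppedCount() { return dropped; }
}
```

The point of the design is that a dead or slow remote worker costs the sender some buffered or dropped messages rather than an exception, so one worker's failure does not propagate to the workers that send to it.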
@clockfly, @tedxia: If possible, please add any further comments on STORM-329
to this pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/428.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #428
----
commit 8c52a5b518021a6beff372acbeb66a963a1d4f74
Author: xiajun <[email protected]>
Date: 2014-09-24T12:39:18Z
STORM-329 : buffer message in client and reconnect remote server async
commit 7cc8c8b1b59d415f1fa54e081127336fcdaeb706
Author: xiajun <[email protected]>
Date: 2014-09-26T02:43:55Z
STORM-329: fix continue flush after client had been closed
commit 9826bc9ef7aee8c83e90d56f52ac70ac7165d769
Author: xiajun <[email protected]>
Date: 2014-09-30T03:25:08Z
Client not clean timeout TaskMessage
commit 978c969fdb4b5904b6a87c100fbd80fe26bf39cf
Author: Sean Zhong <[email protected]>
Date: 2014-10-20T01:08:43Z
Merge remote-tracking branch 'upstream/master'
commit 44f8260bbf489c6a2741fb4d8f9196ea6ddb51cc
Author: Sean Zhong <[email protected]>
Date: 2014-10-20T01:21:25Z
STORM-404, STORM-510, STORM-329: 1. break the reconnection to target if
worker is informed that the target is the down, we that we avoid
RuntimeException when reconnection failed. 2. When worker get started, need to
make sure all target workers are alive before launching spouts 3. When target
worker is down, all messages sending to the target worker will be dropped.
commit dea5fbe35c4d9b18a89dae320f9fc985f25bd31a
Author: Sean Zhong <[email protected]>
Date: 2014-10-20T01:24:45Z
STORM-329: fix comment grammar
commit 16de9f3321827624865a4450e93bd53efb75ed93
Author: Sean Zhong <[email protected]>
Date: 2014-10-28T03:31:23Z
test
commit e8dcf9155c85d3541c9352faf9a0651614b93eb6
Author: Sean Zhong <[email protected]>
Date: 2014-10-29T01:56:19Z
STORM-329: avoid deadlock
commit 22e7014dedfd580de9dd1d6b2083c8fb3d77d406
Author: Sean Zhong <[email protected]>
Date: 2014-10-29T01:59:09Z
Revert "test"
This reverts commit 16de9f3321827624865a4450e93bd53efb75ed93.
commit ddef6667cdc6de9aebf0d4006ad9e7df2bfbb3bb
Author: Sean Zhong <[email protected]>
Date: 2014-10-29T08:14:34Z
Merge remote-tracking branch 'upstream/master'
commit baf3c628db3899def89ee92c752524d041bc8b40
Author: Sean Zhong <[email protected]>
Date: 2014-10-30T10:17:07Z
STORM-329: fix UT. Add a new flag in worker data "worker-active-flag", Wait
connections to be ready asyncly.
commit e1c463f5681dbf6a66c868d467aca14064da1e9b
Author: Sean Zhong <[email protected]>
Date: 2014-10-30T10:22:22Z
STORM-329: fix comments. Add a description about
"storm.messaging.netty.max_retries", that the reconnection period should also
be bigger than "storm.zookeeper.session.timeout", so that the reconnection can
be aborted(when target worker is dead) before the reconnection failed and throw
RunTimeException
commit 2d3fad121481da40258af27e6d7fbcb148365e76
Author: Sean Zhong <[email protected]>
Date: 2014-10-30T10:22:22Z
STORM-329: fix a integration issue
commit 60f04f9e397cf49e4fba6fe6a2f0bfb23d5a8605
Author: Sean Zhong <[email protected]>
Date: 2014-10-30T10:28:17Z
Merge branch 'master' of https://github.com/tedxia/incubator-storm
commit 41aafbecac2cf3295255c7dc9b299b8c0c555390
Author: xiajun <[email protected]>
Date: 2014-11-18T06:53:14Z
Merge remote-tracking branch 'remotes/apache-storm/0.9.3-branch' into
ted-master
Conflicts:
storm-core/src/jvm/backtype/storm/messaging/netty/Client.java
commit 2c39866cf8bbca136d3b88f796b9f847b282fdd7
Author: xiajun <[email protected]>
Date: 2014-11-27T11:07:42Z
Merge remote-tracking branch 'apache-storm/0.9.3-branch'
commit 0a72182175ae2bc23738fdd28c88fe15acc1a27a
Author: xiajun <[email protected]>
Date: 2014-11-28T02:09:08Z
remove sleep from connect to escape block send message to other worker
commit 66e6284fbddb7acaca5d991d2991df8e51d7874e
Author: xiajun <[email protected]>
Date: 2014-12-02T03:13:22Z
remove unused test code in netty_unit_test.clj
commit afb3a93691e6e44755744d3632cece60d6239a2b
Author: xiajun <[email protected]>
Date: 2014-12-10T08:43:46Z
remove Status from Client
commit c084a2441ea54371b8e6a8c48a5209750d7376b9
Author: xiajun <[email protected]>
Date: 2014-12-15T04:24:26Z
Merge remote-tracking branch 'apache-storm/master' into ted-master
Conflicts:
storm-core/src/clj/backtype/storm/daemon/worker.clj
storm-core/src/jvm/backtype/storm/messaging/netty/Client.java
storm-core/src/jvm/backtype/storm/messaging/netty/Server.java
storm-core/test/clj/backtype/storm/messaging/netty_unit_test.clj
commit 8ebaaf8dbc63df3c2691e0cc3ac5102af7721ec3
Author: Michael G. Noll <[email protected]>
Date: 2015-02-03T23:29:40Z
STORM-327: fix and improve Netty transport to prevent cascading failures
commit b42cd4ffb3d8fef7e30b0253cebeebf5ad5e5ee1
Author: Michael G. Noll <[email protected]>
Date: 2015-02-09T14:18:08Z
Merge remote-tracking branch 'upstream/master' into
0.10.0-SNAPSHOT-STORM-392-miguno-merge
commit 679274b1f96d657a74b2d4d4c9cd69076b93c43e
Author: Michael G. Noll <[email protected]>
Date: 2015-02-10T08:45:54Z
Use 4 spaces instead of 2 spaces
commit 5980fd6191b05ef132825496ef913f9d7245f089
Author: Michael G. Noll <[email protected]>
Date: 2015-02-10T08:50:45Z
Use DEBUG instead of INFO for log messages of the background flusher thread
commit 46c5b663ebe7179c1dc407dd4fcc0765732131d2
Author: Michael G. Noll <[email protected]>
Date: 2015-02-10T09:01:52Z
flushPendingMessages(): trigger a reconnect in case of connection loss
Also, we ensure that background flushing is enabled in case we find that
the channel is CONNECTED but not WRITABLE; this ensures that we can
re-try the flush operation in the scenario that the client is being
gracefully closed.
commit 0944dcea5d0a36029f4182501df17cf6ea6a127b
Author: Michael G. Noll <[email protected]>
Date: 2015-02-11T17:15:35Z
Merge remote-tracking branch 'upstream/master' into
0.10.0-SNAPSHOT-STORM-329
----