GitHub user miguno opened a pull request:

    https://github.com/apache/storm/pull/428

    STORM-329: fix cascading Storm failure by improving reconnection strategy 
and buffering messages

    This is an improved version of the original pull request discussed at 
https://github.com/apache/storm/pull/268.  Please refer to the discussion in 
the link above.
    
    **Note**:  Please give attribution to @tedxia when merging the pull request 
as he did a lot (most?) of the work in this pull request.
    
    The changes of this pull request include:
    
    - Most importantly, we fix a bug in Storm that may cause a cascading 
failure in a Storm cluster, to the point where the whole cluster becomes 
unusable.  This is achieved by the work described in the next bullet points.
    - We refactor and improve the Netty messaging backend, notably the client.
    - During the initial startup of a topology, Storm will now wait until 
worker (Netty) connections are ready for operation.  See the [original 
discussion thread](https://github.com/apache/storm/pull/268) for the detailed 
explanation and justification of this change.
    
    @clockfly, @tedxia: Please add any further comments on STORM-329 to this 
pull request, if possible.
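
To make the "buffering messages" and "reconnection strategy" ideas above more concrete, here is a minimal, illustrative sketch. It is not the code from this pull request (the real client lives in `storm-core/src/jvm/backtype/storm/messaging/netty/Client.java`); the class `BufferingClient`, its methods, and the `max_pending` parameter are hypothetical names chosen for this example. The point is only that `send()` never raises on the caller when the remote side is down: messages are queued, and a separate reconnect step (which a real client would presumably run in the background) flushes them later.

```python
from collections import deque


class BufferingClient:
    """Toy client (hypothetical): buffers while disconnected, flushes on reconnect."""

    def __init__(self, connect, max_pending=1000):
        # `connect` is a callable that returns a send function,
        # or raises ConnectionError if the remote side is unreachable.
        self._connect = connect
        self._send = None  # None means "not connected"
        # Bounded buffer: once full, the oldest message is dropped.
        self._pending = deque(maxlen=max_pending)

    def send(self, message):
        # Never raise on the caller: enqueue first, then try to flush.
        self._pending.append(message)
        self.flush()

    def try_reconnect(self):
        # A real client would call this from a background thread or timer.
        try:
            self._send = self._connect()
        except ConnectionError:
            self._send = None
        return self._send is not None

    def flush(self):
        # Returns the number of messages delivered by this call.
        if self._send is None:
            return 0
        sent = 0
        while self._pending:
            message = self._pending[0]
            try:
                self._send(message)
            except ConnectionError:
                # Connection lost mid-flush: keep the rest buffered.
                self._send = None
                break
            self._pending.popleft()
            sent += 1
        return sent
```

A caller would invoke `send()` freely and arrange for `try_reconnect()` followed by `flush()` to run periodically while the connection is down. Whether buffered messages should eventually be dropped (for example when the target worker is known to be dead) is a policy decision that this sketch leaves out.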

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/428.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #428
    
----
commit 8c52a5b518021a6beff372acbeb66a963a1d4f74
Author: xiajun <[email protected]>
Date:   2014-09-24T12:39:18Z

    STORM-329 : buffer message in client and reconnect remote server async

commit 7cc8c8b1b59d415f1fa54e081127336fcdaeb706
Author: xiajun <[email protected]>
Date:   2014-09-26T02:43:55Z

    STORM-329: fix continue flush after client had been closed

commit 9826bc9ef7aee8c83e90d56f52ac70ac7165d769
Author: xiajun <[email protected]>
Date:   2014-09-30T03:25:08Z

    Client not clean timeout TaskMessage

commit 978c969fdb4b5904b6a87c100fbd80fe26bf39cf
Author: Sean Zhong <[email protected]>
Date:   2014-10-20T01:08:43Z

    Merge remote-tracking branch 'upstream/master'

commit 44f8260bbf489c6a2741fb4d8f9196ea6ddb51cc
Author: Sean Zhong <[email protected]>
Date:   2014-10-20T01:21:25Z

    STORM-404, STORM-510, STORM-329: 1. break the reconnection to target if 
worker is informed that the target is the down, we that we avoid 
RuntimeException when reconnection failed. 2. When worker get started, need to 
make sure all target workers are alive before launching spouts 3. When target 
worker is down, all messages sending to the target worker will be dropped.

commit dea5fbe35c4d9b18a89dae320f9fc985f25bd31a
Author: Sean Zhong <[email protected]>
Date:   2014-10-20T01:24:45Z

    STORM-329: fix comment grammar

commit 16de9f3321827624865a4450e93bd53efb75ed93
Author: Sean Zhong <[email protected]>
Date:   2014-10-28T03:31:23Z

    test

commit e8dcf9155c85d3541c9352faf9a0651614b93eb6
Author: Sean Zhong <[email protected]>
Date:   2014-10-29T01:56:19Z

    STORM-329: avoid deadlock

commit 22e7014dedfd580de9dd1d6b2083c8fb3d77d406
Author: Sean Zhong <[email protected]>
Date:   2014-10-29T01:59:09Z

    Revert "test"
    
    This reverts commit 16de9f3321827624865a4450e93bd53efb75ed93.

commit ddef6667cdc6de9aebf0d4006ad9e7df2bfbb3bb
Author: Sean Zhong <[email protected]>
Date:   2014-10-29T08:14:34Z

    Merge remote-tracking branch 'upstream/master'

commit baf3c628db3899def89ee92c752524d041bc8b40
Author: Sean Zhong <[email protected]>
Date:   2014-10-30T10:17:07Z

    STORM-329: fix UT. Add a new flag in worker data "worker-active-flag", Wait 
connections to be ready asyncly.

commit e1c463f5681dbf6a66c868d467aca14064da1e9b
Author: Sean Zhong <[email protected]>
Date:   2014-10-30T10:22:22Z

    STORM-329: fix comments. Add a description about 
"storm.messaging.netty.max_retries", that the reconnection period should also 
be bigger than "storm.zookeeper.session.timeout", so that the reconnection can 
be aborted(when target worker is dead) before the reconnection failed and throw 
RunTimeException

commit 2d3fad121481da40258af27e6d7fbcb148365e76
Author: Sean Zhong <[email protected]>
Date:   2014-10-30T10:22:22Z

    STORM-329: fix a integration issue

commit 60f04f9e397cf49e4fba6fe6a2f0bfb23d5a8605
Author: Sean Zhong <[email protected]>
Date:   2014-10-30T10:28:17Z

    Merge branch 'master' of https://github.com/tedxia/incubator-storm

commit 41aafbecac2cf3295255c7dc9b299b8c0c555390
Author: xiajun <[email protected]>
Date:   2014-11-18T06:53:14Z

    Merge remote-tracking branch 'remotes/apache-storm/0.9.3-branch' into 
ted-master
    
    Conflicts:
        storm-core/src/jvm/backtype/storm/messaging/netty/Client.java

commit 2c39866cf8bbca136d3b88f796b9f847b282fdd7
Author: xiajun <[email protected]>
Date:   2014-11-27T11:07:42Z

    Merge remote-tracking branch 'apache-storm/0.9.3-branch'

commit 0a72182175ae2bc23738fdd28c88fe15acc1a27a
Author: xiajun <[email protected]>
Date:   2014-11-28T02:09:08Z

    remove sleep from connect to escape block send message to other worker

commit 66e6284fbddb7acaca5d991d2991df8e51d7874e
Author: xiajun <[email protected]>
Date:   2014-12-02T03:13:22Z

    remove unused test code in netty_unit_test.clj

commit afb3a93691e6e44755744d3632cece60d6239a2b
Author: xiajun <[email protected]>
Date:   2014-12-10T08:43:46Z

    remove Status from Client

commit c084a2441ea54371b8e6a8c48a5209750d7376b9
Author: xiajun <[email protected]>
Date:   2014-12-15T04:24:26Z

    Merge remote-tracking branch 'apache-storm/master' into ted-master
    
    Conflicts:
        storm-core/src/clj/backtype/storm/daemon/worker.clj
        storm-core/src/jvm/backtype/storm/messaging/netty/Client.java
        storm-core/src/jvm/backtype/storm/messaging/netty/Server.java
        storm-core/test/clj/backtype/storm/messaging/netty_unit_test.clj

commit 8ebaaf8dbc63df3c2691e0cc3ac5102af7721ec3
Author: Michael G. Noll <[email protected]>
Date:   2015-02-03T23:29:40Z

    STORM-327: fix and improve Netty transport to prevent cascading failures

commit b42cd4ffb3d8fef7e30b0253cebeebf5ad5e5ee1
Author: Michael G. Noll <[email protected]>
Date:   2015-02-09T14:18:08Z

    Merge remote-tracking branch 'upstream/master' into 
0.10.0-SNAPSHOT-STORM-392-miguno-merge

commit 679274b1f96d657a74b2d4d4c9cd69076b93c43e
Author: Michael G. Noll <[email protected]>
Date:   2015-02-10T08:45:54Z

    Use 4 spaces instead of 2 spaces

commit 5980fd6191b05ef132825496ef913f9d7245f089
Author: Michael G. Noll <[email protected]>
Date:   2015-02-10T08:50:45Z

    Use DEBUG instead of INFO for log messages of the background flusher thread

commit 46c5b663ebe7179c1dc407dd4fcc0765732131d2
Author: Michael G. Noll <[email protected]>
Date:   2015-02-10T09:01:52Z

    flushPendingMessages(): trigger a reconnect in case of connection loss
    
    Also, we ensure that background flushing is enabled in case we find that
    the channel is CONNECTED but not WRITABLE;  this ensures that we can
    re-try the flush operation in the scenario that the client is being
    gracefully closed.

commit 0944dcea5d0a36029f4182501df17cf6ea6a127b
Author: Michael G. Noll <[email protected]>
Date:   2015-02-11T17:15:35Z

    Merge remote-tracking branch 'upstream/master' into 
0.10.0-SNAPSHOT-STORM-329

----
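
Commit e1c463f above notes that the reconnection period governed by "storm.messaging.netty.max_retries" should be bigger than "storm.zookeeper.session.timeout", so that a reconnection to a dead worker can be aborted before it fails with an exception. The following sketch shows one way to sanity-check a configuration against that rule. It makes assumptions that are not stated in this email: that the wait between attempts starts at "storm.messaging.netty.min_wait_ms", doubles after each failure, and is capped at "storm.messaging.netty.max_wait_ms", and that the session timeout is expressed in milliseconds. The function names and the sample values are illustrative only.

```python
def total_reconnect_period_ms(max_retries, min_wait_ms, max_wait_ms):
    """Worst-case time spent waiting between retries.

    ASSUMPTION: the wait doubles after each failed attempt,
    starting at min_wait_ms and capped at max_wait_ms.
    """
    total = 0
    wait = min_wait_ms
    for _ in range(max_retries):
        total += min(wait, max_wait_ms)
        wait = min(wait * 2, max_wait_ms)
    return total


def reconnect_outlasts_session(conf):
    """True if retrying lasts longer than the ZooKeeper session timeout."""
    period = total_reconnect_period_ms(
        conf["storm.messaging.netty.max_retries"],
        conf["storm.messaging.netty.min_wait_ms"],
        conf["storm.messaging.netty.max_wait_ms"],
    )
    return period > conf["storm.zookeeper.session.timeout"]


# Illustrative values only, not taken from this pull request.
sample_conf = {
    "storm.messaging.netty.max_retries": 30,
    "storm.messaging.netty.min_wait_ms": 100,
    "storm.messaging.netty.max_wait_ms": 1000,
    "storm.zookeeper.session.timeout": 20000,
}
```

If the check returns False for a given configuration, the client could exhaust its retries before the worker learns that the target is gone, which is the situation the commit message warns about.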


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
