Github user miguno commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-73527901 I have improved the patch in this pull request, which particularly meant modifications to The updated patch is availabe at: * https://github.com/miguno/storm/commits/0.10.0-SNAPSHOT-STORM-392-miguno-merge * The relevant commit is https://github.com/miguno/storm/commit/8ebaaf8dbc63df3c2691e0cc3ac5102af7721ec3. Code lineage: This improved patch includes the latest Storm 0.10.0-SNAPSHOT code as of today, Feb 09. I used the patch in this pull request (#268) as the basis, added my changes, and then pulled in the latest upstream Storm changes for 0.10.0-SNAPSHOT via `git merge upstream/master`. High-level changes: - The great majority was refactoring the [Client.java](https://github.com/apache/storm/blob/master/storm-core/src/jvm/backtype/storm/messaging/netty/Client.java) in the Netty messaging backend. For instance, I removed unused imports, obsolete code, clarified variables/methods, consolidated duplicate code, added documentation, improved logging (which also helped me a lot while I was bug-hunting), etc. - I also introduced a few functional changes to the code in order to fix the issue at hand. Notably, the new code can properly detect disconnects / channel loss and trigger reconnect events. This also meant that the new code is now more consistently verifying whether a channel from the client to the server is truly `CONNECTED`. Also, we're now using a `ListenableFuture` instead of a `Runnable` to handle the connection/reconnection logic, which leads to cleaner and more readable code IMHO. - I updated [netty_unit_test.clj](https://github.com/apache/storm/blob/master/storm-core/test/clj/backtype/storm/messaging/netty_unit_test.clj) so that all tests also wait until the Netty server and client instances are ready (i.e. all of their connections are ready). This change was needed because the original+improved STORM-329 patches modify Storm so that it will ensure the readiness of server and client instances during the initial startup of a topology, but the Netty unit tests are test-driving servers and clients directly, so the tests couldn't rely on Storm's help. Testing: - The improved patch passes the full Storm's test suite. - I have successfully tested a patched version of Storm 0.10.0-SNAPSHOT (code as of today) in a Storm cluster via the aforementioned storm-bolt-of-death topology. - With the original patch, restarted bolt instances would not receive new incoming data. With the improved patch the bolt instances start processing new data immediately (as expected). Next steps: - I would appreciate any feedback of other developers on this improved patch. I'd particularly appreciate any performance tests in addition to functional tests. (I do not expect a performance degradation but of course I might be wrong.) - The storm-bolt-of-death topology to trigger the cascading failure should be available by tomorrow for your testing convenience. Many thanks again to @tedxia and @clockfly for the original patch, which covered a lot (most?) of work already.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---