Github user miguno commented on the pull request:

    https://github.com/apache/storm/pull/268#issuecomment-73527901
  
    I have improved the patch in this pull request, which particularly meant 
modifications to
    
    The updated patch is availabe at:
    * 
https://github.com/miguno/storm/commits/0.10.0-SNAPSHOT-STORM-392-miguno-merge
    * The relevant commit is 
https://github.com/miguno/storm/commit/8ebaaf8dbc63df3c2691e0cc3ac5102af7721ec3.
    
    Code lineage: This improved patch includes the latest Storm 0.10.0-SNAPSHOT 
code as of today, Feb 09.  I used the patch in this pull request (#268) as the 
basis, added my changes, and then pulled in the latest upstream Storm changes 
for 0.10.0-SNAPSHOT via `git merge upstream/master`.
    
    High-level changes:
    
    - The great majority was refactoring the 
[Client.java](https://github.com/apache/storm/blob/master/storm-core/src/jvm/backtype/storm/messaging/netty/Client.java)
 in the Netty messaging backend.  For instance, I removed unused imports, 
obsolete code, clarified variables/methods, consolidated duplicate code, added 
documentation, improved logging (which also helped me a lot while I was 
bug-hunting), etc.
    - I also introduced a few functional changes to the code in order to fix 
the issue at hand.  Notably, the new code can properly detect disconnects / 
channel loss and trigger reconnect events.  This also meant that the new code 
is now more consistently verifying whether a channel from the client to the 
server is truly `CONNECTED`. Also, we're now using a `ListenableFuture` instead 
of a `Runnable` to handle the connection/reconnection logic, which leads to 
cleaner and more readable code IMHO.
    - I updated 
[netty_unit_test.clj](https://github.com/apache/storm/blob/master/storm-core/test/clj/backtype/storm/messaging/netty_unit_test.clj)
 so that all tests also wait until the Netty server and client instances are 
ready (i.e. all of their connections are ready).  This change was needed 
because the original+improved STORM-329 patches modify Storm so that it will 
ensure the readiness of server and client instances during the initial startup 
of a topology, but the Netty unit tests are test-driving servers and clients 
directly, so the tests couldn't rely on Storm's help.
    
    Testing:
    - The improved patch passes the full Storm's test suite.
    - I have successfully tested a patched version of Storm 0.10.0-SNAPSHOT 
(code as of today) in a Storm cluster via the aforementioned 
storm-bolt-of-death topology.
        - With the original patch, restarted bolt instances would not receive 
new incoming data.  With the improved patch the bolt instances start processing 
new data immediately (as expected).
    
    Next steps:
    - I would appreciate any feedback of other developers on this improved 
patch.  I'd particularly appreciate any performance tests in addition to 
functional tests.  (I do not expect a performance degradation but of course I 
might be wrong.)
    - The storm-bolt-of-death topology to trigger the cascading failure should 
be available by tomorrow for your testing convenience.
    
    Many thanks again to @tedxia and @clockfly for the original patch, which 
covered a lot (most?) of work already.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to