pkuwm opened a new pull request #642: Fix zk session race condition when 
creating a live instance
URL: https://github.com/apache/helix/pull/642
 
 
   ### Issues
   
   - [ ] My PR addresses the following Helix issues and references them in the 
PR description:
   
   Fixes #641 
   
   When a storage node's network adapter has issues, the network connection is 
lost, which causes around 5-10 ZooKeeper sessions to expire. Reconnect events 
are created after each expiration. The node then spends about 40 minutes 
resetting its StateModel while the Helix controller still regards the node as 
online, so Helix does not move mastership of its partitions to other storage 
nodes. This caused 40 minutes of downtime for users of these partitions.
   
   The root cause is a ZK session race condition:
   The ZK session may expire and be replaced by a new one before the live 
instance is created. So when a live instance (ephemeral node) is being created 
and the expected session has already expired, we should NOT create the 
ephemeral node.
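   
   The guard described above can be sketched as follows. This is only an 
in-memory illustration, not Helix or ZooKeeper code: `FakeZkConnection`, 
`expireAndReconnect`, `ownerOf`, the session ids, and the node path are all 
hypothetical names chosen for the example. It shows a create request that 
captured session `s1` being rejected once the connection has moved on to `s2`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a ZooKeeper connection (not a real API).
class FakeZkConnection {
  private String sessionId;
  // Maps each ephemeral node path to the session that owns it.
  private final Map<String, String> ephemeralOwners = new HashMap<>();

  FakeZkConnection(String initialSessionId) {
    this.sessionId = initialSessionId;
  }

  synchronized String getSessionId() {
    return sessionId;
  }

  // Simulates expiry: nodes of the old session vanish and a new session starts.
  synchronized void expireAndReconnect(String newSessionId) {
    ephemeralOwners.clear();
    sessionId = newSessionId;
  }

  // Creates the node only if the connection is still in the expected session.
  synchronized boolean createEphemeral(String path, String expectedSessionId) {
    if (!sessionId.equals(expectedSessionId)) {
      return false; // stale request that was issued for an expired session
    }
    ephemeralOwners.put(path, sessionId);
    return true;
  }

  synchronized String ownerOf(String path) {
    return ephemeralOwners.get(path);
  }
}

class SessionGuardDemo {
  public static void main(String[] args) {
    FakeZkConnection zk = new FakeZkConnection("s1");
    String expected = zk.getSessionId(); // captured when the handler started
    zk.expireAndReconnect("s2");         // session changes before the create
    System.out.println("stale create accepted: "
        + zk.createEphemeral("/LIVEINSTANCES/node1", expected));
    System.out.println("fresh create accepted: "
        + zk.createEphemeral("/LIVEINSTANCES/node1", zk.getSessionId()));
  }
}
```

   The check and the create happen under one lock here, which a real client 
cannot assume; in a real system the server-side session ownership of the 
ephemeral node is what ultimately matters.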
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   This is a draft PR.
   
   Change list:
   1. Add a new interface ZkSessionAwareStateListener, which extends 
IZkStateListener and adds handleNewSession(final String sessionId) to handle 
new sessions in a session-sensitive way. ZkHelixManager implements the new 
interface.
   2. Add a new public API createEphemeral(final String path, final Object 
data, final String sessionId) to create a session-aware ephemeral node.
   3. Change retryUntilConnected() to guarantee that session-aware operations 
are completed in the expected session.
   4. Change the new-session handling logic so that operations that must run 
in the expected session are completed in that session.
   5. Add a session id field to ZkEvent.
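   
   To illustrate the idea behind item 3, here is a minimal sketch of a retry 
loop that gives up as soon as the session is no longer the expected one. It is 
not the PR's implementation: `SessionAwareRetry`, `retryInSession`, 
`ConnectionLossException`, and `SessionMismatchException` are hypothetical 
names, and the current session id is supplied by a plain `Supplier` rather 
than a real ZooKeeper client.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Hypothetical marker for a transient connection failure that may be retried.
class ConnectionLossException extends Exception {
  ConnectionLossException(String message) {
    super(message);
  }
}

// Hypothetical error raised when the expected session is no longer current.
class SessionMismatchException extends RuntimeException {
  SessionMismatchException(String expected, String actual) {
    super("expected session " + expected + " but current session is " + actual);
  }
}

final class SessionAwareRetry {
  private SessionAwareRetry() {}

  // Retries the operation on connection loss, but only while the current
  // session still matches the session the caller expects.
  static <T> T retryInSession(Supplier<String> currentSession,
                              String expectedSessionId,
                              Callable<T> operation,
                              int maxAttempts) throws Exception {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      String actual = currentSession.get();
      if (!expectedSessionId.equals(actual)) {
        // The session changed: do not run the operation in the wrong session.
        throw new SessionMismatchException(expectedSessionId, actual);
      }
      try {
        return operation.call();
      } catch (ConnectionLossException e) {
        // Transient failure: loop again, re-checking the session first.
      }
    }
    throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
  }
}

class SessionAwareRetryDemo {
  public static void main(String[] args) throws Exception {
    String[] session = {"s1"};
    String result = SessionAwareRetry.retryInSession(
        () -> session[0], "s1", () -> "created", 3);
    System.out.println(result);
  }
}
```

   Note that this sketch still leaves a window between the session check and 
the operation itself; it only shows why a retry must not silently continue in 
a newer session.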
   
   ### Tests
   
   - [ ] The following tests are written for this issue:
   
   To be added...
   
   - [ ] The following is the result of the "mvn test" command on the 
appropriate module:
   
   To be added...
   
   ### Commits
   
   - [ ] My commits all reference appropriate Apache Helix GitHub issues in 
their subject lines, and I have squashed multiple commits if they address the 
same issue. In addition, my commits follow the guidelines from "[How to write a 
good git commit message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation in the 
following wiki page:
   
   (Link the GitHub wiki you added)
   
   ### Code Quality
   
   - [ ] My diff has been formatted using helix-style.xml
