pkuwm opened a new pull request #642: Fix zk session race condition when creating a live instance
URL: https://github.com/apache/helix/pull/642

### Issues

- [ ] My PR addresses the following Helix issues and references them in the PR description:

Fixes #641

When a storage node's network adapter has issues, the network connection is lost, which causes around 5-10 ZooKeeper sessions to expire. Reconnect events are created after each expiration. As a result, the node spent 40 minutes busy resetting its StateModel while the Helix controller still regarded it as online, so Helix did not move partition mastership to other storage nodes. This caused 40 minutes of downtime for users of these partitions.

The root cause is a zk session race condition: the zk session may expire and change before the live instance is created. So when a live instance (ephemeral node) is being created, if the expected session has expired, we should NOT create the ephemeral node.

### Description

- [ ] Here are some details about my PR, including screenshots of any UI changes:

This is a draft PR.

Change list:
1. Add a new interface ZkSessionAwareStateListener, which extends IZkStateListener and declares handleNewSession(final String sessionId) to handle new-session events that are session sensitive. ZkHelixManager implements the new interface.
2. Add a new public API createEphemeral(final String path, final Object data, final String sessionId) to create a session-aware ephemeral node.
3. Change retryUntilConnected() to guarantee that session-aware operations are completed in the expected session.
4. Change the code logic for new-session handling so that operations which must run in the expected session are completed in that session.
5. Add a session id field to ZkEvent.

### Tests

- [ ] The following tests are written for this issue:

To be added...

- [ ] The following is the result of the "mvn test" command on the appropriate module:

To be added...
### Commits

- [ ] My commits all reference appropriate Apache Helix GitHub issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation

- [ ] In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

### Code Quality

- [ ] My diff has been formatted using helix-style.xml
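
### Illustrative sketch

The session check described in the change list (create the ephemeral node only if the expected session is still the current one) can be sketched roughly as follows. This is not the code in this PR: `FakeZkClient`, `SessionMismatchException`, `expireSession`, and the path `/MyCluster/LIVEINSTANCES/node1` are hypothetical, in-memory stand-ins used only to show the idea, and the `data` argument of the proposed API is left out for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal in-memory illustration of a session-aware ephemeral create. */
public class SessionAwareEphemeralSketch {

  /** Hypothetical exception: the expected session is no longer current. */
  public static class SessionMismatchException extends RuntimeException {
    public SessionMismatchException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for a zk client; it does NOT talk to ZooKeeper. */
  public static class FakeZkClient {
    private String currentSessionId;
    // Maps an ephemeral path to the session id that owns it.
    private final Map<String, String> ephemeralOwners = new HashMap<>();

    public FakeZkClient(String initialSessionId) {
      this.currentSessionId = initialSessionId;
    }

    /** Simulates session expiry: old ephemeral nodes vanish, a new session starts. */
    public synchronized void expireSession(String newSessionId) {
      ephemeralOwners.clear();
      currentSessionId = newSessionId;
    }

    /** Creates the node only if expectedSessionId is still the current session. */
    public synchronized void createEphemeral(String path, String expectedSessionId) {
      if (!currentSessionId.equals(expectedSessionId)) {
        throw new SessionMismatchException(
            "expected session " + expectedSessionId + " but current is " + currentSessionId);
      }
      ephemeralOwners.put(path, currentSessionId);
    }

    public synchronized boolean exists(String path) {
      return ephemeralOwners.containsKey(path);
    }
  }

  public static void main(String[] args) {
    String path = "/MyCluster/LIVEINSTANCES/node1"; // hypothetical path
    FakeZkClient client = new FakeZkClient("session-1");

    // The new-session handler was handed "session-1", but the session
    // expires again before the handler gets to create the live instance.
    String sessionSeenByHandler = "session-1";
    client.expireSession("session-2");

    try {
      client.createEphemeral(path, sessionSeenByHandler);
    } catch (SessionMismatchException e) {
      System.out.println("Skipped stale create: " + e.getMessage());
    }

    // The handler for the newer session creates the node in its own session.
    client.createEphemeral(path, "session-2");
    System.out.println("Live instance exists: " + client.exists(path));
  }
}
```

In this sketch the stale handler fails fast instead of creating a node under a session it did not expect, which is the behavior the change list aims for; how the real client detects the mismatch is up to the actual implementation.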
