pkuwm opened a new issue #641: Zk session race condition when creating a live instance
URL: https://github.com/apache/helix/issues/641

When a storage node's network adapter has issues, the network connection is lost, which causes around 5-10 ZooKeeper sessions to expire. Reconnect events are created after each expiration. The node then spends about 40 minutes resetting its StateModel while the Helix controller still regards the node as online, so Helix does not move partition mastership to other storage nodes. This caused 40 minutes of downtime for users of these partitions.

The root cause is a ZK session race condition: the ZK session may expire and change before the live instance is created. So when a live instance (an ephemeral node) is being created, if the expected session has expired, we should NOT create the ephemeral node.
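The proposed rule (create the ephemeral node only if the session that was expected is still the current one) can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `SessionGuardedZk`, `createEphemeral(path, expectedSessionId)`, `expireAndReconnect`, and the `/LIVEINSTANCES/node1` path are hypothetical names made up for this sketch, not Helix or ZooKeeper APIs, and the model ignores networking entirely.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical in-memory stand-in for a ZooKeeper connection.
 * It only models a session ID and the ephemeral nodes owned by sessions.
 */
class SessionGuardedZk {
    private long sessionId;
    // Maps an ephemeral node path to the ID of the session that owns it.
    private final Map<String, Long> ephemeralOwners = new HashMap<>();

    SessionGuardedZk(long initialSessionId) {
        this.sessionId = initialSessionId;
    }

    synchronized long getSessionId() {
        return sessionId;
    }

    /** Simulates session expiry followed by a reconnect with a new session. */
    synchronized void expireAndReconnect(long newSessionId) {
        final long expired = sessionId;
        // Ephemeral nodes disappear together with the session that owned them.
        ephemeralOwners.values().removeIf(owner -> owner == expired);
        sessionId = newSessionId;
    }

    /**
     * Creates the ephemeral node only if the current session is the one the
     * caller expected. The check and the create happen under the same lock,
     * so the session cannot change between them.
     */
    synchronized boolean createEphemeral(String path, long expectedSessionId) {
        if (sessionId != expectedSessionId) {
            return false; // expected session is gone: do NOT create the node
        }
        ephemeralOwners.put(path, sessionId);
        return true;
    }

    /** Returns the owning session ID, or null if the node does not exist. */
    synchronized Long ownerOf(String path) {
        return ephemeralOwners.get(path);
    }

    public static void main(String[] args) {
        SessionGuardedZk zk = new SessionGuardedZk(100L);
        long expected = zk.getSessionId();  // captured when the connect event is handled
        zk.expireAndReconnect(200L);        // session changes before the create runs
        String path = "/LIVEINSTANCES/node1"; // hypothetical path
        System.out.println("stale create succeeded: " + zk.createEphemeral(path, expected));
        System.out.println("fresh create succeeded: " + zk.createEphemeral(path, zk.getSessionId()));
    }
}
```

The point of the sketch is that the caller passes the session ID it captured when it began handling the connection event, and the create is refused if that session is no longer current. In a real client the comparison and the create would have to be tied together at the connection level, since checking the session ID and then creating the node in two separate steps would reintroduce the same race.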
