pkuwm opened a new issue #641: Zk session race condition when creating a live 
instance
URL: https://github.com/apache/helix/issues/641
 
 
   When a storage node's network adapter has issues, the network connection is 
lost, which causes around 5-10 ZooKeeper sessions to expire. Reconnect 
events are created after each expiration. The node then spent 40 minutes 
busy resetting its StateModel while the Helix controller still regarded 
the node as online, so Helix did not move partition mastership to other 
storage nodes. This caused 40 minutes of downtime for users of these partitions.
   
   The root cause is a ZooKeeper session race condition:
   The ZooKeeper session may expire and be replaced by a new one before the 
live instance is created. So when a live instance (ephemeral node) is being 
created, if the expected session has expired, we should NOT create the 
ephemeral node.
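
The proposed check can be illustrated with a minimal, self-contained sketch. It does not use the real ZooKeeper or Helix APIs: `FakeZkConnection`, `expireAndReconnect`, `createEphemeral`, and the path `/LIVEINSTANCES/node1` are hypothetical names chosen for this illustration only. The idea shown is that the caller captures the session id it expects, and the create is refused if the connection has since moved to a different session.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a ZooKeeper connection (hypothetical; not the real client). */
class FakeZkConnection {
    private long sessionId = 1L;
    // Maps an ephemeral node path to the id of the session that owns it.
    private final Map<String, Long> ephemeralOwners = new HashMap<>();

    synchronized long getSessionId() {
        return sessionId;
    }

    /** Simulates a session expiry followed by a reconnect with a new session. */
    synchronized void expireAndReconnect() {
        final long expired = sessionId;
        // Ephemeral nodes owned by the expired session disappear.
        ephemeralOwners.values().removeIf(owner -> owner == expired);
        sessionId++;
    }

    /**
     * Creates the ephemeral node only if the connection is still on the
     * session the caller expects. Returns false, creating nothing, otherwise.
     * The check and the create happen under one lock, so the session cannot
     * change between them.
     */
    synchronized boolean createEphemeral(String path, long expectedSessionId) {
        if (sessionId != expectedSessionId) {
            return false; // expected session is gone: do NOT create the node
        }
        ephemeralOwners.put(path, sessionId);
        return true;
    }

    /** Returns the owning session id, or null if the node does not exist. */
    synchronized Long ownerOf(String path) {
        return ephemeralOwners.get(path);
    }
}

class SessionAwareCreateDemo {
    public static void main(String[] args) {
        FakeZkConnection zk = new FakeZkConnection();
        long expected = zk.getSessionId(); // captured while handling a new-session event
        zk.expireAndReconnect();           // session expires before the create runs
        boolean created = zk.createEphemeral("/LIVEINSTANCES/node1", expected);
        System.out.println("created with stale session: " + created);
    }
}
```

In this sketch the stale create is rejected, and only the handler for the newest session succeeds in creating the node. How the real client can perform the session check and the create atomically is outside what this simulation shows.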

