junkaixue commented on issue #3064:
URL: https://github.com/apache/helix/issues/3064#issuecomment-3238264839

   Hi @jasondhoyt,
   
   I think there is a conceptual gap that needs to be filled in.
   
   How do we define "not connected"? What is the behavior of "not connected"?
   
   What you describe here as "not connected" is not Helix-defined behavior but 
ZooKeeper behavior. HelixManager only marks itself as not connected when the 
session times out after a connection was previously established. Otherwise, an 
intermittent disconnect is tolerated by the ZooKeeper connection and does not 
count as disconnected.
   
   By design, "not connected" reflects the node's heartbeat and offline 
status. If short intermittent disconnects were defined as "not connected", 
you would see endless, unstable rebalancing in your cluster, because 
intermittent disconnects are very common.
   
   This is why ZooKeeper has the session timeout concept. Unless the 
disconnection lasts longer than the timeout, it does not count as "not 
connected".
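
   To illustrate the distinction, here is a minimal sketch of the session 
timeout semantics described above. `SessionTracker` and its method names are 
hypothetical and are not part of Helix or ZooKeeper; the sketch only models the 
idea that a link drop shorter than the timeout still counts as connected. In 
real ZooKeeper, session expiry is decided by the server, so treat this as a 
simplified model rather than the actual implementation.

```java
/**
 * Simplified model of session-timeout semantics.
 * Hypothetical class for illustration; not a Helix or ZooKeeper API.
 */
final class SessionTracker {
    private final long sessionTimeoutMs;
    private boolean established = false; // true once a connection has succeeded
    private long linkDownSinceMs = -1;   // -1 means the link is currently up

    SessionTracker(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** The transport connected or reconnected (in this model, a late reconnect starts a new session). */
    void linkUp(long nowMs) {
        established = true;
        linkDownSinceMs = -1;
    }

    /** The transport dropped; remember when the outage started. */
    void linkDown(long nowMs) {
        if (linkDownSinceMs < 0) {
            linkDownSinceMs = nowMs;
        }
    }

    /**
     * "Connected" in the session sense: a connection was established before,
     * and any ongoing outage has not lasted longer than the session timeout.
     */
    boolean isConnected(long nowMs) {
        if (!established) {
            return false;
        }
        if (linkDownSinceMs < 0) {
            return true;
        }
        return nowMs - linkDownSinceMs <= sessionTimeoutMs;
    }
}
```

   With a 30-second timeout, a link that drops at t=10s and recovers at t=25s 
is reported as connected the whole time. Only an outage that runs past the 
timeout flips `isConnected` to false, which is the point at which the node 
would be treated as offline.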
   
   
   
   As for why a session times out, there can be multiple reasons, such as a 
network partition, your ZK server being overloaded with traffic and having a 
long event queue, or even the connected ZK server being saturated. If you see 
this session timeout frequently, most likely your ZK server is overloaded.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

