junkaixue commented on issue #3064: URL: https://github.com/apache/helix/issues/3064#issuecomment-3238264839
Hi @jasondhoyt, I think there is a conceptual gap that needs to be filled in. How do we define "not connected", and what behavior follows from it? The "not connected" you mention here is ZooKeeper behavior, not Helix-defined behavior. HelixManager is only marked as not connected when its session has timed out after a connection was previously established. Otherwise, intermediate disconnects are allowed at the ZooKeeper connection level and do not count as disconnected.

By design, "not connected" reflects the node's heartbeat and offline status. If short intermediate disconnects were defined as "not connected", you would see endless, unstable rebalancing in your cluster, because intermediate disconnects are very common. This is why ZooKeeper has the session timeout concept: unless the disconnection lasts longer than the timeout, it does not count as "not connected".

As for why a session times out, there can be multiple reasons, such as a network partition, a ZK server that is overloaded with traffic and has a long event queue, or the connected ZK server being saturated. If you are seeing session timeouts frequently, most likely your ZK server is overloaded.
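To make the distinction concrete, here is a minimal sketch of the rule described above: a short disconnect is tolerated, and only an outage that lasts at least the session timeout counts as "not connected". This is a simplified model, not Helix or ZooKeeper code. The class `SessionLivenessSketch` and its methods are hypothetical names made up for illustration, and in a real deployment it is the ZooKeeper server that decides session expiry, not a client-side clock.

```java
// Hypothetical illustration only; not part of the Helix or ZooKeeper APIs.
final class SessionLivenessSketch {
    private final long sessionTimeoutMs;
    private long disconnectedSinceMs = -1; // -1 means the link is currently up
    private boolean expired = false;       // sticky once the timeout is exceeded

    SessionLivenessSketch(long sessionTimeoutMs) {
        if (sessionTimeoutMs <= 0) {
            throw new IllegalArgumentException("sessionTimeoutMs must be positive");
        }
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    // Transport dropped: remember when the outage started, but do not flip state yet.
    void onDisconnected(long nowMs) {
        if (disconnectedSinceMs < 0) {
            disconnectedSinceMs = nowMs;
        }
    }

    // Transport came back: the session survives only if the outage was shorter than the timeout.
    void onReconnected(long nowMs) {
        if (disconnectedSinceMs >= 0 && nowMs - disconnectedSinceMs >= sessionTimeoutMs) {
            expired = true;
        }
        disconnectedSinceMs = -1;
    }

    // "Connected" in the sense of this comment: the session has not timed out.
    boolean isConsideredConnected(long nowMs) {
        if (expired) {
            return false;
        }
        if (disconnectedSinceMs < 0) {
            return true;
        }
        return nowMs - disconnectedSinceMs < sessionTimeoutMs;
    }

    public static void main(String[] args) {
        SessionLivenessSketch s = new SessionLivenessSketch(30_000);
        s.onDisconnected(1_000);
        System.out.println(s.isConsideredConnected(5_000));  // prints true: a 4 s blip is tolerated
        s.onReconnected(6_000);
        s.onDisconnected(10_000);
        System.out.println(s.isConsideredConnected(45_000)); // prints false: 35 s exceeds the 30 s timeout
    }
}
```

If every short blip flipped the state, each one would look like a node going offline and coming back, which is the unstable rebalancing described above.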
