Patrik Ivarsson created ZOOKEEPER-4893:
------------------------------------------

             Summary: Excessive reconnection delays due to hardcoded sleep intervals
                 Key: ZOOKEEPER-4893
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4893
             Project: ZooKeeper
          Issue Type: Improvement
          Components: java client
    Affects Versions: 3.9.3
            Reporter: Patrik Ivarsson

*Description*

I'll try to explain our issue as clearly as I can.

Some clients take too long to reconnect to a ZooKeeper cluster after a minor downtime. We have identified two hardcoded sleep intervals in the client connection logic that contribute to this delay, and neither can be configured. Spending several seconds in this disconnected state, even though the cluster is up and healthy, is a problem in our setup.

*The two Thread.sleep() calls in question*

1. Random sleep (0-1000ms) before attempting a new connection:
   * [ClientCnxn.java#L1138 (release-3.9.3)|https://github.com/apache/zookeeper/blob/release-3.9.3/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1138]
2. Fixed 1000ms sleep before reconnecting to the last known server:
   * [StaticHostProvider#L363 (release-3.9.3)|https://github.com/apache/zookeeper/blob/release-3.9.3/zookeeper-server/src/main/java/org/apache/zookeeper/client/StaticHostProvider.java#L362]

*Example Scenario*

Consider a three-node ZooKeeper cluster (node01, node02, node03) where node01 is currently the leader.

1. Event: node01 is temporarily taken down for short maintenance (e.g. for security patching).
2. Result: The remaining nodes (node02 and node03) elect a new leader; the election completes within ~1000ms.
3. Behavior of a client that was connected to node01 (worst-case scenario):
   * Connection to node01 is lost → client enters a suspended state.
   * Waits 500ms, attempts connection to node02 → fails (cluster not ready).
   * Waits 499ms, attempts connection to node03 → fails (cluster still not ready).
   * Waits 1000ms because we are now back at the original node01 (sleep #2 in the list above).
   * Waits 1000ms before connecting to node01 → fails (this node is down for maintenance).
   * Waits 1000ms before retrying node02 → finally succeeds.
4. Total reconnection time: ~4 seconds, despite the cluster being available after just 1 second.

*Impact*

* Clients remain in a suspended state longer than necessary, leading to degraded service availability.
* The reconnection delay is artificially inflated by the hardcoded sleeps.

*Suggested improvement*

* Let users supply their own logic for how long the client sleeps before a retry. Making these sleep intervals configurable would help, but being able to plug in our own implementation of the waiting logic would be even better.

*Offer to contribute*

We would be happy to submit a pull request to address this issue if that would be helpful. Please let us know whether a contribution would be welcome and whether you have any guidance on the preferred approach.

We would appreciate any insights from the maintainers. Thanks!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
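
To make the suggested improvement and the worst-case arithmetic above more concrete, here is a minimal sketch of what a pluggable waiting policy could look like. Everything in it is an assumption for illustration: `ReconnectDelayPolicy`, `replay`, `cappedBackoff` and `totalDelayMillis` are hypothetical names and are not part of the ZooKeeper client API. The sketch only adds up sleep time; it does not model connection timeouts or which server each attempt reaches.

```java
public class ReconnectDelaySketch {

    /**
     * Hypothetical extension point (NOT an existing ZooKeeper interface):
     * decides how long the client sleeps before its next reconnect step.
     */
    interface ReconnectDelayPolicy {
        /** @param waitIndex zero-based index of the wait since the connection was lost */
        long delayMillis(int waitIndex);
    }

    /** The five waits listed in the worst-case scenario of this issue, in order. */
    static final long[] SCENARIO_WAITS = {500, 499, 1000, 1000, 1000};

    /** Policy that simply replays a fixed sequence of waits. */
    static ReconnectDelayPolicy replay(long[] waits) {
        return waitIndex -> waits[waitIndex];
    }

    /** One possible user-supplied policy: exponential backoff with an upper bound. */
    static ReconnectDelayPolicy cappedBackoff(long baseMillis, long capMillis) {
        // Clamp the shift so the left shift cannot overflow for large indexes.
        return waitIndex -> Math.min(capMillis, baseMillis << Math.min(waitIndex, 20));
    }

    /** Sums the sleep time a policy would impose over the given number of waits. */
    static long totalDelayMillis(ReconnectDelayPolicy policy, int waits) {
        long total = 0;
        for (int i = 0; i < waits; i++) {
            total += policy.delayMillis(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // 500 + 499 + 1000 + 1000 + 1000 = 3999 ms, i.e. the "~4 seconds" in the scenario.
        System.out.println(totalDelayMillis(replay(SCENARIO_WAITS), 5));
        // 50 + 100 + 200 + 400 + 400 = 1150 ms for the same number of waits.
        System.out.println(totalDelayMillis(cappedBackoff(50, 400), 5));
    }
}
```

The base and cap values passed to `cappedBackoff` are arbitrary examples, not a recommendation. The point of the interface is only that the application, rather than a hardcoded constant, would decide how long to wait.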