Apologies if this is not exactly clear; I spoke, an AI automatically turned it into text that it found to be clearer, and I pasted…
The concept behind these retries with ZooKeeper is to recover from lost connections that happen before the session times out. An interaction with ZooKeeper should only fail once the session has actually expired; on a mere connection loss, the right move is to wait, see if the connection recovers, and continue uninterrupted. The interruption should only happen on session loss. Retrying until a session loss occurs is a hacky way to approximate this behavior, which is why the ZooKeeper retries run for the length of the session timeout. It is not an ideal implementation, though.

A better approach relies on the fact that ZooKeeper will notify you when a connection is lost and when it is recovered. When a connection is lost, the system should ideally go into a quiet mode and stop trying to communicate with ZooKeeper. This matters because overload is itself a common cause of connection loss, so having many things retrying only makes it worse. When the notification arrives that the connection has recovered, the system resumes normal operation from that quiet state.

To handle connection loss, it's better to implement a different architecture. Instead of continuously retrying the request, the ZK connection manager should hold on to the request until it is notified that the connection has recovered. This can be achieved with a notify loop: the request waits until the connection is restored, a notify-all wakes it from the wait state, and only then does it proceed with the intended call. This ensures the connection is fully recovered before any further calls are made, prevents a lot of threads from retrying in vain when we already know the connection is down, and helps defend against bugs the current system can cause.

Due to things like GC pauses or thread starvation, the current retry approach can break things like leader election, because you can end up having moved on in a process while a lingering retry fires and creates an out-of-order interaction. Concretely: you can hit a session expiration, upon which the ZK manager creates a new session, which can trigger events like a new leader election; meanwhile, a retrying request misses all of this, comes in with a retry on the new session, and in some cases succeeds even though the request is stale. I don't recall the exact circumstances that allow this to happen, but it is something I've very clearly seen happen.
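To make the notify-loop idea concrete, here is a minimal sketch in Java of what such a connection manager might look like. The watcher states (SyncConnected, Disconnected, Expired) and the ZooKeeper constructor/getData calls are the real client API; the class name ZkConnectionManager, the expired flag, and the getData wrapper are my own illustration, not an existing implementation.

    import java.io.IOException;

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical connection manager: callers park quietly on a disconnect
    // and are woken by notifyAll on recovery, instead of busy-retrying.
    public class ZkConnectionManager implements Watcher {

        private final Object connectionLock = new Object();
        private volatile boolean connected = false;
        private volatile boolean expired = false;
        private final long sessionTimeoutMs;
        private final ZooKeeper zk;

        public ZkConnectionManager(String hosts, int sessionTimeoutMs) throws IOException {
            this.sessionTimeoutMs = sessionTimeoutMs;
            this.zk = new ZooKeeper(hosts, sessionTimeoutMs, this);
        }

        @Override
        public void process(WatchedEvent event) {
            synchronized (connectionLock) {
                switch (event.getState()) {
                    case SyncConnected:
                        connected = true;
                        // Recovery notification: wake every request parked below.
                        connectionLock.notifyAll();
                        break;
                    case Disconnected:
                        // Quiet mode: stop issuing calls and wait for recovery.
                        connected = false;
                        break;
                    case Expired:
                        connected = false;
                        expired = true;
                        // Session is gone; wake waiters so they can fail fast.
                        connectionLock.notifyAll();
                        break;
                    default:
                        break;
                }
            }
        }

        // Park the calling thread until the connection recovers, but never
        // longer than the session timeout - past that the session is dead anyway.
        private void waitForConnected() throws InterruptedException, KeeperException {
            long deadline = System.currentTimeMillis() + sessionTimeoutMs;
            synchronized (connectionLock) {
                while (!connected) {
                    if (expired || System.currentTimeMillis() >= deadline) {
                        throw new KeeperException.SessionExpiredException();
                    }
                    connectionLock.wait(deadline - System.currentTimeMillis());
                }
            }
        }

        // Example guarded call: blocks quietly through a disconnect,
        // fails only once the session itself is lost.
        public byte[] getData(String path) throws KeeperException, InterruptedException {
            waitForConnected();
            return zk.getData(path, false, null);
        }
    }

The key property is that a request issued before a disconnect cannot sneak through on a later session: anything parked in waitForConnected either proceeds on the same recovered session or fails with a session-expired error, rather than retrying blindly into a new session.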