Apologies if this is not exactly clear; I spoke, an AI automatically turned it into text that it found to be clearer, and I pasted…
The concept behind these retries with ZooKeeper is to recover from lost connections that happen before the session times out. An interaction with ZooKeeper should only fail once the session has actually expired; on a mere connection loss, the right move is to wait, see if the connection recovers, and continue uninterrupted. The interruption should only happen on session loss. Retrying until a session loss occurs is a hacky way to approximate this behavior, which is why the ZooKeeper retries run for the length of the session timeout. It is not an ideal implementation, though.

A better approach relies on the fact that ZooKeeper will notify you when a connection is lost and when it is recovered. When a connection is lost, the system should ideally go into a quiet mode and stop trying to communicate with ZooKeeper. This matters because overload is itself a common cause of connection loss, so having many things retrying only makes it worse. When the notification arrives that the connection has recovered, the system resumes normal operation from that quiet state.

To handle connection loss, it's better to implement a different architecture. Instead of continuously retrying the request, the ZK connection manager should hold on to the request until it is notified that the connection has recovered. This can be achieved with a notify loop: the request waits until the connection is restored, a notify-all wakes it from the wait state, and only then does it proceed with the intended call. This ensures the connection is fully recovered before any further calls are made, prevents a lot of threads from retrying in vain when we already know the connection is down, and helps defend against bugs the current system can cause.

Due to things like GC pauses or thread starvation, the current retry approach can break things like leader election, because you can end up having moved on in a process while a lingering retry fires and creates an out-of-order interaction. Concretely: you can hit a session expiration, upon which the ZK manager creates a new session, which can trigger events like a new leader election; meanwhile, a retrying request misses all of this, comes in with a retry on the new session, and in some cases succeeds even though the request is stale. I don't recall the exact circumstances that allow this to happen, but it is something I've very clearly seen happen.
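To make the notify-loop idea concrete, here is a minimal sketch in Java of what such a connection manager might look like. The watcher states (SyncConnected, Disconnected, Expired) and the ZooKeeper constructor/getData calls are the real client API; the class name ZkConnectionManager, the expired flag, and the getData wrapper are my own illustration, not an existing implementation.

    import java.io.IOException;

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical connection manager: callers park quietly on a disconnect
    // and are woken by notifyAll on recovery, instead of busy-retrying.
    public class ZkConnectionManager implements Watcher {

        private final Object connectionLock = new Object();
        private volatile boolean connected = false;
        private volatile boolean expired = false;
        private final long sessionTimeoutMs;
        private final ZooKeeper zk;

        public ZkConnectionManager(String hosts, int sessionTimeoutMs) throws IOException {
            this.sessionTimeoutMs = sessionTimeoutMs;
            this.zk = new ZooKeeper(hosts, sessionTimeoutMs, this);
        }

        @Override
        public void process(WatchedEvent event) {
            synchronized (connectionLock) {
                switch (event.getState()) {
                    case SyncConnected:
                        connected = true;
                        // Recovery notification: wake every request parked below.
                        connectionLock.notifyAll();
                        break;
                    case Disconnected:
                        // Quiet mode: stop issuing calls and wait for recovery.
                        connected = false;
                        break;
                    case Expired:
                        connected = false;
                        expired = true;
                        // Session is gone; wake waiters so they can fail fast.
                        connectionLock.notifyAll();
                        break;
                    default:
                        break;
                }
            }
        }

        // Park the calling thread until the connection recovers, but never
        // longer than the session timeout - past that the session is dead anyway.
        private void waitForConnected() throws InterruptedException, KeeperException {
            long deadline = System.currentTimeMillis() + sessionTimeoutMs;
            synchronized (connectionLock) {
                while (!connected) {
                    if (expired || System.currentTimeMillis() >= deadline) {
                        throw new KeeperException.SessionExpiredException();
                    }
                    connectionLock.wait(deadline - System.currentTimeMillis());
                }
            }
        }

        // Example guarded call: blocks quietly through a disconnect,
        // fails only once the session itself is lost.
        public byte[] getData(String path) throws KeeperException, InterruptedException {
            waitForConnected();
            return zk.getData(path, false, null);
        }
    }

The key property is that a request issued before a disconnect cannot sneak through on a later session: anything parked in waitForConnected either proceeds on the same recovered session or fails with a session-expired error, rather than retrying blindly into a new session.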