Should ZK disconnects be handled at the individual call level to begin
with? Aren’t we implementing “recipes” (equivalent to “transactions” in a
DB world) that combine multiple actions and implicitly assume ZK
continuity over the course of execution? It seems these should fail and
retry as a whole rather than as individual actions?

I don’t have any concrete examples in mind of where this is problematic in
existing code (or it would already be a bug), but the existing
single-call-level retry approach feels fragile.

Ilan

On Mon 27 Sep 2021 at 19:04, Mark Miller <[email protected]> wrote:

> There are a variety of ways you could do it.
>
> The easiest short-term change is to simply modify what handles most zk
> retries - the ZkCmdExecutor, already plugged into SolrZkClient at the
> points where it retries. It tries to guess when the session has timed
> out and does fallback retries up to that point.
>
> Because there can be any number of calls doing this, zk disconnects tend
> to spiral the cluster down.
>
> It shouldn’t work like this. Everything in the system related to zk should
> be event driven.
>
> So ZkCmdExecutor should not sleep and retry some number of times.
>
> Its retry method should call something like
> ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk
> notifies us that there is a reconnect, signalAll on the lock. Or use a
> condition. Same thing if the ConnectionManager is closed.
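>
> Roughly, a sketch of that ConnectionManager piece (the names here are
> approximate, not the current API):
>
>   import java.util.concurrent.locks.Condition;
>   import java.util.concurrent.locks.ReentrantLock;
>
>   class ConnectionManager {
>     private final ReentrantLock lock = new ReentrantLock();
>     private final Condition stateChanged = lock.newCondition();
>     private volatile boolean connected;
>     private volatile boolean closed;
>
>     // Called by retry paths instead of sleeping and counting retries.
>     void waitForReconnect() throws InterruptedException {
>       lock.lock();
>       try {
>         while (!connected && !closed) {
>           stateChanged.await();
>         }
>       } finally {
>         lock.unlock();
>       }
>     }
>
>     // Called from the zk watcher on SyncConnected, or on close.
>     void signalStateChange(boolean nowConnected, boolean nowClosed) {
>       lock.lock();
>       try {
>         connected = nowConnected;
>         closed = nowClosed;
>         stateChanged.signalAll();
>       } finally {
>         lock.unlock();
>       }
>     }
>   }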
>
> It’s not as ideal as entering a quiet mode, but it’s tremendously
> simpler to do.
>
> Now when zk hits a disconnect, it doesn’t get repeatedly hit over and
> over until an expiration guess passes or the ConnectionManager is
> closed.
>
> Pretty much everything gets held up, the system is forced into what is
> essentially a quiet state - though with all the outstanding calls
> hanging - which gives zookeeper the ability to easily reconnect to a
> valid zk server - in which case everything is released to retry and
> succeed.
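>
> In ZkCmdExecutor terms, the retry loop then looks something like this
> (a sketch, assuming the waitForReconnect above and a generic
> ZkOperation - not what the code does today):
>
>   import org.apache.zookeeper.KeeperException;
>
>   <T> T retryOperation(ZkOperation<T> operation) throws Exception {
>     while (true) {
>       try {
>         return operation.execute();
>       } catch (KeeperException.ConnectionLossException e) {
>         // No sleep, no retry budget, no expiration guess: block until
>         // the ConnectionManager signals a reconnect (or is closed, in
>         // which case the next attempt fails fast).
>         connectionManager.waitForReconnect();
>       }
>     }
>   }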
>
> With this approach (and removing the isExpired guess on
> ConnectionManager in favor of the actual zk client state), you can
> actually bombard and overload the system with updates - which currently
> will crush the system - and instead survive the bombardment without any
> “updates are disabled, zk is not connected” failures. Unless your zk
> cluster is actually catastrophically down.
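>
> (For the expiration piece: instead of the timed guess, just ask the
> client. A sketch:)
>
>   import org.apache.zookeeper.ZooKeeper;
>
>   // States.isAlive() goes false once the client is CLOSED, which is
>   // where an actually expired session ends up.
>   boolean likelyExpired(ZooKeeper keeper) {
>     return !keeper.getState().isAlive();
>   }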
>
> Mark
>
> On Sun, Sep 26, 2021 at 7:54 AM David Smiley <[email protected]> wrote:
>
>>
>> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <[email protected]>
>> wrote:
>> ...
>>
>>> Zk alerts us when it loses a connection via a callback. When the
>>> connection is back, another callback. An unlimited number of locations
>>> trying to work this out on their own is terrible zk usage. In an ideal
>>> world, everything enters a zk quiet mode and re-engages when zk says
>>> hello again. A simpler, shorter-term improvement is to simply sink all
>>> the zk calls when they hit the zk connection manager and not let them
>>> go until the connection is restored.
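>>>
>>> Those callbacks arrive through the standard zk Watcher - a sketch of
>>> the connection manager end (signalConnected/markDisconnected are
>>> assumed names, not existing methods):
>>>
>>>   import org.apache.zookeeper.WatchedEvent;
>>>
>>>   public void process(WatchedEvent event) {
>>>     switch (event.getState()) {
>>>       case SyncConnected:
>>>         connectionManager.signalConnected();  // release held calls
>>>         break;
>>>       case Disconnected:
>>>         connectionManager.markDisconnected(); // sink calls until reconnect
>>>         break;
>>>       case Expired:
>>>         // session is gone for real; a new ZooKeeper client is needed
>>>         break;
>>>       default:
>>>         break;
>>>     }
>>>   }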
>>>
>>
>> While I don't tend to work on this stuff, I want to understand the
>> essence of your point.  Are you basically recommending that our ZK
>> interactions should all go through one instance of a ZK connection manager
>> class that can keep track of ZK's connection state?
>>
> --
> - Mark
>
> http://about.me/markrmiller
>
