There are a variety of ways you could do it.

The easiest short-term change is to simply modify what handles most zk
retries - the ZkCmdExecutor - which is already plugged into SolrZkClient
wherever it retries. It tries to guess when a session has timed out and
does fallback retries up to that point.

Because there can be any number of calls doing this, zk disconnects tend to
spiral the cluster down.

It shouldn’t work like this. Everything in the system related to zk should
be event driven.

So ZkCmdExecutor should not sleep and retry some number of times.

Its retry method should call something like
ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk
notifies there is a reconnect, notifyAll on the lock. Or use a Condition
and signalAll. Same thing if the ConnectionManager is closed.
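
As a rough sketch of that wait/signal idea (not the actual ConnectionManager
code): the class name ReconnectGate and the onConnected / onDisconnected /
close hooks are made up for illustration, and only waitForReconnect
corresponds to the method suggested above. It assumes the zk watcher
callback is what calls the hooks.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the connection-state part of ConnectionManager. */
class ReconnectGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition stateChanged = lock.newCondition();
    private boolean connected = false;
    private boolean closed = false;

    /** Assumed hook: called from the zk watcher when the connection is (re)established. */
    void onConnected() {
        lock.lock();
        try {
            connected = true;
            stateChanged.signalAll(); // release every caller parked in waitForReconnect
        } finally {
            lock.unlock();
        }
    }

    /** Assumed hook: called from the zk watcher on disconnect. Nobody is woken. */
    void onDisconnected() {
        lock.lock();
        try {
            connected = false;
        } finally {
            lock.unlock();
        }
    }

    /** Assumed hook: wakes all waiters so they can give up instead of hanging forever. */
    void close() {
        lock.lock();
        try {
            closed = true;
            stateChanged.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Blocks until connected or closed. Returns true if connected, false if closed. */
    boolean waitForReconnect() throws InterruptedException {
        lock.lock();
        try {
            while (!connected && !closed) { // loop guards against spurious wakeups
                stateChanged.await();
            }
            return !closed;
        } finally {
            lock.unlock();
        }
    }
}
```

Callers do no sleeping or counting of attempts here. They park on the
condition and only move again when zk reports a reconnect or the manager
is closed.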

It’s not as ideal as entering a quiet mode, but it’s tremendously simpler
to do.

Now when zk hits a disconnect, it doesn’t get hit over and over up until
an expiration guess or past a ConnectionManager close.

Pretty much everything gets held up, and the system is forced into what is
essentially a quiet state - though with all the outstanding calls hanging -
which gives zookeeper the ability to easily reconnect to a valid zk server,
at which point everything is released to retry and succeed.
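
To make the "held up, then released to retry" behavior concrete, here is a
minimal sketch of a retry loop that waits on the connection instead of
sleeping. This is not the real ZkCmdExecutor: EventDrivenRetry, the
Reconnects interface, and the ConnectionLost exception are all hypothetical
names, with ConnectionLost standing in for whatever connection-loss
exception the zk client throws.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of an event-driven retry; not the actual ZkCmdExecutor. */
class EventDrivenRetry {

    /** Assumed minimal view of the connection manager. */
    interface Reconnects {
        /** Blocks until reconnected; returns false once the manager is closed. */
        boolean waitForReconnect() throws InterruptedException;
    }

    /** Stand-in for the zk client's connection-loss exception. */
    static class ConnectionLost extends Exception {}

    /** Runs op, and on connection loss parks until the connection is back. */
    static <T> T retryOperation(Callable<T> op, Reconnects conn) throws Exception {
        while (true) {
            try {
                return op.call();
            } catch (ConnectionLost e) {
                // No sleep and no attempt counter: block until zk says hello again.
                if (!conn.waitForReconnect()) {
                    throw new IllegalStateException("connection manager closed", e);
                }
            }
        }
    }
}
```

The only exits from the loop are success, a non-connection error from the
operation, or the manager being closed, so a disconnect produces no extra
traffic against zk while it is trying to reconnect.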

With this approach (and removing the guessed isExpired on ConnectionManager
in favor of its actual zk client state), you can actually bombard and
overload the system with updates - which currently will crush the system -
and instead survive the bombardment without any "updates are disabled" or
"zk is not connected" failures. Unless your zk cluster is actually
catastrophically down.

Mark

On Sun, Sep 26, 2021 at 7:54 AM David Smiley <[email protected]> wrote:

>
> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <[email protected]> wrote:
> ...
>
>> Zk alerts us when it loses a connection via callback. When the
>> connection is back, another callback. An unlimited number of locations
>> trying to work this out on their own is terrible zk. In an ideal world,
>> everything enters a zk quiet mode and re-engages when zk says hello again.
>> A simpler shorter term improvement is to simply sink all the zk calls when
>> they hit the zk connection manager and don’t let them go until the
>> connection is restored.
>>
>
> While I don't tend to work on this stuff, I want to understand the essence
> of your point.  Are you basically recommending that our ZK interactions
> should all go through one instance of a ZK connection manager class that
> can keep track of ZK's connection state?
>
-- 
- Mark

http://about.me/markrmiller
