There are a variety of ways you could do it. The easiest short-term change is to simply modify what handles most zk retries - the ZkCmdExecutor - which is already plugged into SolrZkClient where it retries. It tries to guess when a session times out and does fallback retries up to that point.
Because there can be any number of calls doing this, zk disconnects tend to spiral the cluster down. It shouldn’t work like this. Everything in the system related to zk should be event driven. So ZkCmdExecutor should not sleep and retry some number of times. Its retry method should call something like ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk notifies there is a reconnect, signalAll the lock. Or use a condition. Same thing if the ConnectionManager is closed.

It’s not as ideal as entering a quiet mode, but it’s tremendously simpler to do. Now when zk hits a disconnect, it doesn’t get hit over and over up until an expiration guess or past a ConnectionManager close. Pretty much everything gets held up, and the system is forced into what is essentially a quiet state - though with all the outstanding calls hanging - which gives zookeeper the ability to easily reconnect to a valid zk server, at which point everything is released to retry and succeed.

With this approach (and removing the guessed isExpired on ConnectionManager and using its actual zk client state) you can actually bombard and overload the system with updates - which currently will crush the system - and instead survive the bombardment without any "updates are disabled" or "zk is not connected" failures. Unless your zk cluster is actually catastrophically down.

Mark

On Sun, Sep 26, 2021 at 7:54 AM David Smiley <[email protected]> wrote:

> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <[email protected]> wrote:
> ...
>
>> Zk alerts us when it loses a connection via callback. When the
>> connection is back, another callback. An unlimited number of locations
>> trying to work this out on their own is terrible zk. In an ideal world,
>> everything enters a zk quiet mode and re-engages when zk says hello again.
>> A simpler shorter term improvement is to simply sink all the zk calls when
>> they hit the zk connection manager and don’t let them go until the
>> connection is restored.
>>
>
> While I don't tend to work on this stuff, I want to understand the essence
> of your point. Are you basically recommending that our ZK interactions
> should all go through one instance of a ZK connection manager class that
> can keep track of ZK's connection state?
>

--
- Mark

http://about.me/markrmiller
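
A minimal sketch of the wait-on-a-condition idea from the message above, in Java. This is an illustration only, not Solr's actual ConnectionManager: the class name ReconnectGate and the methods onDisconnected, onReconnected and close are hypothetical, and the sketch assumes something else (for example the zk watcher callback) calls them when the connection state changes. Only waitForReconnect is named in the message.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the connection-state tracking discussed above.
// It is NOT Solr's ConnectionManager; every name except waitForReconnect
// is invented for this sketch.
class ReconnectGate {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition stateChanged = lock.newCondition();
  private boolean connected = true;
  private boolean closed = false;

  // Assumed to be called from the zk watcher when a disconnect is reported.
  void onDisconnected() {
    lock.lock();
    try {
      connected = false;
    } finally {
      lock.unlock();
    }
  }

  // Assumed to be called from the zk watcher when the connection is back.
  // Releases every caller parked in waitForReconnect().
  void onReconnected() {
    lock.lock();
    try {
      connected = true;
      stateChanged.signalAll();
    } finally {
      lock.unlock();
    }
  }

  // Closing also releases parked callers so nothing hangs past shutdown.
  void close() {
    lock.lock();
    try {
      closed = true;
      stateChanged.signalAll();
    } finally {
      lock.unlock();
    }
  }

  // Blocks while disconnected, with no sleeping and no polling of zk.
  // Returns true if the caller should retry, false if the gate was closed.
  boolean waitForReconnect() throws InterruptedException {
    lock.lock();
    try {
      // Loop guards against spurious wakeups.
      while (!connected && !closed) {
        stateChanged.await();
      }
      return !closed;
    } finally {
      lock.unlock();
    }
  }
}
```

Under these assumptions, a retry loop would call waitForReconnect() after a connection loss instead of sleeping, retry when it returns true, and give up when it returns false. The sketch does not address the window between a call failing and the disconnect callback arriving, which a real implementation would need to handle.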
