Should ZK disconnects be handled at the individual call level to begin with? Aren’t we implementing “recipes” (the equivalent of “transactions” in the DB world) that combine multiple actions and implicitly assume ZK continuity over the course of execution? It seems these should fail and retry as a whole rather than as individual actions?
I don’t have any existing examples in mind of where this is problematic in existing code (or it would already be a bug), but the existing single-call-level retry approach feels fragile.

Ilan

On Mon 27 Sep 2021 at 19:04, Mark Miller <[email protected]> wrote:

> There are a variety of ways you could do it.
>
> The easiest short-term change is to simply modify what handles most zk
> retries - the ZkCmdExecutor - already plugged into SolrZkClient where it
> retries. It tries to guess when a session times out and does fallback
> retries up to that point.
>
> Because there can be any number of calls doing this, zk disconnects tend
> to spiral the cluster down.
>
> It shouldn’t work like this. Everything in the system related to zk should
> be event driven.
>
> So ZkCmdExecutor should not sleep and retry some number of times.
>
> Its retry method should call something like
> ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk
> notifies there is a reconnect, signalAll the lock. Or use a condition.
> Same thing if the ConnectionManager is closed.
>
> It’s not as ideal as entering a quiet mode, but it’s tremendously simpler
> to do.
>
> Now when zk hits a disconnect, it doesn’t get repeatedly hit over and over
> up until an expiration guess or past a ConnectionManager close.
>
> Pretty much everything gets held up, and the system is forced into what is
> essentially a quiet state - though with all the outstanding calls hanging -
> which gives zookeeper the ability to easily reconnect to a valid zk server,
> in which case everything is released to retry and succeed.
>
> With this approach (and removing the guess-based isExpired on
> ConnectionManager and using its actual zk client state), you can actually
> bombard and overload the system with updates - which currently will crush
> the system - and instead you can survive the bombardment without any
> “updates are disabled, zk is not connected” failures. Unless your zk
> cluster is actually catastrophically down.
>
> Mark
>
> On Sun, Sep 26, 2021 at 7:54 AM David Smiley <[email protected]> wrote:
>
>> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <[email protected]>
>> wrote:
>> ...
>>
>>> Zk alerts us when it loses a connection via callback. When the
>>> connection is back, another callback. An unlimited number of locations
>>> trying to work this out on their own is terrible zk usage. In an ideal
>>> world, everything enters a zk quiet mode and re-engages when zk says
>>> hello again. A simpler shorter-term improvement is to simply sink all
>>> the zk calls when they hit the zk connection manager and don’t let them
>>> go until the connection is restored.
>>>
>>
>> While I don't tend to work on this stuff, I want to understand the
>> essence of your point. Are you basically recommending that our ZK
>> interactions should all go through one instance of a ZK connection manager
>> class that can keep track of ZK's connection state?
>>
> --
> - Mark
>
> http://about.me/markrmiller
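For concreteness, here is a minimal sketch of the lock-and-condition pattern Mark describes. Class and method names are hypothetical (the real Solr ConnectionManager and ZkCmdExecutor differ); the idea is that retrying callers block in waitForReconnect, and ZooKeeper's own connection callbacks release them all at once via signalAll instead of each caller sleeping and guessing:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical sketch of the event-driven approach discussed above:
 * instead of each caller sleeping and retrying on its own schedule,
 * all callers block here until ZooKeeper's own callbacks say the
 * connection is back (or the manager shuts down).
 */
public class ConnectionManager {

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition reconnected = lock.newCondition();
    private volatile boolean connected = false;
    private volatile boolean closed = false;

    /**
     * Called by a retrying command executor; parks until reconnect,
     * close, or timeout. Returns true if the connection is back.
     */
    public boolean waitForReconnect(long timeout, TimeUnit unit) throws InterruptedException {
        lock.lock();
        try {
            long nanos = unit.toNanos(timeout);
            while (!connected && !closed && nanos > 0) {
                nanos = reconnected.awaitNanos(nanos);
            }
            return connected;
        } finally {
            lock.unlock();
        }
    }

    /** Invoked from ZooKeeper's watcher callback on reconnect. */
    public void onReconnect() {
        lock.lock();
        try {
            connected = true;
            reconnected.signalAll(); // release every blocked caller at once
        } finally {
            lock.unlock();
        }
    }

    /** Invoked from ZooKeeper's watcher callback on disconnect. */
    public void onDisconnect() {
        connected = false;
    }

    /** Shutdown also releases waiters so they can fail fast. */
    public void close() {
        lock.lock();
        try {
            closed = true;
            reconnected.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

A retry loop might then look something like the fragment below, replacing the current sleep-and-guess behavior. Ilan's point would be addressed by wrapping the whole multi-step "recipe" in this loop rather than each individual ZK call (again, a sketch, not the actual Solr code):

```java
// Hypothetical fragment of a retry loop: "operation" is a Callable
// covering either a single ZK call or a whole multi-step recipe,
// and "connectionManager" is the sketch class above.
while (true) {
    try {
        return operation.call();
    } catch (KeeperException.ConnectionLossException e) {
        // Block until ZooKeeper itself reports a reconnect, rather
        // than hammering the server with timed retries.
        connectionManager.waitForReconnect(30, TimeUnit.SECONDS);
    }
}
```

The volatile flags plus a single condition mean there is no busy-waiting and no per-caller timer: the only thing that wakes the retriers is ZooKeeper's own reconnect event, which is the "quiet state" behavior described in the thread.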
