I did not mean to challenge the way you plan to refactor existing code to make it equivalent but simpler.
My point was different, but I mixed up disconnections, i.e. ZK connection loss (no big deal), and session expiration (big deal). So please ignore my previous message. The correct way to look at it, as I understand it: connection loss is a non-event (sorry for the pun) and indeed can/should be retried. Doing this efficiently as you suggest is better. Session expiration is the real signal from ZK that the client "transaction" (a sequence of actions that achieves some goal, such as electing a leader among participants or grabbing a lock) got interrupted and must be completely restarted using a new session.

On Tue, Sep 28, 2021 at 1:03 PM Mark Miller <[email protected]> wrote:

> That’s why I say that ideally you should actually enter a quiet mode on zk
> disconnect and come out of it on reconnect; what I describe isn’t
> necessarily that ideal. I don’t think it assumes zk continuity - the
> retries are because when you dc from a zk server, as long as all the
> servers are not dead, it will automatically start trying to reconnect to
> another zk server - the retries are just waiting for that to happen. They
> are just dumb retries, which was silly, because zk tells you when it dcs
> and when it reconnects. I don’t really see how it relates to a transaction
> - you just don’t want things to fail when the zk client is failing over to
> another server; in a good impl, that should be mostly transparent to any
> user of the system. ZK is designed to feed your app this info so that you
> can implement something that makes these failovers smooth. Sometimes it’s
> not even a failover - it’s just that overload or gc pauses have stalled
> zk’s heartbeat for too long.
>
> I did all that wrong though for sure - I did it before I really even
> understood zookeeper that well. So the current stuff is just pretty poor.
> What I describe is a relatively quick and simple way to make it much, much
> better.
> It would be much more invasive and a lot more work and effort to do
> something too radically different. The end result would be similar, except
> there really is no reason to sit on all the outstanding calls. It’s just
> difficult to do anything that’s transparent to the user when zk is failing
> over and that works with all the existing code.
>
> The end goal is not a transaction or anything though - it’s just to have
> your app able to smoothly handle the zk client transitioning from one zk
> server to another, or missing its heartbeat due to load and then
> connecting again. It’s certainly not currently that often a smooth event -
> this just describes a way I have been able to make it smooth without
> having to completely rewrite everything.
>
> Mark
>
> On Tue, Sep 28, 2021 at 2:06 AM Ilan Ginzburg <[email protected]> wrote:
>
>> Should ZK disconnect be handled at the individual call level to begin
>> with? Aren’t we implementing “recipes” (equivalent to “transactions” in a
>> DB world) that combine multiple actions and that implicitly assume ZK
>> continuity over the course of execution? It seems these should rather
>> fail and retry as a whole rather than as individual actions.
>>
>> I don’t have any existing examples in mind of where this is problematic
>> in existing code (or it would already be a bug), but the existing
>> single-call-level retry approach feels fragile.
>>
>> Ilan
>>
>> On Mon, Sep 27, 2021 at 19:04, Mark Miller <[email protected]> wrote:
>>
>>> There are a variety of ways you could do it.
>>>
>>> The easiest short-term change is to simply modify what handles most zk
>>> retries - the ZkCmdExecutor, already plugged into SolrZkClient where it
>>> retries. It tries to guess when a session times out and does fallback
>>> retries up to that point.
>>>
>>> Because there can be any number of calls doing this, zk disconnects
>>> tend to spiral the cluster down.
>>>
>>> It shouldn’t work like this.
>>> Everything in the system related to zk should be event driven.
>>>
>>> So ZkCmdExecutor should not sleep and retry some number of times.
>>>
>>> Its retry method should call something like
>>> ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk
>>> notifies that there is a reconnect, signalAll on the lock - or use a
>>> condition. Same thing if the ConnectionManager is closed.
>>>
>>> It’s not as ideal as entering a quiet mode, but it’s tremendously
>>> simpler to do.
>>>
>>> Now when zk hits a dc, it doesn’t get repeatedly hit over and over up
>>> until an expiration guess or past a ConnectionManager close.
>>>
>>> Pretty much everything gets held up, and the system is forced into what
>>> is essentially a quiet state - though with all the outstanding calls
>>> hanging - which gives zookeeper the ability to easily reconnect to a
>>> valid zk server, in which case everything is released to retry and
>>> succeed.
>>>
>>> With this approach (and removing the guessed isExpired on
>>> ConnectionManager and using its actual zk client state), you can
>>> actually bombard and overload the system with updates - which currently
>>> will crush the system - and instead survive the bombardment without any
>>> “updates are disabled, zk is not connected” failures. Unless your zk
>>> cluster is actually catastrophically down.
>>>
>>> Mark
>>>
>>> On Sun, Sep 26, 2021 at 7:54 AM David Smiley <[email protected]> wrote:
>>>
>>>> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <[email protected]>
>>>> wrote:
>>>> ...
>>>>
>>>>> Zk alerts us when it loses a connection via callback. When the
>>>>> connection is back, another callback. An unlimited number of
>>>>> locations trying to work this out on their own is terrible zk. In an
>>>>> ideal world, everything enters a zk quiet mode and re-engages when zk
>>>>> says hello again.
>>>>> A simpler, shorter-term improvement is to simply sink all the zk
>>>>> calls when they hit the zk connection manager and not let them go
>>>>> until the connection is restored.
>>>>
>>>> While I don’t tend to work on this stuff, I want to understand the
>>>> essence of your point. Are you basically recommending that our ZK
>>>> interactions should all go through one instance of a ZK connection
>>>> manager class that can keep track of ZK’s connection state?
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>
> --
> - Mark
>
> http://about.me/markrmiller
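[Editor's note: the ZkCmdExecutor / ConnectionManager#waitForReconnect idea discussed above, plus the connection-loss vs. session-expiration distinction, can be sketched roughly as below. This is a hypothetical, stripped-down model, not the actual Solr classes: the class and method names mirror those mentioned in the thread, but the State enum, the callbacks, and the timeout handling are illustrative assumptions. A real implementation would wire onDisconnect/onReconnect/onExpired to ZooKeeper's Watcher events (Disconnected, SyncConnected, Expired) instead of calling them directly.]

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: callers block on a condition instead of sleep/retry
// loops, and a single connection manager releases them all on reconnect.
public class ConnectionManagerSketch {

    enum State { CONNECTED, DISCONNECTED, EXPIRED }

    static class ConnectionManager {
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition reconnected = lock.newCondition();
        private State state = State.CONNECTED;

        // Would be driven by a ZK Watcher seeing Disconnected.
        void onDisconnect() {
            lock.lock();
            try { state = State.DISCONNECTED; } finally { lock.unlock(); }
        }

        // Would be driven by a ZK Watcher seeing SyncConnected:
        // wake every blocked caller so it can retry its call.
        void onReconnect() {
            lock.lock();
            try { state = State.CONNECTED; reconnected.signalAll(); }
            finally { lock.unlock(); }
        }

        // Session expiration: waiters must NOT retry on this session;
        // per the thread, the whole "recipe" restarts on a new session.
        void onExpired() {
            lock.lock();
            try { state = State.EXPIRED; reconnected.signalAll(); }
            finally { lock.unlock(); }
        }

        // What a retry loop would call instead of sleeping some number
        // of times and guessing at expiration.
        void waitForReconnect(long timeoutMs) throws InterruptedException {
            lock.lock();
            try {
                long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
                while (state == State.DISCONNECTED && nanos > 0) {
                    nanos = reconnected.awaitNanos(nanos);
                }
                if (state == State.EXPIRED) {
                    throw new IllegalStateException(
                        "session expired; restart with a new session");
                }
                if (state != State.CONNECTED) {
                    throw new IllegalStateException(
                        "timed out waiting for reconnect");
                }
            } finally { lock.unlock(); }
        }
    }

    public static void main(String[] args) throws Exception {
        ConnectionManager cm = new ConnectionManager();
        cm.onDisconnect(); // simulate losing the zk server
        Thread caller = new Thread(() -> {
            try {
                cm.waitForReconnect(5000);
                System.out.println("retrying call after reconnect");
            } catch (Exception e) {
                System.out.println("failed: " + e.getMessage());
            }
        });
        caller.start();
        Thread.sleep(100);  // client fails over to another zk server...
        cm.onReconnect();   // ...and the Watcher reports SyncConnected
        caller.join();
    }
}
```

The design point this models: instead of N independent callers each running dumb sleep-and-retry loops (which hammer the cluster during a failover), every blocked call parks on one condition, the system naturally quiets down, and a single event from ZK releases them all at once.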
