Michael, I think you are describing subscribe. Is it this? https://issues.apache.org/jira/browse/ZOOKEEPER-153 Wasn't there some work done to keep tlogs around for a while? Or am I misremembering? (fb folks?)
I'll also add that we haven't done any benchmarking in quite some time. It would be interesting to collect a few of these use cases from the community, esp downstreams, and evaluate performance, see if we can address.

Patrick

On Fri, Aug 2, 2019 at 11:03 AM Michael Han <h...@apache.org> wrote:

> Folks,
>
> Some of you might already see this. Comments?
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
>
> What caught my eyes are:
>
> *Worse still, although ZooKeeper is the store of record, the state in
> ZooKeeper often doesn't match the state that is held in memory in the
> controller. For example, when a partition leader changes its ISR in ZK,
> the controller will typically not learn about these changes for many
> seconds. There is no generic way for the controller to follow the
> ZooKeeper event log. Although the controller can set one-shot watches, the
> number of watches is limited for performance reasons. When a watch
> triggers, it doesn't tell the controller the current state-- only that the
> state has changed. By the time the controller re-reads the znode and sets
> up a new watch, the state may have changed from what it was when the watch
> originally fired. If there is no watch set, the controller may not learn
> about the change at all. In some cases, restarting the controller is the
> only way to resolve the discrepancy.*
>
> I've seen some similar zookeeper use cases that ended up like what's
> described here. How can ZooKeeper solve this? It seems to me that the only
> solution is to provide linearizable read on watched operations. Thoughts?
>
> Michael.
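
For anyone who wants to see the one-shot watch gap from the quoted KIP-500 paragraph concretely, here is a rough sketch using a toy in-memory model. It is only an illustration under stated assumptions: `ToyStore`, `run_demo`, and the `/isr` path are hypothetical names made up for this example, and none of it is the ZooKeeper client API. The reader sets a watch, two writes land before it re-reads, and the intermediate value is never observed.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ToyStore:
    """Toy stand-in for a znode store with one-shot watches (hypothetical, not ZooKeeper)."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.watches: Dict[str, List[Callable[[str], None]]] = {}

    def get(self, path: str, watch: Optional[Callable[[str], None]] = None) -> Optional[str]:
        # Reading with a watch registers a callback for the NEXT change only.
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data.get(path)

    def set(self, path: str, value: str) -> None:
        self.data[path] = value
        # One-shot: pending watches are removed as they fire, and the
        # callback receives only the path, not the new value.
        for callback in self.watches.pop(path, []):
            callback(path)


def run_demo() -> Tuple[List[Optional[str]], int]:
    store = ToyStore()
    store.set("/isr", "v0")

    notifications: List[str] = []
    observed: List[Optional[str]] = []

    def on_change(path: str) -> None:
        # The "controller" only learns that something changed.
        notifications.append(path)

    observed.append(store.get("/isr", watch=on_change))  # reads v0, arms the watch
    store.set("/isr", "v1")  # fires the watch; the watch is now gone
    store.set("/isr", "v2")  # no watch armed, so this change is silent

    # The controller finally handles the notification: re-read and re-arm.
    observed.append(store.get("/isr", watch=on_change))
    return observed, len(notifications)


if __name__ == "__main__":
    print(run_demo())
```

In this toy run the reader sees v0 and then v2 with a single notification, so v1 is missed entirely, which is the gap between "state has changed" and "what the state was" that the quoted text complains about.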