In the paragraph that Michael quoted, it is written, among other things: "For example, when a partition leader changes its ISR in ZK, the controller will typically not learn about these changes for many seconds." Why would it take "many seconds"? Sending a watch event to the controller should be pretty fast. Also, in the same paragraph, Colin states: "By the time the controller re-reads the znode and sets up a new watch, the state may have changed from what it was when the watch originally fired. [...] only way to resolve the discrepancy." Why would this lead to any discrepancy? It seems to me that the controller will simply read an even newer state in such a scenario.
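To make the mechanics concrete, here is a minimal sketch of the one-shot
watch pattern Colin is describing (Java, against the standard ZooKeeper
client; the znode path is made up for illustration). The watch event
carries no data, so the client must re-read the znode and re-register the
watch; any updates that land between the event firing and the re-read are
coalesced, meaning the client skips intermediate states but always ends up
reading the newest one. That is why I don't see where a lasting
discrepancy would come from:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Illustrative only: the partition-state path is hypothetical.
    public class IsrWatchExample implements Watcher {
        private static final String ISR_PATH =
                "/brokers/topics/t/partitions/0/state";
        private final ZooKeeper zk;

        public IsrWatchExample(ZooKeeper zk) throws Exception {
            this.zk = zk;
            readAndRewatch();
        }

        // One-shot watch pattern: every event forces a re-read, and only
        // the re-read (not the event itself) carries the data.
        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    readAndRewatch();
                } catch (Exception e) {
                    // Session loss etc. needs real handling in practice.
                }
            }
        }

        private void readAndRewatch()
                throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            // Window: between the watch firing and this read, the znode
            // may change again; such intermediate states are skipped, but
            // the data returned here is never older than the event.
            byte[] data = zk.getData(ISR_PATH, this, stat);
            System.out.println("ISR state @ zxid " + stat.getMzxid()
                    + ": " + new String(data));
        }
    }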
Also, another argument mentioned in the original KIP-500 proposal had to do with speeding up controller failover: "Because the controllers will now all track the latest state, controller failover will not require a lengthy reloading period where we transfer all the state to the new controller." But this does not seem to be a problem with ZK per se, and it could be solved by keeping a broker as a standby controller (briefly mentioned as future work here: https://www.slideshare.net/ConfluentInc/a-deep-dive-into-kafka-controller).

By the way, with regard to the deployment & configuration issue, Colin provides some more arguments on Kafka's dev mailing list (https://lists.apache.org/list.html?d...@kafka.apache.org:lte=1M:KIP-500) if anybody is interested in having a look.

On Fri, 2 Aug 2019 at 16:01, Michael Han <h...@apache.org> wrote:

> Very well said, thank you Ted!
>
> >> I would still opt for quorum outside rather than quorum as a library.
>
> One observation on outside quorum vs. library: for Raft, CockroachDB and
> TiDB both chose the library approach instead of depending on etcd, though
> they all share etcd's Raft implementation. ZooKeeper could be used in a
> similar way if we could abstract ZAB and provide a nice SMR interface on
> top of it.
>
> On Fri, Aug 2, 2019 at 12:44 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > The core issue in these situations, in my experience, is that having
> > the quorum as a separate service can be a pain point. This
> > misunderstanding about how watches work and why they don't provide the
> > data is just a symptom of this. Having an integrated quorum is very
> > attractive from the point of view of management and tighter
> > integration with the record of state.
> >
> > If I had it all to do over again, though, I think I would still opt
> > for quorum outside rather than quorum as a library. There are
> > management burdens, but many of those management burdens are implicit
> > in the fact that managing the state of the system is different from
> > managing the system or doing the stuff the system does. Pulling the
> > quorum system into the do-stuff system doesn't actually make life all
> > that much easier, even if it does simplify the installer.
> >
> > The countervailing risk that you are likely to get a quorum system
> > wrong is really significant. Having a battle-tested (some might say
> > battle-scarred) system like ZK is quite a virtue, since you can have a
> > different level of confidence in it than in something you whipped up
> > last week.
> >
> > On Fri, Aug 2, 2019 at 11:49 AM Patrick Hunt <ph...@apache.org> wrote:
> >
> > > Michael, I think you are describing subscribe - this?
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-153
> > > Wasn't there some work done to keep tlogs around for a while? Or am
> > > I misremembering? (fb folks?)
> > >
> > > I'll also add that we haven't done any benchmarking in quite some
> > > time. It would be interesting to collect a few of these use cases
> > > from the community, esp. downstreams, and evaluate performance to
> > > see if we can address them.
> > >
> > > Patrick
> > >
> > > On Fri, Aug 2, 2019 at 11:03 AM Michael Han <h...@apache.org> wrote:
> > >
> > > > Folks,
> > > >
> > > > Some of you might have already seen this. Comments?
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
> > > >
> > > > What caught my eye is:
> > > >
> > > > *Worse still, although ZooKeeper is the store of record, the state
> > > > in ZooKeeper often doesn't match the state that is held in memory
> > > > in the controller. For example, when a partition leader changes
> > > > its ISR in ZK, the controller will typically not learn about these
> > > > changes for many seconds. There is no generic way for the
> > > > controller to follow the ZooKeeper event log. Although the
> > > > controller can set one-shot watches, the number of watches is
> > > > limited for performance reasons. When a watch triggers, it doesn't
> > > > tell the controller the current state -- only that the state has
> > > > changed. By the time the controller re-reads the znode and sets up
> > > > a new watch, the state may have changed from what it was when the
> > > > watch originally fired. If there is no watch set, the controller
> > > > may not learn about the change at all. In some cases, restarting
> > > > the controller is the only way to resolve the discrepancy.*
> > > >
> > > > I've seen some similar ZooKeeper use cases that ended up like
> > > > what's described here. How can ZooKeeper solve this? It seems to
> > > > me that the only solution is to provide linearizable reads on
> > > > watched operations. Thoughts?
> > > >
> > > > Michael.
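P.S. On Michael's closing question about linearizable reads: ZooKeeper
already exposes a building block for this. A plain read can be served from
a follower's possibly-stale view, but calling sync() on a path before
reading makes the client's view catch up with the leader as of the sync
(not a strictly linearizable read, since sync() does not go through the
quorum, but close in practice). Here is a minimal sketch (Java, against
the standard ZooKeeper client; the future-based wiring is just my own
illustration, not an API ZooKeeper provides):

    import java.util.concurrent.CompletableFuture;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SyncedRead {
        // sync() is async-only, so bridge its callback to a future we
        // can block on.
        public static byte[] readLatest(ZooKeeper zk, String path)
                throws Exception {
            CompletableFuture<Void> synced = new CompletableFuture<>();
            zk.sync(path, (rc, p, ctx) -> {
                if (rc == KeeperException.Code.OK.intValue()) {
                    synced.complete(null);
                } else {
                    synced.completeExceptionally(
                            KeeperException.create(
                                    KeeperException.Code.get(rc), p));
                }
            }, null);
            // Our view is now at least as fresh as the leader's
            // committed state when the sync was processed.
            synced.get();

            // This read reflects every write that completed before the
            // sync.
            Stat stat = new Stat();
            return zk.getData(path, false, stat);
        }
    }

A sync-then-read after each watch event would bound staleness, but it
still coalesces intermediate states rather than replaying them; actually
following every transition would need something like the transaction-log
subscription Patrick points to in ZOOKEEPER-153.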