In the paragraph that Michael quoted, it is written, among other things: "For example, when a partition leader changes its ISR in ZK, the controller will typically not learn about these changes for many seconds." Why would it take "many seconds"? Sending a watch event to the controller should be pretty fast. Also, in the same paragraph, Colin states: "By the time the controller re-reads the znode and sets up a new watch, the state may have changed from what it was when the watch originally fired. [...] only way to resolve the discrepancy." Why would this lead to any discrepancy? It seems to me that the controller will simply read an even newer state in such a scenario.
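To make the mechanics concrete, here is a minimal sketch of the one-shot
watch pattern Colin is describing (Java, against the standard ZooKeeper
client; the znode path is made up for illustration). The watch event
carries no data, so the client must re-read the znode and re-register the
watch; any updates that land between the event firing and the re-read are
coalesced, meaning the client skips intermediate states but always ends up
reading the newest one. That is why I don't see where a lasting
discrepancy would come from:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Illustrative only: the partition-state path is hypothetical.
    public class IsrWatchExample implements Watcher {
        private static final String ISR_PATH =
                "/brokers/topics/t/partitions/0/state";
        private final ZooKeeper zk;

        public IsrWatchExample(ZooKeeper zk) throws Exception {
            this.zk = zk;
            readAndRewatch();
        }

        // One-shot watch pattern: every event forces a re-read, and only
        // the re-read (not the event itself) carries the data.
        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    readAndRewatch();
                } catch (Exception e) {
                    // Session loss etc. needs real handling in practice.
                }
            }
        }

        private void readAndRewatch()
                throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            // Window: between the watch firing and this read, the znode
            // may change again; such intermediate states are skipped, but
            // the data returned here is never older than the event.
            byte[] data = zk.getData(ISR_PATH, this, stat);
            System.out.println("ISR state @ zxid " + stat.getMzxid()
                    + ": " + new String(data));
        }
    }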
Also, another argument mentioned in the original KIP-500 proposal had to do with speeding up controller failover: "Because the controllers will now all track the latest state, controller failover will not require a lengthy reloading period where we transfer all the state to the new controller." But this does not seem to be a problem with ZK per se, and it could be solved by keeping a broker as a standby controller (briefly mentioned as future work here: https://www.slideshare.net/ConfluentInc/a-deep-dive-into-kafka-controller).

By the way, with regard to the deployment & configuration issue, Colin provides some more arguments on Kafka's dev mailing list (https://lists.apache.org/list.html?d...@kafka.apache.org:lte=1M:KIP-500) if anybody is interested in having a look.

On Fri, 2 Aug 2019 at 16:01, Michael Han <h...@apache.org> wrote:

> Very well said, thank you Ted!
>
> >> I would still opt for quorum outside rather than quorum as a library.
>
> One observation on outside quorum vs. library: for Raft, CockroachDB and
> TiDB both chose the library approach instead of depending on etcd, though
> they all share etcd's Raft implementation. ZooKeeper could be used in a
> similar way if we could abstract ZAB and provide a nice SMR interface on
> top of it.
>
> On Fri, Aug 2, 2019 at 12:44 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > The core issue in these situations, in my experience, is that having
> > the quorum as a separate service can be a pain point. This
> > misunderstanding about how watches work and why they don't provide the
> > data is just a symptom of this. Having an integrated quorum is very
> > attractive from the point of view of management and tighter
> > integration with the record of state.
> >
> > If I had it all to do over again, though, I think I would still opt
> > for quorum outside rather than quorum as a library. There are
> > management burdens, but many of those management burdens are implicit
> > in the fact that managing the state of the system is different from
> > managing the system or doing the stuff the system does. Pulling the
> > quorum system into the do-stuff system doesn't actually make life all
> > that much easier, even if it does simplify the installer.
> >
> > The countervailing risk that you are likely to get a quorum system
> > wrong is really significant. Having a battle-tested (some might say
> > battle-scarred) system like ZK is quite a virtue, since you can have a
> > different level of confidence in it than in something you whipped up
> > last week.
> >
> > On Fri, Aug 2, 2019 at 11:49 AM Patrick Hunt <ph...@apache.org> wrote:
> >
> > > Michael, I think you are describing subscribe - this?
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-153
> > > Wasn't there some work done to keep tlogs around for a while? Or am
> > > I misremembering? (fb folks?)
> > >
> > > I'll also add that we haven't done any benchmarking in quite some
> > > time. It would be interesting to collect a few of these use cases
> > > from the community, esp. downstreams, and evaluate performance to
> > > see if we can address them.
> > >
> > > Patrick
> > >
> > > On Fri, Aug 2, 2019 at 11:03 AM Michael Han <h...@apache.org> wrote:
> > >
> > > > Folks,
> > > >
> > > > Some of you might have already seen this. Comments?
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
> > > >
> > > > What caught my eye is:
> > > >
> > > > *Worse still, although ZooKeeper is the store of record, the state
> > > > in ZooKeeper often doesn't match the state that is held in memory
> > > > in the controller. For example, when a partition leader changes
> > > > its ISR in ZK, the controller will typically not learn about these
> > > > changes for many seconds. There is no generic way for the
> > > > controller to follow the ZooKeeper event log. Although the
> > > > controller can set one-shot watches, the number of watches is
> > > > limited for performance reasons. When a watch triggers, it doesn't
> > > > tell the controller the current state -- only that the state has
> > > > changed. By the time the controller re-reads the znode and sets up
> > > > a new watch, the state may have changed from what it was when the
> > > > watch originally fired. If there is no watch set, the controller
> > > > may not learn about the change at all. In some cases, restarting
> > > > the controller is the only way to resolve the discrepancy.*
> > > >
> > > > I've seen some similar ZooKeeper use cases that ended up like
> > > > what's described here. How can ZooKeeper solve this? It seems to
> > > > me that the only solution is to provide linearizable reads on
> > > > watched operations. Thoughts?
> > > >
> > > > Michael.
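P.S. On Michael's closing question about linearizable reads: ZooKeeper
already exposes a building block for this. A plain read can be served from
a follower's possibly-stale view, but calling sync() on a path before
reading makes the client's view catch up with the leader as of the sync
(not a strictly linearizable read, since sync() does not go through the
quorum, but close in practice). Here is a minimal sketch (Java, against
the standard ZooKeeper client; the future-based wiring is just my own
illustration, not an API ZooKeeper provides):

    import java.util.concurrent.CompletableFuture;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SyncedRead {
        // sync() is async-only, so bridge its callback to a future we
        // can block on.
        public static byte[] readLatest(ZooKeeper zk, String path)
                throws Exception {
            CompletableFuture<Void> synced = new CompletableFuture<>();
            zk.sync(path, (rc, p, ctx) -> {
                if (rc == KeeperException.Code.OK.intValue()) {
                    synced.complete(null);
                } else {
                    synced.completeExceptionally(
                            KeeperException.create(
                                    KeeperException.Code.get(rc), p));
                }
            }, null);
            // Our view is now at least as fresh as the leader's
            // committed state when the sync was processed.
            synced.get();

            // This read reflects every write that completed before the
            // sync.
            Stat stat = new Stat();
            return zk.getData(path, false, stat);
        }
    }

A sync-then-read after each watch event would bound staleness, but it
still coalesces intermediate states rather than replaying them; actually
following every transition would need something like the transaction-log
subscription Patrick points to in ZOOKEEPER-153.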