Hello Leonard,

Thanks for your comments, I'm replying inline below:

On Wed, Apr 29, 2020 at 1:58 AM Wang (Leonard) Ge <w...@confluent.io> wrote:

> Hi Kafka developers,
>
> It's great to see this proposal and it took me some time to finish reading
> it.
>
> And I have the following questions about the Proposal:
>
>    - How do we plan to test this design to ensure its correctness? Or more
>    broadly, how do we ensure that our new ‘pull’ based model is functional
>    and correct given that it is different from the original RAFT
>    implementation, which has a formal proof of correctness?
>

We have two planned verifications of the design's correctness and liveness.
One is model verification via TLA+:
https://github.com/guozhangwang/kafka-specification

The other is concurrent simulation testing:
https://github.com/confluentinc/kafka/commit/5c0c054597d2d9f458cad0cad846b0671efa2d91
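
To give a flavor of the latter, here is a rough sketch of the kind of
safety invariant those tests assert (this is illustrative only, not the
actual test code; the real tests drive the Raft implementation through a
deterministic event scheduler with random message delays and node
failures):

    import java.util.HashMap;
    import java.util.Map;

    // Checks the core Raft safety property that the simulation runs assert:
    // at most one broker may be recognized as leader within a single epoch.
    class LeaderSafetyInvariant {
        private final Map<Integer, Integer> leaderByEpoch = new HashMap<>();

        void observeLeader(int epoch, int leaderId) {
            Integer existing = leaderByEpoch.putIfAbsent(epoch, leaderId);
            if (existing != null && existing != leaderId) {
                throw new AssertionError(
                    "Conflicting leaders " + existing + " and " + leaderId
                        + " observed in epoch " + epoch);
            }
        }
    }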

>    - Have we considered any sensible defaults for the configuration, i.e.
>    all the election timeout, fetch timeout, etc.? Or do we want to leave
>    this to a later stage when we do the performance testing, etc.?
>

This is a good question. The reason we did not set default values for the
timeout configurations is that we think it may take some benchmarking
experiments to get these defaults right. Some high-level principles to
consider: 1) the fetch.timeout should be on roughly the same scale as the ZK
session timeout, which is now 18 seconds by default -- in practice we've
seen unstable networks suffer more than 10 seconds of transient connectivity
loss; 2) the election.timeout, however, should be smaller than the fetch
timeout, as is also suggested as a practical optimization in the literature:
https://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf

Some more discussions can be found here:
https://github.com/confluentinc/kafka/pull/301/files#r415420081
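
Just to make the relationship concrete, here is an illustrative sketch (the
names and numbers below are placeholders following the two principles above,
not proposed defaults):

    // Placeholder values only -- the real defaults are pending benchmarking.
    final class QuorumTimeoutDefaults {
        // Principle 1: on the same scale as the ZK session timeout (18s),
        // and comfortably above the >10s transient blips seen in practice.
        static final int FETCH_TIMEOUT_MS = 15_000;

        // Principle 2: smaller than the fetch timeout, as suggested in the
        // "Raft Refloated" paper linked above.
        static final int ELECTION_TIMEOUT_MS = 5_000;
    }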


>    - Have we considered piggybacking `BeginQuorumEpoch` with the
>    `FetchQuorumRecords`? I might be missing something obvious but I am just
>    wondering why don’t we just use the `FindQuorum` and `FetchQuorumRecords`
>    APIs and remove the `BeginQuorumEpoch` API?

Note that Begin/EndQuorumEpoch is sent from the leader to the other voter
followers, while FindQuorum / FetchQuorumRecords are sent from followers to
the leader. Arguably one could eventually learn the new leader and epoch via
gossiping FindQuorum, but in practice that could incur a long delay. Having
a leader -> other voters request helps propagate the new leader and epoch
faster under a pull model.
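
To illustrate the point (the names and transport interface here are
hypothetical, not the actual implementation), the push looks roughly like:

    import java.util.List;

    // Sketch: on winning an election, the new leader proactively notifies
    // the other voters so they can redirect their fetches immediately,
    // rather than waiting to discover the new epoch via gossiped FindQuorum.
    final class BeginQuorumEpochSketch {
        interface QuorumChannel {
            void sendBeginQuorumEpoch(int voterId, int leaderId, int epoch);
        }

        static void onElectionWon(QuorumChannel channel, List<Integer> voters,
                                  int localId, int newEpoch) {
            for (int voterId : voters) {
                if (voterId != localId) {
                    channel.sendBeginQuorumEpoch(voterId, localId, newEpoch);
                }
            }
        }
    }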


>    - And about the `FetchQuorumRecords` response schema, in the `Records`
>    field of the response, is it just one record or all the records starting
>    from the FetchOffset? It seems a lot more efficient if we sent all the
>    records during the bootstrapping of the brokers.
>

Yes, the fetching is batched: FetchOffset is just the starting offset of the
batch of records returned in the response.
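
As a toy illustration of the batching semantics (a simplified in-memory log;
the real response is additionally bounded by size limits and the wire schema
differs):

    import java.util.List;

    final class FetchBatchingExample {
        static final List<String> LOG = List.of("r0", "r1", "r2", "r3", "r4");

        // Returns every record from fetchOffset to the end of the log,
        // not just the single record at fetchOffset.
        static List<String> handleFetch(int fetchOffset) {
            return LOG.subList(fetchOffset, LOG.size());
        }

        public static void main(String[] args) {
            // A follower bootstrapping from offset 2 gets [r2, r3, r4]
            // back in one response.
            System.out.println(handleFetch(2));
        }
    }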


>    - Regarding the disruptive broker issue, does our pull-based model
>    suffer from it? If so, have we considered the Pre-Vote stage? If not,
>    why?
>
>
The disruptive broker problem described in the original Raft paper is a
consequence of the push model design. Our analysis showed that with the pull
model it is no longer an issue, which is why we did not include a Pre-Vote
stage.
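
Roughly speaking (a simplified sketch of the pull-model trigger, not the
actual code): a voter only becomes a candidate after failing to fetch from
the leader for a full fetch timeout, so a broker that can still reach the
leader keeps resetting its timer and never disrupts a healthy quorum:

    // Simplified: every successful fetch from the leader resets the timer,
    // and an election is triggered only once the fetch timeout fully elapses.
    final class PullModelElectionTrigger {
        private final long fetchTimeoutMs;
        private long lastSuccessfulFetchMs;

        PullModelElectionTrigger(long fetchTimeoutMs, long nowMs) {
            this.fetchTimeoutMs = fetchTimeoutMs;
            this.lastSuccessfulFetchMs = nowMs;
        }

        void onFetchSuccess(long nowMs) {
            lastSuccessfulFetchMs = nowMs;
        }

        boolean shouldBecomeCandidate(long nowMs) {
            return nowMs - lastSuccessfulFetchMs >= fetchTimeoutMs;
        }
    }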


> Thanks a lot for putting this up, and I hope that my questions can be of
> some value to make this KIP better.
>
> Hope to hear from you soon!
>
> Best wishes,
> Leonard
>
>
> On Wed, Apr 29, 2020 at 1:46 AM Colin McCabe <cmcc...@apache.org> wrote:
>
> > Hi Jason,
> >
> > It's amazing to see this coming together :)
> >
> > I haven't had a chance to read in detail, but I read the outline and a
> > few things jumped out at me.
> >
> > First, for every epoch that is 32 bits rather than 64, I sort of wonder
> > if that's a good long-term choice.  I keep reading about stuff like this:
> > https://issues.apache.org/jira/browse/ZOOKEEPER-1277 .  Obviously, that
> > JIRA is about zxid, which increments much faster than we expect these
> > leader epochs to, but it would still be good to see some rough
> > calculations about how long 32 bits (or really, 31 bits) will last us in
> > the cases where we're using it, and what the space savings we're getting
> > really is.  It seems like in most cases the tradeoff may not be worth it?
> >
> > Another thing I've been thinking about is how we do bootstrapping.  I
> > would prefer to be in a world where formatting a new Kafka node was a
> > first class operation explicitly initiated by the admin, rather than
> > something that happened implicitly when you started up the broker and
> > things "looked blank."
> >
> > The first problem is that things can "look blank" accidentally if the
> > storage system is having a bad day.  Clearly in the non-Raft world, this
> > leads to data loss if the broker that is (re)started this way was the
> > leader for some partitions.
> >
> > The second problem is that we have a bit of a chicken and egg problem
> > with certain configuration keys.  For example, maybe you want to
> > configure some connection security settings in your cluster, but you
> > don't want them to ever be stored in a plaintext config file.  (For
> > example, SCRAM passwords, etc.)  You could use a broker API to set the
> > configuration, but that brings up the chicken and egg problem.  The
> > broker needs to be configured to know how to talk to you, but you need
> > to configure it before you can talk to it.  Using an external secret
> > manager like Vault is one way to solve this, but not everyone uses an
> > external secret manager.
> >
> > quorum.voters seems like a similar configuration key.  In the current
> > KIP, this is only read if there is no other configuration specifying
> > the quorum voter set.  If we had a kafka.mkfs command, we wouldn't need
> > this key because we could assume that there was always quorum
> > information stored locally.
> >
> > best,
> > Colin
> >
> >
> > On Thu, Apr 16, 2020, at 16:44, Jason Gustafson wrote:
> > > Hi All,
> > >
> > > I'd like to start a discussion on KIP-595:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > > This proposal specifies a Raft protocol to ultimately replace Zookeeper
> > > as documented in KIP-500. Please take a look and share your thoughts.
> > >
> > > A few minor notes to set the stage a little bit:
> > >
> > > - This KIP does not specify the structure of the messages used to
> > > represent metadata in Kafka, nor does it specify the internal API that
> > > will be used by the controller. Expect these to come in later proposals.
> > > Here we are primarily concerned with the replication protocol and basic
> > > operational mechanics.
> > > - We expect many details to change as we get closer to integration with
> > > the controller. Any changes we make will be made either as amendments to
> > > this KIP or, in the case of larger changes, as new proposals.
> > > - We have a prototype implementation which I will put online within the
> > > next week which may help in understanding some details. It has diverged
> > > a little bit from our proposal, so I am taking a little time to bring it
> > > in line. I'll post an update to this thread when it is available for
> > > review.
> > >
> > > Finally, I want to mention that this proposal was drafted by myself,
> > > Boyang Chen, and Guozhang Wang.
> > >
> > > Thanks,
> > > Jason
> > >
> >
>
>
> --
> Leonard Ge
> Software Engineer Intern - Confluent
>


-- 
-- Guozhang
