Hi, Jason,

Thanks for the KIP. +1

Just to confirm: for those newly added request types, will we expose the existing latency metrics (total, local, remote, etc.) with a new tag request=[request-type]?

Jun
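For context, Kafka's existing per-request latency metrics are exposed under a common MBean pattern keyed by a request tag. If the new quorum RPCs are wired into the same machinery, they would presumably show up along these lines (a sketch only; the request=Vote tag value assumes the Vote RPC name from the KIP):

    kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Vote
    kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Vote
    kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Vote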
On Tue, Aug 4, 2020 at 3:00 PM Boyang Chen <reluctanthero...@gmail.com> wrote:

> Thanks for the KIP Jason, +1 (binding) from me as well for sure :)
>
> On Tue, Aug 4, 2020 at 2:46 PM Colin McCabe <cmcc...@apache.org> wrote:
>
> > On Mon, Aug 3, 2020, at 20:55, Jason Gustafson wrote:
> > > Hi Colin,
> > >
> > > Thanks for the responses.
> > >
> > > > I have a few lingering questions. I still don't like the fact that
> > > > the leader epoch / fetch epoch is 31 bits. What happens when this
> > > > rolls over? Can we just make this 63 bits now so that we never have
> > > > to worry about it again? ZK has some awful bugs surrounding 32-bit
> > > > rollover, due to a similar decision to use a 32-bit counter in
> > > > their log structure. Doesn't seem like a good tradeoff.
> > >
> > > This is a bit difficult to do at the moment since the leader epoch is
> > > 4 bytes in the message format. One option that I have considered is
> > > toggling a batch attribute that lets us turn the producerId into an
> > > 8-byte leader epoch instead, since we do not have a use for it in the
> > > metadata quorum. We would need another solution if we ever wanted to
> > > use Raft for partition replication, but perhaps by then we can make
> > > the case for a new message format.
> >
> > Hi Jason,
> >
> > Thanks for the explanation. I suspected that there was a technical
> > limitation like this lurking somewhere. I think a hack like the one
> > you suggested would be OK for now. I just really want to avoid
> > thinking about rollover :)
>
> Regarding the epoch overflow, some offline discussions among Jason,
> Guozhang, Jose and I reached some conclusions:
>
> 1. The current default election timeout is 10 seconds, which means it
> would take hundreds of years to exhaust the epoch by bumping it once per
> election timeout. Even if the user sets it to 1 second, exhausting it
> would still take years.
>
> 2. The most common cause of fast epoch bumps is a network partition: if
> a voter cannot connect to the quorum, it will repeatedly start elections
> and bump the epoch. To mitigate this, we have already planned a
> follow-up KIP to add the `pre-vote` feature described in the Raft
> literature to the Kafka Raft implementation, which prevents rapid epoch
> increments at the algorithm level.
>
> 3. As you suggested, leader epoch overflow is a common problem, not just
> for Raft. We could kick off a separate KIP to change the epoch from 4
> bytes to 8 bytes through a message format upgrade, solving the issue for
> Kafka in a holistic manner.
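As a back-of-the-envelope check on point 1 (assuming a 31-bit epoch and one bump per election timeout):

    max epoch values:  2^31 - 1 ≈ 2.1 billion
    one bump / 10 s:   ~2.1 * 10^10 s ≈ 680 years to exhaust
    one bump / 1 s:    ~2.1 * 10^9 s  ≈ 68 years to exhaust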
> > > > Just like in bootstrap.servers, I don't think we want to manually
> > > > assign IDs per hostname. The hosts know their own IDs, after all.
> > > > Having to manually specify the IDs also opens up the possibility of
> > > > misconfigurations: what if I say the foobar server is node 2, but
> > > > it's actually node 3? This would make the logs extremely confusing.
> > > > I realize this may require a little finesse to do, but there's got
> > > > to be a way we can avoid hard-coding IDs.
> > >
> > > Fine. We can move this to KIP-631, but I think it would be a mistake
> > > to take IDs out of this configuration. For safety, the one thing that
> > > the configuration needs to tell us is what the IDs of the voters are.
> > > Without that, it's really easy for a cluster to get into a state
> > > where none of the quorum members agree on what the proper set of
> > > voters is. I think perhaps you are confused about the usage of these
> > > IDs: they are what enables validation of vote requests. Without them,
> > > a voter would have to accept a vote request from any ID. There is a
> > > reason that other consensus systems like ZooKeeper and etcd require
> > > IDs when configured statically.
> >
> > I hadn't considered the fact that we need to validate incoming voter
> > requests. The fact that nodes can have multiple DNS addresses does
> > make this difficult to do with just a list of hostnames.
> >
> > I guess you're right that we should keep the IDs. But let's be careful
> > to validate that the node's ID really is what we think it is, and
> > consider that peer failed if it's not.
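For reference, the static voter configuration under discussion ties each voter ID to an endpoint. In KIP-595 it takes roughly the following shape (the host names here are hypothetical):

    quorum.voters=1@controller-1:9093,2@controller-2:9093,3@controller-3:9093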
> > > > Also, here's another case where we are saying "broker" when we
> > > > mean "controller." It's really hard to break old habits. :)
> > >
> > > I think we still have this basic disagreement on the KIP-500 vision
> > > :). I'm not sure I understand why you are so eager to force users to
> > > think about the controller as a separate system. It's almost like
> > > Zookeeper is not going anywhere!
> >
> > Well, KIP-500 clearly does identify the controller as a separate
> > system, not as part of the broker, even if it runs in the same JVM. :)
> > A system where all the nodes had the same role would need a
> > fundamentally different design, like Cassandra or something.
> >
> > I know you're joking, but just so that others understand, it's not
> > fair to say that "it's almost like ZK is not going anywhere." KIP-500
> > clusters will have simpler deployment and support a lot of interesting
> > use cases, like single-JVM clusters, that would not be possible with
> > the current setup.
> >
> > At the same time, saying "broker" when you mean "controller" confuses
> > people. For example, I recently had someone ask why we needed
> > BrokerHeartbeat when Raft already specifies a mechanism for leader
> > change. I had to explain the difference between broker nodes and
> > controller nodes.
> >
> > Anyway, +1 (binding). Excited to see Raftka going forward!
> >
> > best,
> > Colin
> >
> > > -Jason
> > >
> > > On Mon, Aug 3, 2020 at 4:36 PM Jose Garcia Sancio
> > > <jsan...@confluent.io> wrote:
> > >
> > > > +1.
> > > >
> > > > Thanks for the detailed KIP!
> > > >
> > > > On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson
> > > > <ja...@confluent.io> wrote:
> > > >
> > > > > Hi All, I'd like to start a vote on this proposal:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > > > > The discussion has been active for a bit more than 3 months and
> > > > > I think the main points have been addressed. We have also moved
> > > > > some of the pieces into follow-up proposals, such as KIP-630.
> > > > >
> > > > > Please keep in mind that the details are bound to change as all
> > > > > of the pieces start coming together. As usual, we will keep this
> > > > > thread notified of such changes.
> > > > >
> > > > > For me personally, this is super exciting since we have been
> > > > > thinking about this work ever since I started working on Kafka!
> > > > > I am +1 of course.
> > > > >
> > > > > Best,
> > > > > Jason
> > > >
> > > > --
> > > > -Jose