Hi, Jason,

Thanks for the KIP. +1

Just to confirm: for those newly added request types, will we expose the existing latency metrics (total, local, remote, etc.) with a new tag request=[request-type]?

Jun
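For context, Kafka's existing per-request latency metrics are exposed under a common MBean pattern keyed by a request tag. If the new quorum RPCs are wired into the same machinery, they would presumably show up along these lines (a sketch only; the request=Vote tag value assumes the Vote RPC name from the KIP):

    kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Vote
    kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Vote
    kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Vote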
On Tue, Aug 4, 2020 at 3:00 PM Boyang Chen <reluctanthero...@gmail.com> wrote:

> Thanks for the KIP Jason, +1 (binding) from me as well for sure :)
>
> On Tue, Aug 4, 2020 at 2:46 PM Colin McCabe <cmcc...@apache.org> wrote:
>
> > On Mon, Aug 3, 2020, at 20:55, Jason Gustafson wrote:
> > > Hi Colin,
> > >
> > > Thanks for the responses.
> > >
> > > > I have a few lingering questions. I still don't like the fact that
> > > > the leader epoch / fetch epoch is 31 bits. What happens when this
> > > > rolls over? Can we just make this 63 bits now so that we never have
> > > > to worry about it again? ZK has some awful bugs surrounding 32-bit
> > > > rollover, due to a similar decision to use a 32-bit counter in
> > > > their log structure. Doesn't seem like a good tradeoff.
> > >
> > > This is a bit difficult to do at the moment since the leader epoch is
> > > 4 bytes in the message format. One option that I have considered is
> > > toggling a batch attribute that lets us turn the producerId into an
> > > 8-byte leader epoch instead, since we do not have a use for it in the
> > > metadata quorum. We would need another solution if we ever wanted to
> > > use Raft for partition replication, but perhaps by then we can make
> > > the case for a new message format.
> >
> > Hi Jason,
> >
> > Thanks for the explanation. I suspected that there was a technical
> > limitation like this lurking somewhere. I think a hack like the one
> > you suggested would be OK for now. I just really want to avoid
> > thinking about rollover :)
>
> Regarding the epoch overflow, some offline discussions among Jason,
> Guozhang, Jose and I reached some conclusions:
>
> 1. The current default election timeout is 10 seconds, which means it
> would take hundreds of years to exhaust the epoch by bumping it once per
> election timeout. Even if the user sets it to 1 second, exhausting it
> would still take years.
>
> 2. The most common cause of fast epoch bumps is a network partition: if
> a voter cannot connect to the quorum, it will repeatedly start elections
> and bump the epoch. To mitigate this, we have already planned a
> follow-up KIP to add the `pre-vote` feature described in the Raft
> literature to the Kafka Raft implementation, which prevents rapid epoch
> increments at the algorithm level.
>
> 3. As you suggested, leader epoch overflow is a common problem, not just
> for Raft. We could kick off a separate KIP to change the epoch from 4
> bytes to 8 bytes through a message format upgrade, solving the issue for
> Kafka in a holistic manner.
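As a back-of-the-envelope check on point 1 (assuming a 31-bit epoch and one bump per election timeout):

    max epoch values:  2^31 - 1 ≈ 2.1 billion
    one bump / 10 s:   ~2.1 * 10^10 s ≈ 680 years to exhaust
    one bump / 1 s:    ~2.1 * 10^9 s  ≈ 68 years to exhaust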
> > > > Just like in bootstrap.servers, I don't think we want to manually
> > > > assign IDs per hostname. The hosts know their own IDs, after all.
> > > > Having to manually specify the IDs also opens up the possibility of
> > > > misconfigurations: what if I say the foobar server is node 2, but
> > > > it's actually node 3? This would make the logs extremely confusing.
> > > > I realize this may require a little finesse to do, but there's got
> > > > to be a way we can avoid hard-coding IDs.
> > >
> > > Fine. We can move this to KIP-631, but I think it would be a mistake
> > > to take IDs out of this configuration. For safety, the one thing that
> > > the configuration needs to tell us is what the IDs of the voters are.
> > > Without that, it's really easy for a cluster to get into a state
> > > where none of the quorum members agree on what the proper set of
> > > voters is. I think perhaps you are confused about the usage of these
> > > IDs: they are what enables validation of vote requests. Without them,
> > > a voter would have to accept a vote request from any ID. There is a
> > > reason that other consensus systems like ZooKeeper and etcd require
> > > IDs when configured statically.
> >
> > I hadn't considered the fact that we need to validate incoming voter
> > requests. The fact that nodes can have multiple DNS addresses does
> > make this difficult to do with just a list of hostnames.
> >
> > I guess you're right that we should keep the IDs. But let's be careful
> > to validate that the node's ID really is what we think it is, and
> > consider that peer failed if it's not.
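For reference, the static voter configuration under discussion ties each voter ID to an endpoint. In KIP-595 it takes roughly the following shape (the host names here are hypothetical):

    quorum.voters=1@controller-1:9093,2@controller-2:9093,3@controller-3:9093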
> > > > Also, here's another case where we are saying "broker" when we
> > > > mean "controller." It's really hard to break old habits. :)
> > >
> > > I think we still have this basic disagreement on the KIP-500 vision
> > > :). I'm not sure I understand why you are so eager to force users to
> > > think about the controller as a separate system. It's almost like
> > > Zookeeper is not going anywhere!
> >
> > Well, KIP-500 clearly does identify the controller as a separate
> > system, not as part of the broker, even if it runs in the same JVM. :)
> > A system where all the nodes had the same role would need a
> > fundamentally different design, like Cassandra or something.
> >
> > I know you're joking, but just so that others understand, it's not
> > fair to say that "it's almost like ZK is not going anywhere." KIP-500
> > clusters will have simpler deployment and support a lot of interesting
> > use cases, like single-JVM clusters, that would not be possible with
> > the current setup.
> >
> > At the same time, saying "broker" when you mean "controller" confuses
> > people. For example, I recently had someone ask why we needed
> > BrokerHeartbeat when Raft already specifies a mechanism for leader
> > change. I had to explain the difference between broker nodes and
> > controller nodes.
> >
> > Anyway, +1 (binding). Excited to see Raftka going forward!
> >
> > best,
> > Colin
> >
> > > -Jason
> > >
> > > On Mon, Aug 3, 2020 at 4:36 PM Jose Garcia Sancio
> > > <jsan...@confluent.io> wrote:
> > >
> > > > +1.
> > > >
> > > > Thanks for the detailed KIP!
> > > >
> > > > On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson
> > > > <ja...@confluent.io> wrote:
> > > >
> > > > > Hi All, I'd like to start a vote on this proposal:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > > > > The discussion has been active for a bit more than 3 months and
> > > > > I think the main points have been addressed. We have also moved
> > > > > some of the pieces into follow-up proposals, such as KIP-630.
> > > > >
> > > > > Please keep in mind that the details are bound to change as all
> > > > > of the pieces start coming together. As usual, we will keep this
> > > > > thread notified of such changes.
> > > > >
> > > > > For me personally, this is super exciting since we have been
> > > > > thinking about this work ever since I started working on Kafka!
> > > > > I am +1 of course.
> > > > >
> > > > > Best,
> > > > > Jason
> > > >
> > > > --
> > > > -Jose