Hi Josep,

I think there is some confusion here. Quorum reconfiguration is not needed for 
KRaft to become production ready. Confluent runs thousands of KRaft clusters 
without quorum reconfiguration, and has for years. While dynamic quorum 
reconfiguration is a nice feature, it doesn't block anything: not migration, 
not deployment. As best as I understand it, the use-case Aiven has isn't even 
reconfiguration per se, just wiping a disk. There are ways to handle this -- I 
discussed some earlier in the thread. I think it would be productive to 
continue that discussion -- especially the part around documentation and 
testing of these cases.

A lot of people have done a lot of work to get Kafka 4.0 ready. I would not 
want to delay that because we want an additional feature. And we will always 
want additional features. So I am concerned we will end up in an infinite loop 
of people asking for "just one more feature" before they migrate.

best,
Colin


On Mon, Nov 20, 2023, at 04:15, Josep Prat wrote:
> Hi all,
>
> I wanted to share my opinion regarding this topic. I know some 
> discussions happened some time ago (over a year) but I believe it's 
> wise to reflect and re-evaluate if those decisions are still valid.
> KRaft, as of Kafka 3.6.x and 3.7.x, does not yet have feature parity with 
> Zookeeper. By dropping Zookeeper altogether before achieving such 
> parity, we are opening the door to leaving a chunk of Apache Kafka 
> users without an easy way to upgrade to 4.0.
> In the interest of making upgrades as smooth as possible, I propose having a 
> Kafka version where KIP-853 is merged and Zookeeper is still supported. 
> This will enable community members who can't yet migrate to KRaft to do 
> so in a safe way (rolling back if something goes wrong). Additionally, 
> this will give us more confidence that KRaft can successfully replace 
> Zookeeper without any big problems, by discovering and fixing bugs or by 
> confirming that KRaft works as expected.
> For this reason I strongly believe we should have a 3.8.x version before 4.0.x.
>
> What do others think in this regard?
>
> Best,
>
> On 2023/11/14 20:47:10 Colin McCabe wrote:
>> On Tue, Nov 14, 2023, at 04:37, Anton Agestam wrote:
>> > Hi Colin,
>> >
>> > Thank you for your thoughtful and comprehensive response.
>> >
>> >> KIP-853 is not a blocker for either 3.7 or 4.0. We discussed this in
>> >> several KIPs that happened this year and last year. The most notable was
>> >> probably KIP-866, which was approved in May 2022.
>> >
>> > I understand this is the case, I'm raising my concern because I was
>> > foreseeing some major pain points as a consequence of this decision. Just
>> > to make it clear though: I am not asking for anyone to do work for me, and
>> > I understand the limitations of resources available to implement features.
>> > What I was asking is rather to consider the implications of _removing_
>> > features before there exists a replacement for them.
>> >
>> > I understand that the timeframe for 3.7 isn't feasible, and because of that
>> > I think what I was asking is rather: can we make sure that there are more
>> > 3.x releases until controller quorum online resizing is implemented?
>> >
>> > From your response, I gather that your stance is that it's important to
>> > drop ZK support sooner rather than later and that the necessary pieces for
>> > doing so are already in place.
>> 
>> Hi Anton,
>> 
>> Yes. I'm basically just repeating what we agreed upon in 2022 as part of 
>> KIP-833.
>> 
>> >
>> > ---
>> >
>> > I want to make sure I've understood your suggested sequence for controller
>> > node replacement. I hope the mentions of Kubernetes are just examples of
>> > how to carry things out, rather than saying "this is only supported on
>> > Kubernetes"?
>> 
>> Apache Kafka is supported in lots of environments, including non-k8s ones. I 
>> was just pointing out that using k8s means that you control your own DNS 
>> resolution, which simplifies matters. If you don't control DNS there are 
>> some extra steps for changing the quorum voters.
>> 
>> >
>> > Given we have three existing nodes as such:
>> >
>> > - a.local -> 192.168.0.100
>> > - b.local -> 192.168.0.101
>> > - c.local -> 192.168.0.102
>> >
>> > As well as a candidate node 192.168.0.103 that we want to take over the
>> > role of c.local.
>> >
>> > 1. Shut down controller process on node .102 (to make sure we don't "go
>> > back in time").
>> > 2. rsync state from leader to .103.
>> > 3. Start controller process on .103.
>> > 4. Point the c.local entry at .103.
>> >
>> > I have a few questions about this sequence:
>> >
>> > 1. Would this sequence be safe against leadership changes?
>> >
>> 
>> If the leader changes, the new leader should have all of the committed 
>> entries that the old leader had.
>> 
>> > 2. Does it work
>> 
>> Probably the biggest issue is dealing with "torn writes" that happen because 
>> you're copying the current log segment while it's being written to. The 
>> system should be robust against this. However, we don't regularly do this, 
>> so there hasn't been a lot of testing.
>> 
>> I think Jose had a PR for improving the handling of this which we might want 
>> to dig up. We'd want the system to auto-truncate the partial record at the 
>> end of the log, if there is one.
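>> 
>> For example (a sketch only -- the segment file name and directory layout
>> here are assumptions), you could inspect the copied segment with the
>> dump-log tool and check whether the last batch is complete:
>> 
>>   # Illustrative only; adjust the path to your metadata.log.dir layout.
>>   bin/kafka-dump-log.sh --cluster-metadata-decoder \
>>     --files /path/to/metadata.log.dir/__cluster_metadata-0/00000000000000000000.log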
>> 
>> > 3. By "state", do we mean `metadata.log.dir`? Something else?
>> 
>> Yes, the state of the metadata.log.dir. Keep in mind you will need to change 
>> the node ID in meta.properties after copying, of course.
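>> 
>> As a rough sketch (paths, hostnames, and the node ID below are just
>> examples, not a tested or official procedure):
>> 
>>   # Rough sketch only -- paths, hostnames, and IDs are assumptions.
>>   # 1. With the controller process on the old node stopped, copy the
>>   #    metadata log dir from the current leader to the new node:
>>   rsync -a leader.local:/var/lib/kafka/metadata/ /var/lib/kafka/metadata/
>>   # 2. Fix up the node ID in meta.properties so the new node takes over
>>   #    the replaced node's ID (102 in the example above); the exact
>>   #    property name depends on your meta.properties version:
>>   sed -i 's/^node.id=.*/node.id=102/' /var/lib/kafka/metadata/meta.properties
>>   # 3. Start the controller process, then point c.local at the new host.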
>> 
>> > 4. What are the effects on cluster availability? (I think this is the same
>> > as asking what happens if a or b crashes during the process, or if network
>> > partitions occur).
>> 
>> Cluster metadata state tends to be pretty small, typically a hundred 
>> megabytes or so. Therefore, I do not think it will take more than a second 
>> or two to copy from one node to another. However, if you do experience a 
>> crash when one node out of three is down, then you will be unavailable until 
>> you can bring up a second node to regain a majority.
>> 
>> >
>> > ---
>> >
>> > If this is considered the official way of handling controller node
>> > replacements, does it make sense to improve documentation in this area? Is
>> > there already a plan for this documentation laid out in some KIPs? This is
>> > something I'd be happy to contribute to.
>> >
>> 
>> Yes, I think we should have official documentation about this. We'd be happy 
>> to review anything in that area.
>> 
>> >> To circle back to KIP-853, I think it stands a good chance of making it
>> >> into AK 4.0.
>> >
>> > This sounds good, but the point I was making was if we could have a release
>> > with both KRaft and ZK supporting this feature to ease the migration out of
>> > ZK.
>> >
>> 
>> The problem is, supporting multiple controller implementations is a huge 
>> burden. So we don't want to extend the 3.x release past the point that's 
>> needed to complete all the must-dos (SCRAM, delegation tokens, JBOD).
>> 
>> best,
>> Colin
>> 
>> 
>> > BR,
>> > Anton
>> >
>> > On Thu, Nov 9, 2023, at 23:04, Colin McCabe <cmcc...@apache.org> wrote:
>> >
>> >> Hi Anton,
>> >>
>> >> It rarely makes sense to scale up and down the number of controller nodes
>> >> in the cluster. Only one controller node will be active at any given time.
>> >> The main reason to use 5 nodes would be to be able to tolerate 2 failures
>> >> instead of 1.
>> >>
>> >> At Confluent, we generally run KRaft with 3 controllers. We have not seen
>> >> problems with this setup, even with thousands of clusters. We have
>> >> discussed using 5 node controller clusters on certain very big clusters,
>> >> but we haven't done that yet. This is all very similar to ZK, where most
>> >> deployments were 3 nodes as well.
>> >>
>> >> KIP-853 is not a blocker for either 3.7 or 4.0. We discussed this in
>> >> several KIPs that happened this year and last year. The most notable was
>> >> probably KIP-866, which was approved in May 2022.
>> >>
>> >> Many users these days run in a Kubernetes environment where Kubernetes
>> >> actually controls the DNS. This makes changing the set of voters less
>> >> important than it was historically.
>> >>
>> >> For example, in a world with static DNS, you might have to change the
>> >> controller.quorum.voters setting from:
>> >>
>> >> 100@a.local:9073,101@b.local:9073,102@c.local:9073
>> >>
>> >> to:
>> >>
>> >> 100@a.local:9073,101@b.local:9073,102@d.local:9073
>> >>
>> >> In a world with k8s controlling the DNS, you simply remap c.local to point
>> >> to the IP address of your new pod for controller 102, and you're done. No
>> >> need to update controller.quorum.voters.
>> >>
>> >> Another question is whether you re-create the pod data from scratch every
>> >> time you add a new node. If you store the controller data on an EBS volume
>> >> (or cloud-specific equivalent), you really only have to detach it from the
>> >> previous pod and re-attach it to the new pod. k8s also handles this
>> >> automatically, of course.
>> >>
>> >> If you want to reconstruct the full controller pod state each time you
>> >> create a new pod (for example, so that you can use only instance storage),
>> >> you should be able to rsync that state from the leader. In general, the
>> >> invariant that we want to maintain is that the state should not "go back
>> >> in time" -- if controller 102 promised to hold all log data up to offset
>> >> X, it should come back with committed data up to at least that offset.
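>> >>
>> >> One way to sanity-check that invariant (illustrative only -- the exact
>> >> flags and output columns may differ between versions) is to compare the
>> >> replica's log end offset before and after the replacement:
>> >>
>> >>   # Describe quorum replication state; run against any broker.
>> >>   bin/kafka-metadata-quorum.sh --bootstrap-server broker1:9092 \
>> >>     describe --replication
>> >>   # The restored controller 102 should report a LogEndOffset that is at
>> >>   # least the offset it had acknowledged before being replaced.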
>> >>
>> >> There are lots of new features we'd like to implement for KRaft, and Kafka
>> >> in general. If you have some you really would like to see, I think
>> >> everyone in the community would be happy to work with you. The flip side,
>> >> of course, is that since there are an unlimited number of features we
>> >> could do, we can't really block the release for any one feature.
>> >>
>> >> To circle back to KIP-853, I think it stands a good chance of making it
>> >> into AK 4.0. Jose, Alyssa, and some other people have worked on it. It
>> >> definitely won't make it into 3.7, since we have only a few weeks left
>> >> before that release happens.
>> >>
>> >> best,
>> >> Colin
>> >>
>> >>
>> >> On Thu, Nov 9, 2023, at 00:20, Anton Agestam wrote:
>> >> > Hi Luke,
>> >> >
>> >> > We have been looking into what switching from ZK to KRaft will mean for
>> >> > Aiven.
>> >> >
>> >> > We heavily depend on an “immutable infrastructure” model for deployments.
>> >> > This means that, when we perform upgrades, we introduce new nodes to our
>> >> > clusters, scale the cluster up to incorporate the new nodes, and then
>> >> > phase the old ones out once all partitions are moved to the new
>> >> > generation. This allows us, and anyone else using a similar model, to do
>> >> > upgrades as well as cluster resizing with zero downtime.
>> >> >
>> >> > Reading up on KRaft and the ZK-to-KRaft migration path, this is somewhat
>> >> > worrying for us. It seems like, if KIP-853 is not included prior to
>> >> > dropping support for ZK, we will essentially have no satisfying upgrade
>> >> > path. Even if KIP-853 is included in 4.0, I’m unsure if that would allow
>> >> > a migration path for us, since a new cluster generation would not be
>> >> > able to use ZK during the migration step.
>> >> > On the other hand, if KIP-853 was released in a version prior to dropping
>> >> > ZK support, because it allows online resizing of KRaft clusters, this
>> >> > would allow us and others that use an immutable infrastructure deployment
>> >> > model to provide a zero downtime migration path.
>> >> >
>> >> > For that reason, we’d like to raise awareness around this issue and
>> >> > encourage considering the implementation of KIP-853 or equivalent a
>> >> > blocker not only for 4.0, but for the last version prior to 4.0.
>> >> >
>> >> > BR,
>> >> > Anton
>> >> >
>> >> > On 2023/10/11 12:17:23 Luke Chen wrote:
>> >> >> Hi all,
>> >> >>
>> >> >> Now that Kafka 3.6.0 is released, I’d like to start the discussion for
>> >> >> the “road to Kafka 4.0”. Based on the plan in KIP-833
>> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7>,
>> >> >> the next release, 3.7, will be the final release before moving to Kafka
>> >> >> 4.0 and removing Zookeeper from Kafka. Before making this major change,
>> >> >> I'd like to get consensus on the "must-have features/fixes for Kafka
>> >> >> 4.0", to avoid some users being surprised when upgrading to Kafka 4.0.
>> >> >> The intent is to have clear communication about what to expect in the
>> >> >> following months. In particular we should be signaling what features and
>> >> >> configurations are not supported, or at risk (if no one is able to add
>> >> >> support or fix known bugs).
>> >> >>
>> >> >> Here is the list of JIRA tickets
>> >> >> <https://issues.apache.org/jira/issues/?jql=labels%20%3D%204.0-blocker>
>> >> >> that I labeled as "4.0-blocker". The criteria I used for "4.0-blocker" are:
>> >> >> 1. The feature is supported in Zookeeper mode, but not yet supported in
>> >> >> KRaft mode (ex: KIP-858: JBOD in KRaft)
>> >> >> 2. Critical bugs in KRaft (ex: KAFKA-15489: split brain in KRaft
>> >> >> controller quorum)
>> >> >>
>> >> >> If you disagree with my current list, you are welcome to discuss it in
>> >> >> the specific JIRA ticket. Or, if you think there are some tickets I
>> >> >> missed, you are welcome to start a discussion in the JIRA ticket and ping
>> >> >> me or other people. After we reach consensus, we can label/unlabel
>> >> >> tickets accordingly. Again, the goal is to have open communication with
>> >> >> the community about what will be coming in 4.0.
>> >> >>
>> >> >> Below is a high-level categorization of the list content:
>> >> >>
>> >> >> 1. Recovery from disk failure
>> >> >> KIP-856
>> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery>:
>> >> >> KRaft Disk Failure Recovery
>> >> >>
>> >> >> 2. Prevote, to support more than 3 controllers
>> >> >> KIP-650
>> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics>:
>> >> >> Enhance Kafkaesque Raft semantics
>> >> >>
>> >> >> 3. JBOD support
>> >> >> KIP-858
>> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft>:
>> >> >> Handle JBOD broker disk failure in KRaft
>> >> >>
>> >> >> 4. Scale up/down Controllers
>> >> >> KIP-853
>> >> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes>:
>> >> >> KRaft Controller Membership Changes
>> >> >>
>> >> >> 5. Modifying dynamic configurations on the KRaft controller
>> >> >>
>> >> >> 6. Critical bugs in KRaft
>> >> >>
>> >> >> Does this make sense?
>> >> >> Any feedback is welcomed.
>> >> >>
>> >> >> Thank you.
>> >> >> Luke
>> >> >>
>> >>
>>
