Hi Anton,

It rarely makes sense to scale the number of controller nodes in the cluster up and down. Only one controller node will be active at any given time. The main reason to use 5 nodes is to tolerate 2 failures instead of 1.
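For concreteness: a KRaft quorum can only make progress while a majority of its voters are available, so a cluster of n voters tolerates floor((n-1)/2) failures:

    3 voters -> majority of 2 -> tolerates 1 failure
    4 voters -> majority of 3 -> tolerates 1 failure (no gain over 3)
    5 voters -> majority of 3 -> tolerates 2 failures

This is why voter counts are almost always odd.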
At Confluent, we generally run KRaft with 3 controllers. We have not seen problems with this setup, even across thousands of clusters. We have discussed using 5-node controller quorums on certain very large clusters, but we haven't done that yet. This is all very similar to ZK, where most deployments were 3 nodes as well.

KIP-853 is not a blocker for either 3.7 or 4.0. We discussed this in several KIPs this year and last year. The most notable was probably KIP-866, which was approved in May 2022.

Many users these days run in a Kubernetes environment where Kubernetes actually controls the DNS. This makes changing the set of voters less important than it was historically. For example, in a world with static DNS, you might have to change the controller.quorum.voters setting from:

    100@a.local:9073,101@b.local:9073,102@c.local:9073

to:

    100@a.local:9073,101@b.local:9073,102@d.local:9073

In a world with k8s controlling the DNS, you simply remap c.local to point to the IP address of your new pod for controller 102, and you're done. No need to update controller.quorum.voters.
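To make that concrete, here is a minimal sketch of the Kubernetes side, assuming the controllers run as a StatefulSet governed by a headless Service (the names, namespace, and port below are hypothetical):

    # Headless Service: clusterIP: None gives each StatefulSet pod a stable
    # DNS name of the form <pod>.<service>.<namespace>.svc.cluster.local.
    apiVersion: v1
    kind: Service
    metadata:
      name: kafka-controller
    spec:
      clusterIP: None
      selector:
        app: kafka-controller
      ports:
        - name: controller
          port: 9073

With a StatefulSet named kafka-controller using this Service, the pod kafka-controller-2 is always reachable at kafka-controller-2.kafka-controller.default.svc.cluster.local, no matter how often the pod is recreated or which IP it receives, so a controller.quorum.voters value built from those stable names never has to change.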
Another question is whether you re-create the pod data from scratch every time you add a new node. If you store the controller data on an EBS volume (or a cloud-specific equivalent), you really only have to detach it from the previous pod and re-attach it to the new pod. k8s also handles this automatically, of course.

If you want to reconstruct the full controller pod state each time you create a new pod (for example, so that you can use only instance storage), you should be able to rsync that state from the leader. In general, the invariant we want to maintain is that the state should not "go back in time" -- if controller 102 promised to hold all log data up to offset X, it should come back with committed data up to at least that offset. (A rough sketch of this follows at the end of this message, below the quoted thread.)

There are lots of new features we'd like to implement for KRaft, and for Kafka in general. If there are some you would really like to see, I think everyone in the community would be happy to work with you. The flip side, of course, is that since there is an unlimited number of features we could do, we can't really block the release on any one of them.

To circle back to KIP-853: I think it stands a good chance of making it into AK 4.0. Jose, Alyssa, and some other people have worked on it. It definitely won't make it into 3.7, since we have only a few weeks left before that release happens.

best,
Colin

On Thu, Nov 9, 2023, at 00:20, Anton Agestam wrote:
> Hi Luke,
>
> We have been looking into what switching from ZK to KRaft will mean for
> Aiven.
>
> We heavily depend on an “immutable infrastructure” model for deployments.
> This means that, when we perform upgrades, we introduce new nodes to our
> clusters, scale the cluster up to incorporate the new nodes, and then phase
> the old ones out once all partitions have moved to the new generation. This
> allows us, and anyone else using a similar model, to do upgrades as well as
> cluster resizing with zero downtime.
>
> Reading up on KRaft and the ZK-to-KRaft migration path, this is somewhat
> worrying for us. It seems that, if KIP-853 is not included prior to
> dropping support for ZK, we will essentially have no satisfying upgrade
> path. Even if KIP-853 is included in 4.0, I’m unsure whether that would
> allow a migration path for us, since a new cluster generation would not be
> able to use ZK during the migration step.
>
> On the other hand, if KIP-853 were released in a version prior to dropping
> ZK support, then, because it allows online resizing of KRaft clusters, it
> would allow us, and others that use an immutable-infrastructure deployment
> model, to provide a zero-downtime migration path.
>
> For that reason, we’d like to raise awareness around this issue and
> encourage treating the implementation of KIP-853, or an equivalent, as a
> blocker not only for 4.0 but also for the last version prior to 4.0.
>
> BR,
> Anton
>
> On 2023/10/11 12:17:23 Luke Chen wrote:
>> Hi all,
>>
>> Now that Kafka 3.6.0 is released, I’d like to start the discussion for
>> the “road to Kafka 4.0”. Based on the plan in KIP-833
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7>,
>> the next release, 3.7, will be the final release before moving to Kafka
>> 4.0 and removing ZooKeeper from Kafka. Before making this major change,
>> I'd like to get consensus on the "must-have features/fixes for Kafka
>> 4.0", to avoid some users being surprised when upgrading to Kafka 4.0.
>> The intent is to have clear communication about what to expect in the
>> following months. In particular, we should be signaling which features
>> and configurations are not supported, or at risk (if no one is able to
>> add support or fix known bugs).
>>
>> Here is the list of JIRA tickets
>> <https://issues.apache.org/jira/issues/?jql=labels%20%3D%204.0-blocker>
>> I labeled as "4.0-blocker". The criteria for the "4.0-blocker" label are:
>> 1. The feature is supported in ZooKeeper mode, but not yet supported in
>> KRaft mode (ex: KIP-858: JBOD in KRaft)
>> 2. Critical bugs in KRaft (ex: KAFKA-15489: split brain in the KRaft
>> controller quorum)
>>
>> If you disagree with my current list, you are welcome to discuss it in
>> the specific JIRA ticket. Or, if you think there are some tickets I
>> missed, feel free to start a discussion in the JIRA ticket and ping me
>> or other people. After we reach consensus, we can label/unlabel tickets
>> accordingly. Again, the goal is to have open communication with the
>> community about what will be coming in 4.0.
>>
>> Below are the high-level categories of the list content:
>>
>> 1. Recovery from disk failure
>> KIP-856
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery>:
>> KRaft Disk Failure Recovery
>>
>> 2. Pre-vote, to support more than 3 controllers
>> KIP-650
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics>:
>> Enhance Kafkaesque Raft semantics
>>
>> 3. JBOD support
>> KIP-858
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft>:
>> Handle JBOD broker disk failure in KRaft
>>
>> 4. Scale up/down controllers
>> KIP-853
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes>:
>> KRaft Controller Membership Changes
>>
>> 5. Modifying dynamic configurations on the KRaft controller
>>
>> 6. Critical bugs in KRaft
>>
>> Does this make sense?
>> Any feedback is welcome.
>>
>> Thank you.
>> Luke
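As mentioned above, here is a rough sketch of reconstructing controller state with rsync. The paths and host name are hypothetical, and this is an illustration of the idea rather than a supported procedure:

    # With the controller process on the new pod stopped, mirror the
    # metadata log directory from the current quorum leader.
    # -a preserves permissions/timestamps; --delete removes stale local files.
    rsync -a --delete controller-leader.local:/var/lib/kraft/metadata/ /var/lib/kraft/metadata/

    # Before starting the controller, verify that the copied log reaches at
    # least the highest offset this node previously promised to hold, so its
    # state does not "go back in time".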