Re: [DISCUSS] KIP-1347: Overriding voter set on storage formatting

Paolo Patierno Thu, 04 Jun 2026 00:55:00 -0700

Hi Luke,

> LC1


Good point, I updated the KIP.

> LC2

I am not sure how this relates to KIP-1262. As mentioned in the "Broker
considerations" section of my KIP, we would still need to format the brokers
with the override in order to write the new VoterSet, so we could not
leverage what KIP-1262 brings.

Thanks,
Paolo.

On Tue, 2 Jun 2026 at 09:28, Luke Chen <[email protected]> wrote:

> Hi Paolo,
>
> Thanks for the KIP.
>
> Regarding KW3:
> Suppose we allow users to override with [0,1,2] while the local voter set
> is just [0,1]. What does it mean when a controller has no registration
> record for controller 2, but controller 2 is one of the voters? I think it
> should be similar to formatting a voter with --initial-controllers [0,1,2]
> when controller 2 is behind a network partition (or similar) at startup:
> only [0,1] forms the quorum, so the VoterSet will be [0,1] in the end. Is
> my understanding correct? But what if users override with [0,1,2,3,4]
> while the local voter set is just [0,1]? Then [0,1] can't form a quorum,
> because the majority is 3.
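
The majority arithmetic in the paragraph above can be checked with a minimal sketch. The function names are hypothetical and not part of Kafka; the only assumption is the usual rule that a quorum needs strictly more than half of the voters in the voter set.

```python
from typing import Set


def majority(voter_count: int) -> int:
    """Smallest number of voters that is strictly more than half."""
    return voter_count // 2 + 1


def can_form_quorum(voter_set: Set[int], reachable: Set[int]) -> bool:
    """True if enough members of voter_set are reachable to reach a majority."""
    return len(voter_set & reachable) >= majority(len(voter_set))


# Override [0,1,2] with only controllers 0 and 1 available: 2 of 3 voters.
three_voters_ok = can_form_quorum({0, 1, 2}, {0, 1})
# Override [0,1,2,3,4] with only controllers 0 and 1 available: 2 of 5 voters.
five_voters_ok = can_form_quorum({0, 1, 2, 3, 4}, {0, 1})
```

With five voters the majority is 3, so two reachable controllers are not enough, which is the case raised at the end of the paragraph.
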
>
> LC1: "The same issue doesn’t arise when using the static quorum mode,
> because as soon as the controller.quorum.voters parameter is updated by
> using the new DNS hostnames for the controllers and they restart, the
> KRaft quorum is formed as the controller.quorum.voters list is used as the
> source of truth."
> I think the dynamic quorum can currently still recover from a DNS hostname
> change if the number of changed nodes is less than a majority, is that
> right? The description in the "motivation" section makes readers think
> that a dynamic quorum cannot recover at all once a DNS change has
> happened. It'd be better to make this clear.
>
> LC2: It'd be good to mention something about KIP-1262
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1262%3A+Enable+auto-formatting+directories>:
> > After this KIP, it is no longer required for nodes to run kafka-storage
> > format in order to start kafka. Additionally, the --cluster-id argument
> > for kafka-storage format will now be optional, rather than required.
>
> I think it doesn't change your current design because we're formatting the
> voters, not the brokers/observers. But it'd be great if you could mention
> it to show that you have already considered it.
>
>
> Thank you,
> Luke
>
> On Mon, May 25, 2026 at 5:27 PM Paolo Patierno <[email protected]>
> wrote:
>
> > Hi José and Kevin, thank you both for the very useful feedback!
> > My thoughts follow.
> >
> > > JS1
> > I will make it clearer what I meant by adding your suggestion.
> >
> > > JS2
> > While a separate tool is a viable alternative for manual operations, I
> > believe integrating this functionality into the existing kafka-storage
> > tool is essential for broader deployment scenarios, particularly
> > automated cloud-native environments.
> > When the Apache Kafka cluster runs on bare metal or VMs and is operated
> > by humans, a new tool works fine and there is not much difference
> > compared with using an existing tool. Still, from my perspective, the
> > storage tool exists to format storage (controller or broker), and the
> > formatting operation already includes VoterSet initialization. A flag to
> > override an existing VoterSet is just an additional option for recovering
> > from a failure/disaster scenario. I can't see how a separate tool for
> > that would be really useful.
> > The kafka-storage tool's fundamental purpose is to prepare storage for
> > cluster operation, which naturally includes both initial voter set
> > creation and recovery scenarios where that voter set needs correction.
> > Adding --override-voters as an optional flag maintains this conceptual
> > coherence while enabling disaster recovery without introducing a separate
> > tool that would require different operational workflows and orchestration
> > logic.
> > On the other hand, Apache Kafka clusters are increasingly deployed on
> > Kubernetes via operators like Strimzi (where I'm a core maintainer).
> > These operators fundamentally rely on idempotent reconciliation loops,
> > where the same logic runs repeatedly, comparing desired state with actual
> > state and making necessary corrections. Critically, these reconciliation
> > loops cannot distinguish between normal operations and disaster recovery
> > scenarios without introducing significant complexity and fragility.
> > Consider how Strimzi operates today: When a pod starts, whether it's the
> > first time (new cluster), a rolling restart (configuration change), or a
> > crash recovery, the exact same startup script runs on the node.
> > The script invokes the storage formatter with the ignore-formatted flag,
> > which makes the operation idempotent: if the storage is unformatted, it
> > formats it; if already formatted, it proceeds without error. This
> > simplicity is what makes the operator reliable across all scenarios.
> > With dynamic quorum and the proposed --override-voters flag, this same
> > pattern extends naturally: the operator always provides the current
> > desired voter set (from the cluster configuration) and enables override
> > mode. If the storage is unformatted, it formats with the provided voter
> > set. If already formatted and the voter endpoints match, it's a no-op. If
> > already formatted but the voter endpoints differ (DNS change), it
> > performs the override. The operator doesn't need to know which scenario
> > it's in, because the tool behaves correctly in all cases.
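
The idempotent behavior described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the KIP's implementation. A voter set is modeled as a hypothetical mapping from voter id to "host:port" (real voter sets carry more, such as directory ids), and the "reject" branch for differing voter ids is an assumption taken from the topology-change rule discussed elsewhere in this thread.

```python
from typing import Dict, Optional

# Hypothetical model: voter id -> "host:port" endpoint.
VoterSet = Dict[int, str]


def plan_action(persisted: Optional[VoterSet], desired: VoterSet) -> str:
    """Decide what one idempotent format invocation should do."""
    if persisted is None:
        return "format"    # unformatted storage: format with the desired voters
    if persisted == desired:
        return "noop"      # already formatted and the endpoints match
    if persisted.keys() == desired.keys():
        return "override"  # same voter ids, endpoints differ (e.g. DNS change)
    return "reject"        # voter ids differ: assumed to be a topology change


desired = {0: "ctrl-0.new:9093", 1: "ctrl-1.new:9093"}
first_start = plan_action(None, desired)
restart = plan_action(dict(desired), desired)
after_dns_change = plan_action({0: "ctrl-0.old:9093", 1: "ctrl-1.old:9093"}, desired)
```

Because the caller passes the same desired voter set on every start, the startup script does not need to know which of the scenarios it is in.
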
> >
> > > JS3
> > Given the reasoning in my previous answer, you can see that the process
> > you describe here can't work in an automated environment.
> > It's mostly a manual approach that also involves copy-pasting snapshots
> > across nodes.
> > In a cloud-native environment, there is no "stop the world" step where
> > you can shut down everything, because Kubernetes will restart pods for
> > you. The recovery should happen seamlessly while rolling the nodes.
> >
> > > KW1
> > Good point, you are right. You can always move to dynamic quorum and
> > then change the node configuration by replacing the controllers
> > bootstrap setting with controller.quorum.voters; the cluster will
> > continue to use dynamic quorum for backward compatibility. I will think
> > more about this. The only drawback is that "failing fast" won't be an
> > option anymore, because the tool has to read the log first to get the
> > KRaftVersionRecord and VotersRecord.
> >
> > > KW2
> > I am not sure what you are suggesting here. Maybe removing all the
> > metadata-related concepts I wrote? I put them there to make it clearer
> > how the overall snapshot creation works, not just with control records
> > but also with the full metadata records. I think they are useful details
> > for a better understanding of the KIP.
> >
> > > KW3
> > I see your point here. What if we relax the topology-change rule and
> > instead check that the override voter set is a superset of the persisted
> > one? So if we are trying to override with [0,1,2] but the local voter set
> > is just [0,1], the tool could proceed anyway, not considering it a
> > topology change, and allow the override. Alternatively, we could add a
> > "--force" flag, but it could break the idempotence I am stressing for
> > automated environments, so it's not my preferred option here.
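
The relaxed rule proposed above amounts to a set comparison. A minimal sketch follows; the function name is hypothetical, and only voter ids are compared (endpoints and directory ids are left out for brevity).

```python
from typing import Set


def is_allowed_override(persisted_ids: Set[int], override_ids: Set[int]) -> bool:
    """Relaxed rule: the override may add voters but must keep every persisted one."""
    return override_ids >= persisted_ids


# Local voter set [0,1], override [0,1,2]: superset, so the override proceeds.
superset_case = is_allowed_override({0, 1}, {0, 1, 2})
# Local voter set [0,1,2], override [0,1]: a voter would be dropped, so it is refused.
shrinking_case = is_allowed_override({0, 1, 2}, {0, 1})
```

An override identical to the persisted set also passes, since every set is a superset of itself, so repeated invocations stay idempotent.
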
> >
> > Thanks,
> > Paolo
> >
> > On Wed, 20 May 2026 at 01:29, Kevin Wu <[email protected]> wrote:
> >
> > > Hi Paolo,
> > >
> > > Thanks for the KIP. I have a few questions/comments:
> > >
> > > KW1: In your validation rules section, you say "Requires dynamic
> > > quorum: If controller.quorum.voters is configured (static quorum), the
> > > command fails." Technically, the presence of this configuration does
> > > not always mean the cluster is using static quorum. What "really"
> > > determines if a given node knows the cluster is using dynamic quorum is
> > > if the KRaftVersionRecord and VotersRecord control records are present
> > > in its local log. If both the config and the records are present, the
> > > config is ignored. Failing when that config is found is okay. However,
> > > I think a more ideal behavior is that the tool fails if
> > > `controller.quorum.voters` is defined, and if a `KRaftVersionRecord` +
> > > `VotersRecord` is found in the snapshot, instruct the caller to remove
> > > that config, and then try again. What do you think?
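
One possible reading of the validation suggested in KW1, sketched with hypothetical names and message text. The two boolean inputs stand in for "controller.quorum.voters is defined" and "KRaftVersionRecord and VotersRecord were found in the snapshot"; how the tool would actually detect either is not shown.

```python
from typing import Optional


def check_quorum_config(voters_config_defined: bool,
                        dynamic_records_in_snapshot: bool) -> Optional[str]:
    """Return an error message if the tool should fail, or None to proceed."""
    if not voters_config_defined:
        return None
    if dynamic_records_in_snapshot:
        # The cluster is really on dynamic quorum; the config is stale.
        return ("controller.quorum.voters is defined but the snapshot contains "
                "KRaftVersionRecord and VotersRecord: remove that config and try again")
    # No control records found: the node is assumed to be on static quorum.
    return "controller.quorum.voters is defined (static quorum): the command fails"


stale_config_error = check_quorum_config(True, True)
static_quorum_error = check_quorum_config(True, False)
no_config_result = check_quorum_config(False, True)
```

In both failing cases the command stops, and only the guidance given to the caller differs.
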
> > >
> > > KW2: I see in your "Implementation Overview" section there are a lot of
> > > references to metadata layer concepts. One thing that may simplify
> > > things a lot is that your CLI command only needs to be aware of the
> > > control records in a snapshot you are trying to recover, since the rest
> > > of the metadata state should stay completely the same during the
> > > proposed recovery process.
> > >
> > > KW3: In the case of running this command on each node, what happens
> > > when nodes disagree on who the voters are (i.e. the voter set is not
> > > the same across all nodes)? This is a scenario that can happen (e.g.
> > > initial bootstrap voter set not on all nodes yet, or VotersRecord''
> > > produced by removing or adding a voter from VotersRecord' has not
> > > replicated to every node). From reading the KIP, it sounds like the
> > > same command invocation would fail on some nodes, who consider this a
> > > "topology" change, but pass on others. I think this and KW1 are
> > > motivations for José's proposed workflow of copying the "recovered"
> > > snapshot from the longest log around to all nodes. I do like the idea
> > > that voter topology cannot change as a result of this CLI call though.
> > >
> > > Best,
> > > Kevin Wu
> > >
> > > On Tue, May 19, 2026 at 4:37 AM José Armando García Sancio via dev <
> > > [email protected]> wrote:
> > >
> > > > Hi Paolo,
> > > >
> > > > Thanks a lot for the KIP. This feature would be very helpful to let
> > > > users recover their Kafka clusters. This is a partial review as I
> > > > wanted to give you some feedback as soon as possible.
> > > >
> > > > JS1
> > > > > Furthermore, there is no safe recovery from majority loss. For
> > > > > example, if 2 of 3 controllers are permanently gone, you cannot
> > > > > update the VotersRecord and must re-bootstrap with data loss.
> > > >
> > > > If the user loses 2 out of 3 controllers, metadata loss is possible.
> > > > Kafka cannot recover from metadata loss. For example, if the metadata
> > > > loss includes the leader epoch or ISR/ELR, Kafka cannot recover from
> > > > those cases without additional data loss.
> > > >
> > > > JS2
> > > > I am wondering if we should have a tool specific to these use cases
> > > > instead of reusing the kafka-storage tool. I like etcd's CLI
> > > > organization. They have etcdctl which communicates with an active
> > > > cluster. They have etcdutl which recovers an inactive cluster. In our
> > > > case it would be beneficial to have a tool specific to recovering an
> > > > inactive cluster. How about naming it kafka-recovery? I will use this
> > > > CLI name in the rest of my response but I am open to name
> > > > suggestions.
> > > >
> > > > JS3
> > > > What do you think of including a section on how to use the tool? When
> > > > we document this tool/feature, we can copy that section to the Kafka
> > > > documentation. From my perspective this is what users need to do to
> > > > use this tool.
> > > > 1. Shut down all controllers.
> > > > 2. Pick the controller that has the longest cluster metadata log. The
> > > > controller with the longest log is guaranteed to have all of the
> > > > committed data. They would need a command like "kafka-recovery
> > > > metadata log-length (--metadata-log-dir|--config)". This command
> > > > would print the log end epoch and offset so that the user can compare
> > > > them with the other controllers.
> > > > 3. On the controller with the longest cluster metadata log, generate
> > > > the latest snapshot if one doesn't already exist. The user can back
> > > > up this snapshot in case they recover it incorrectly. E.g.
> > > > "kafka-recovery metadata generate-checkpoint
> > > > (--metadata-log-dir|--config)".
> > > > 4. Recover the controller's default endpoint or listener. I think we
> > > > should limit this functionality to recovering only the default
> > > > controller listener. The default controller listener is the first
> > > > listener in "controller.listener.names". This is the listener that
> > > > Kafka uses for outgoing connections and RPCs to the controllers. E.g.
> > > > "kafka-recovery metadata override-endpoint --endpoint 0@host:port
> > > > --endpoint 1@host:port ... --config ...". The command would only
> > > > override the endpoints specified. E.g. if there are 3 controllers but
> > > > the user only overrides one endpoint, the tool will only fix that one
> > > > endpoint. What are your thoughts?
> > > > 5. Copy the generated checkpoint to all the controllers and brokers.
> > > > Copying the generated checkpoint to all controllers and brokers is
> > > > slightly inconvenient. The issue is that KRaft won't replicate this
> > > > checkpoint if the replicas (controllers and brokers) have already
> > > > replicated up to the leader's log start offset.
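
The comparison in step 2 could look like the sketch below. It assumes that "longest log" means the greatest log end when comparing epoch first and offset second, which is the usual Raft ordering. The function name and the input mapping are hypothetical; the values would come from whatever the proposed log-length command prints on each controller.

```python
from typing import Dict, Tuple

# Hypothetical input: controller id -> (log end epoch, log end offset).
LogEnds = Dict[int, Tuple[int, int]]


def pick_longest_log(log_ends: LogEnds) -> int:
    """Return the controller id with the greatest (epoch, offset) log end."""
    # Tuples compare element by element, so the epoch decides first and the
    # offset only breaks ties between equal epochs.
    return max(log_ends, key=lambda node_id: log_ends[node_id])


chosen = pick_longest_log({0: (5, 100), 1: (6, 90), 2: (6, 95)})
```

Controller 0 has the highest offset of the three but an older epoch, so it is not chosen; between controllers 1 and 2 the offset decides.
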
> > > >
> > > > As an alternative to step 5, they must run "kafka-recovery metadata
> > > > override-endpoint --endpoint 0@host:port --endpoint 1@host:port ...
> > > > --config ..." on all of the replicas. Running this command on all
> > > > replicas is problematic because the voter set might differ across
> > > > nodes due to dynamic voters/controllers.
> > > >
> > > > Thanks,
> > > > --
> > > > -José
> > > >
> > >
> >
> >
> > --
> > Paolo Patierno
> >
> > *Senior Principal Software Engineer @ IBM*
> > *CNCF Ambassador*
> >
> > Twitter : @ppatierno <http://twitter.com/ppatierno>
> > Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
> > GitHub : ppatierno <https://github.com/ppatierno>
> >
>


-- 
Paolo Patierno

*Senior Principal Software Engineer @ IBM*
*CNCF Ambassador*

Twitter : @ppatierno <http://twitter.com/ppatierno>
Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
GitHub : ppatierno <https://github.com/ppatierno>
