Re: [DISCUSS] KIP-1347: Overriding voter set on storage formatting

Kevin Wu Tue, 19 May 2026 16:29:02 -0700

Hi Paolo,

Thanks for the KIP. I have a few questions/comments:

KW1: In your validation rules section, you say "Requires dynamic quorum: If
controller.quorum.voters is configured (static quorum), the command fails."
Technically, the presence of this configuration does not always mean the
cluster is using static quorum. What "really" determines if a given node
knows the cluster is using dynamic quorum is if the KRaftVersionRecord and
VotersRecord control records are present in its local log. If both the
config and the records are present, the config is ignored. Failing when
that config is found is okay. However, I think a better behavior is that
the tool still fails if `controller.quorum.voters` is defined, but when a
`KRaftVersionRecord` + `VotersRecord` is found in the snapshot, it instructs
the caller to remove that config and then try again. What do you think?
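
To make the KW1 suggestion concrete, here is a rough sketch of the decision
I have in mind. Everything in it is hypothetical: the function name, the
three boolean inputs, and the message strings are made up for illustration
and are not taken from the KIP or from the Kafka code base. It also leaves
open what the tool should do when the config is absent and the control
records are missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    proceed: bool
    message: str


def check_quorum_mode(static_voters_configured: bool,
                      has_kraft_version_record: bool,
                      has_voters_record: bool) -> Decision:
    """Hypothetical pre-flight check, not the KIP's actual implementation.

    static_voters_configured: controller.quorum.voters is set in the config.
    has_*_record: the control record was found in the snapshot.
    """
    if not static_voters_configured:
        # Assumption: no static config means the command may continue.
        return Decision(True, "controller.quorum.voters is not set; continuing")

    if has_kraft_version_record and has_voters_record:
        # The node is really on dynamic quorum and the config is ignored,
        # so tell the caller how to get past the failure.
        return Decision(
            False,
            "controller.quorum.voters is set but ignored because the snapshot "
            "contains KRaftVersionRecord and VotersRecord; remove the config "
            "and try again",
        )

    # Config present and no dynamic quorum records: treat as static quorum.
    return Decision(
        False,
        "controller.quorum.voters is set and no dynamic quorum control "
        "records were found; static quorum is not supported by this command",
    )
```

In both failure cases the command still fails; the only difference is
whether the message tells the caller that removing the config is enough.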

KW2: I see in your "Implementation Overview" section there are a lot of
references to metadata layer concepts. One thing that may simplify things a
lot is that your CLI command only needs to be aware of the control records
in a snapshot you are trying to recover, since the rest of the metadata
state should stay completely the same during the proposed recovery process.

KW3: In the case of running this command on each node, what happens when
nodes disagree on who the voters are (i.e. the voter set is not the same
across all nodes)? This is a scenario that can happen (e.g. initial
bootstrap voter set not on all nodes yet, or VotersRecord'' produced by
removing or adding a voter from VotersRecord' has not replicated to every
node). From reading the KIP, it sounds like the same command invocation
would fail on some nodes, which consider this a "topology" change, but pass
on others. I think this and KW1 are motivations for José's proposed
workflow of copying the "recovered" snapshot from the longest log around to
all nodes. I do like the idea that voter topology cannot change as a result
of this CLI call though.
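
A small sketch of the KW3 scenario, to show why the same invocation can give
different results per node. It assumes, purely for illustration, that the
"no topology change" rule is "the voter IDs in the override must equal the
voter IDs in the node's local voter set"; the KIP's actual rule may differ,
and the function names and node IDs below are invented.

```python
def passes_topology_check(local_voter_ids, override_voter_ids):
    # Assumed rule: the override may change endpoints but not the set of IDs.
    return set(local_voter_ids) == set(override_voter_ids)


def run_on_all_nodes(local_voter_sets, override_voter_ids):
    # Simulate running the same command on every node, each with its own
    # locally known voter set.
    return {
        node_id: passes_topology_check(voter_ids, override_voter_ids)
        for node_id, voter_ids in local_voter_sets.items()
    }


# Nodes 0 and 1 have not yet replicated the VotersRecord that adds voter 3;
# node 2 has.
local_voter_sets = {
    0: {0, 1, 2},
    1: {0, 1, 2},
    2: {0, 1, 2, 3},
}

results = run_on_all_nodes(local_voter_sets, override_voter_ids={0, 1, 2})
print(results)
```

Under that assumed rule the command passes on nodes 0 and 1 and fails on
node 2, even though the operator typed exactly the same thing everywhere.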

Best,
Kevin Wu

On Tue, May 19, 2026 at 4:37 AM José Armando García Sancio via dev <
[email protected]> wrote:

> Hi Paolo,
>
> Thanks a lot for the KIP. This feature would be very helpful to let
> users recover their Kafka clusters. This is a partial review as I wanted
> to give you some feedback as soon as possible.
>
> JS1
> > Furthermore, there is no safe recovery from majority loss. For example,
> if 2 of 3 controllers are permanently gone, you cannot update the
> VotersRecord and must re-bootstrap with data loss.
>
> If the user loses 2 out of 3 controllers, metadata loss is possible.
> Kafka cannot recover from metadata loss. For example, if the metadata
> loss includes the leader epoch or ISR/ELR, Kafka cannot recover from
> those cases without additional data loss.
>
> JS2
> I am wondering if we should have a tool specific to these use cases
> instead of reusing the kafka-storage tool. I like etcd's CLI
> organization. They have etcdctl which communicates with an active
> cluster. They have etcdutl which recovers an inactive cluster. In our
> case it would be beneficial to have a tool specific to recovering an
> inactive cluster. How about naming it kafka-recovery? I will use the
> CLI name in the rest of my response but I am open to name suggestions.
>
> JS3
> What do you think of including a section on how to use the tool? When
> we document this tool/feature, we can copy that section to the Kafka
> documentation. From my perspective this is what they need to do to use
> this tool.
> 1. Shut down all controllers.
> 2. Pick the controller that has the longest cluster metadata log. The
> controller with the longest log is guaranteed to have all of the
> committed data. They would need a command like "kafka-recovery
> metadata log-length (--metadata-log-dir|--config)". This command would
> print the log end epoch and offset so that the user can compare them
> with the other controllers.
> 3. On the controller with the longest cluster metadata log, generate
> the latest snapshot if one doesn't already exist. The user can back up
> this snapshot in case they incorrectly recover the snapshot. E.g.
> "kafka-recovery metadata generate-checkpoint
> (--metadata-log-dir|--config)".
> 4. Recover the controller's default endpoint or listener. I think we should
> limit this functionality to recovering only the default controller
> listener. The default controller listener is the first listener in
> "controller.listener.names". This is the listener that Kafka uses for
> outgoing connections and RPCs to the controllers. E.g. "kafka-recovery
> metadata override-endpoint --endpoint 0@host:port --endpoint
> 1@host:port ... --config ...". The command would only override the
> endpoints specified. E.g. if there are 3 controllers but the user only
> overrides one endpoint, the tool will only fix that one endpoint. What
> are your thoughts?
> 5. Copy the generated checkpoint to all the controllers and brokers.
> Copying the generated checkpoint to all controllers and brokers is
> slightly inconvenient. The issue is that KRaft won't replicate this
> checkpoint if the replicas (controllers and brokers) have already
> replicated up to the leader's log start offset.
>
> As an alternative to step 5, they must run "kafka-recovery metadata
> override-endpoint --endpoint 0@host:port --endpoint 1@host:port ...
> --config ..." on all of the replicas. Running this command on all
> replicas is problematic because the voter set might differ across
> nodes due to dynamic voters/controllers.
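
For step 2 of the quoted workflow, here is a minimal sketch of how the
printed log end epoch and offset could be compared across controllers. It
assumes the usual Raft-style ordering (higher epoch wins, then higher
offset); the `LogEnd` type, the function name, and the sample values are
hypothetical and only stand in for whatever the proposed "log-length"
command would print.

```python
from typing import Dict, NamedTuple


class LogEnd(NamedTuple):
    # Field order matters: tuples compare epoch first, then offset.
    epoch: int
    offset: int


def pick_longest_log(log_ends: Dict[int, LogEnd]) -> int:
    """Return the controller ID whose (epoch, offset) is the largest."""
    if not log_ends:
        raise ValueError("no controllers to compare")
    return max(log_ends, key=lambda node_id: log_ends[node_id])


# Made-up values, as if collected by hand from each controller.
log_ends = {
    0: LogEnd(epoch=5, offset=120),
    1: LogEnd(epoch=6, offset=100),
    2: LogEnd(epoch=6, offset=90),
}

chosen = pick_longest_log(log_ends)
print(chosen)
```

Here controller 0 has the largest offset but an older epoch, so controller 1
is chosen.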
>
> Thanks,
> --
> -José
>
