Hi all,

Thanks Juha for bringing this discussion here. To everyone else, I am
Juha's colleague at Aiven and am currently working on introducing the kind
of tooling discussed in this thread, to be used in worst-case scenarios. I
have a proof of concept working. The various inputs and concerns raised
here are very valuable. I'm hopeful this discussion can lead to a path forward
for introducing some kind of emergency tooling for this scenario into
Apache Kafka itself.

José, on this point:

> I am hesitant to give users the impression that Kafka can tolerate and
recover from data loss in the cluster metadata partition.

That's a very fair point, but I believe it can instead inform the naming
and documentation of this feature, as well as the addition of safeguards to
any tooling introduced. There are likely multiple avenues for teaching
users here.

As for a technical solution:

> They need to configure a controller cluster that matches the voter set in
the cluster metadata partition.

Is there any preference for doing that vs. "force inserting" new
VotersRecord entries on the surviving controller to remove the lost voters?

> For example, partition leader epochs can decrease because of data loss in
the cluster metadata partition and Kafka brokers don't handle decreasing
partition leader epochs.

But if we consider recovering the metadata log from any _observer_,
including brokers, making sure to pick the surviving process with the
highest log offset, can this situation still happen? For a broker to
experience the decrease, wouldn't it need to have a local copy of the
record that increased the leader epoch in the first place? And in that
case, wouldn't that also be the best copy to recover the cluster from?
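
As an illustration, here is a minimal offline sketch of that comparison,
only using the public FileRecords reader from kafka-clients; it assumes the
newest __cluster_metadata-0 .log segment has been copied off each surviving
node, and the class name and paths are mine, not part of any proposal:

    // Print the last offset (and its leader epoch) present in each given
    // __cluster_metadata-0 log segment, so the copies taken from the
    // surviving nodes can be compared.
    import java.io.File;
    import org.apache.kafka.common.record.FileRecords;
    import org.apache.kafka.common.record.RecordBatch;

    public class MetadataLogEndOffset {
        public static void main(String[] args) throws Exception {
            for (String path : args) {  // newest .log segment from each node
                long lastOffset = -1L;
                int epoch = -1;
                try (FileRecords records = FileRecords.open(new File(path))) {
                    for (RecordBatch batch : records.batches()) {
                        lastOffset = batch.lastOffset();
                        epoch = batch.partitionLeaderEpoch();
                    }
                }
                System.out.println(path + ": last offset " + lastOffset
                    + ", epoch " + epoch);
            }
        }
    }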

Colin,

> It would be good to understand the scenario we're trying to solve a bit
more. I'm thinking it's something like "no controller nodes exist, and we
want to stand up a new quorum with some existing metadata"?

That would definitely cover the case we are trying to solve, although
Juha's description in the original message is closer: we are trying to
handle a "last standing controller" whose quorum peers are irrecoverably
lost. We configure the cluster with IP addresses, so I think bringing up
"fake clones" of the lost controllers would require something quite
involved for a local process to act as one of the lost controllers.

Isn't the end goal of such "process faking" essentially to produce
VotersRecord entries in the surviving controller that allow new membership
changes? It seems to me it might be preferable, and involve fewer moving
pieces, to perform that operation directly on the surviving node's metadata
log.
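
To make that concrete: if I read KIP-853 correctly, the supported way to
drop a voter is the removeRaftVoter admin call, roughly as in the sketch
below (ids, directory id and the controller address are placeholders). But
that path needs a functioning quorum leader to commit the change, which is
exactly what is missing in this scenario, hence the idea of writing the
VotersRecord locally instead.

    // Normal (non-forced) voter removal via the admin client, assuming the
    // KIP-853 Admin#removeRaftVoter API; all concrete values are placeholders.
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.common.Uuid;

    public class RemoveLostVoter {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Surviving controller's address (placeholder).
            props.put("bootstrap.controllers", "controller1:9093");
            try (Admin admin = Admin.create(props)) {
                int lostVoterId = Integer.parseInt(args[0]);
                // Directory id as recorded in the lost node's meta.properties.
                Uuid lostDirectoryId = Uuid.fromString(args[1]);
                admin.removeRaftVoter(lostVoterId, lostDirectoryId).all().get();
            }
        }
    }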

BR,
Anton

On Tue, 8 Apr 2025 at 22:27, José Armando García Sancio
<jsan...@confluent.io.invalid> wrote:

> Hi Luke and Colin,
>
> On Mon, Apr 7, 2025 at 10:29 PM Luke Chen <show...@gmail.com> wrote:
> > That's why we were discussing if there's any way to "force" recover the
> > scenario, even if it's possible to have data loss.
>
> Yes. There is a way. They need to configure a controller cluster that
> matches the voter set in the cluster metadata partition. That means a
> controller cluster that matches the node ids, directory ids, and the
> snapshot and log segments match with the consistent cluster metadata
> partition. They can do that manually today. I think that Colin is
> suggesting a tool to make this easier.
>
> The user should understand that these manual operations are extremely
> dangerous and can result in data loss in the cluster metadata
> partition. A Kafka cluster cannot recover from loss of data in the
> cluster metadata partition. For example, partition leader epochs can
> decrease because of data loss in the cluster metadata partition and
> Kafka brokers don't handle decreasing partition leader epochs.
>
> If the user doesn't understand kraft's protocol to some degree, it is
> unlikely that they can blindly follow some instruction and be
> successful in their recovery.
>
> I am hesitant to give users the impression that Kafka can tolerate and
> recover from data loss in the cluster metadata partition.
>
> What do you think?
> --
> -José
>
