Hi all,

Thanks Juha for bringing this discussion here. To everyone else: I am Juha's colleague at Aiven and am currently working on introducing the kind of tooling discussed in this thread, to be used in worst-case scenarios. I have a proof of concept working. The various inputs and concerns raised here are very valuable. I'm hopeful this discussion can lead to a path forward for introducing some kind of emergency tooling for this scenario into Apache Kafka itself.
José, on this point:

> I am hesitant to give users the impression that Kafka can tolerate and
> recover from data loss in the cluster metadata partition.

That's a very fair point, but I believe it can feed into the naming decisions and documentation of this feature, as well as into adding safeguards to any tooling introduced. There are likely multiple avenues for teaching users here.

As for a technical solution:

> They need to configure a controller cluster that matches the voter set
> in the cluster metadata partition.

Is there any preference for doing that over "force inserting" new VotersRecords on the surviving controller to remove the lost voters? (For concreteness, I've appended a rough sketch of the manual route at the bottom of this message.)

> For example, partition leader epochs can decrease because of data loss
> in the cluster metadata partition and Kafka brokers don't handle
> decreasing partition leader epochs.

But if we consider recovering the metadata log from any _observer_, including brokers, making sure to pick the surviving process with the highest log offset, can this situation still happen? For a broker to experience the decrease, wouldn't it need to have a local copy of the log record that increased the epoch? And in that case, wouldn't that also be the best copy to recover the cluster from?

Colin,

> It would be good to understand the scenario we're trying to solve a bit
> more. I'm thinking it's something like "no controller nodes exist, and
> we want to stand up a new quorum with some existing metadata"?

That would definitely cover the case we are trying to solve, although Juha's description in the original message is closer: we are trying to solve for a "last standing controller" when the rest of the quorum is irrecoverably lost. We configure the cluster with IP addresses, so bringing up "fake clones" of the lost controllers would require something quite involved to make a local process act as a lost controller. Isn't the end goal of such "process faking" essentially to produce VotersRecord entries on the surviving controller that allow new membership changes? It seems to me it might be preferable, with fewer moving pieces, to make that change directly on the surviving node's metadata log.

BR,
Anton

On Tue, Apr 8, 2025 at 22:27, José Armando García Sancio <jsan...@confluent.io.invalid> wrote:

> Hi Luke and Colin,
>
> On Mon, Apr 7, 2025 at 10:29 PM Luke Chen <show...@gmail.com> wrote:
> > That's why we were discussing if there's any way to "force" recover the
> > scenario, even if it's possible to have data loss.
>
> Yes. There is a way. They need to configure a controller cluster that
> matches the voter set in the cluster metadata partition. That means a
> controller cluster that matches the node ids, directory ids, and the
> snapshot and log segments match with the consistent cluster metadata
> partition. They can do that manually today. I think that Colin is
> suggesting a tool to make this easier.
>
> The user should understand that these manual operations are extremely
> dangerous and can result in data loss in the cluster metadata
> partition. A Kafka cluster cannot recover from loss of data in the
> cluster metadata partition. For example, partition leader epochs can
> decrease because of data loss in the cluster metadata partition and
> Kafka brokers don't handle decreasing partition leader epochs.
>
> If the user doesn't understand kraft's protocol to some degree, it is
> unlikely that they can blindly follow some instruction and be
> successful in their recovery.
>
> I am hesitant to give users the impression that Kafka can tolerate and
> recover from data loss in the cluster metadata partition.
>
> What do you think?
> --
> -José
>
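
Appendix: for concreteness, a rough sketch of the manual route José describes, using the standard tooling that ships with Kafka. The node ids, host names and paths below are made up, and the exact records and output printed vary between Kafka versions:

  # On each surviving controller/broker, inspect its local copy of the
  # cluster metadata partition and compare how far each process got;
  # the copy with the highest offsets is the recovery source.
  bin/kafka-dump-log.sh --cluster-metadata-decoder \
      --files /var/lib/kafka/__cluster_metadata-0/00000000000000000000.log

  # Stand up replacement controllers whose node ids match the voter set in
  # that copy, seed each of them with the surviving snapshot and log
  # segments, and point brokers and controllers at the rebuilt quorum,
  # e.g. for a statically configured quorum:
  controller.quorum.voters=3000@controller1:9093,3001@controller2:9093,3002@controller3:9093
  # (for a dynamic quorum the directory ids would also have to match, per
  # José's point, and the controllers would instead be discovered via
  # controller.quorum.bootstrap.servers)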