On Mon, Apr 7, 2025, at 19:29, Luke Chen wrote:
> Hi Jose and Colin,
>
> Thanks for your explanation!
>
> Yes, we all agree that a 3-node quorum can only tolerate 1 node down.
> We just want to discuss the "what if": if 2 out of 3 nodes are down at
> the same time, what can we do?
> Currently, the result is that the quorum will never form and the whole
> Kafka cluster is basically unavailable.
> That's why we were discussing whether there's any way to "force" a
> recovery in that scenario, even if it's possible to have data loss.
>
Hi Luke,

Just for the benefit of those reading this... in a real-world scenario
like this, I would expect the admin to copy the metadata to a second
controller node and bring that node up, to restore the quorum.

Maybe I'm missing some constraint here like "we can't reuse the same
hostname / port ever again" which makes this impossible. I think that
constraint would be extremely rare in practice. For example, if you use
Kubernetes, you control DNS yourself, so you can always give a new node
the same DNS address again if needed. However, it is worth considering
these unusual scenarios.

> > Perhaps something like "format with existing metadata"? If we did
> > something like that, we should probably make it a separate tool from
> > the formatting tool, and explicitly make it interactive (requires you
> > to type "YES" on the console or something), since I DON'T want the
> > folks making docker images and so on to do this.
>
> Sounds like a good idea!
>

It would be good to understand the scenario we're trying to solve a bit
more. I'm thinking it's something like "no controller nodes exist, and we
want to stand up a new quorum with some existing metadata"?

best,
Colin

> Thanks.
> Luke
>
>
>
> On Tue, Apr 8, 2025 at 5:49 AM Colin McCabe <cmcc...@apache.org> wrote:
>
>> Hi José,
>>
>> I think you make a valid point that our guarantees here are not
>> actually different from ZooKeeper. In both systems, if you lose quorum,
>> you will probably lose some data. Of course, how much data you lose
>> depends on luck. If the last node standing was the active controller /
>> ZooKeeper, then you got lucky.
>>
>> This is why a lot of people run with 5- or 7-node quorums in both
>> systems. The redundancy is useful.
>>
>> I do think it would be nice to document some way of "forcing" the
>> quorum into a specific configuration for data-loss scenarios like this.
>> This could also be used in the case where we lost 100% of the
>> controllers. The brokers have a metadata snapshot and metadata log, so
>> in an emergency you could grab the metadata from there.
>>
>> Perhaps something like "format with existing metadata"? If we did
>> something like that, we should probably make it a separate tool from
>> the formatting tool, and explicitly make it interactive (requires you
>> to type "YES" on the console or something), since I DON'T want the
>> folks making docker images and so on to do this.
>>
>> best,
>> Colin
>>
>>
>> On Mon, Apr 7, 2025, at 14:26, José Armando García Sancio wrote:
>> > Thanks Luke.
>> >
>> > On Thu, Apr 3, 2025 at 7:14 AM Luke Chen <show...@gmail.com> wrote:
>> >> In addition to the approaches you provided, maybe we can have a way
>> >> to "force" KRaft to honor the "controller.quorum.voters" config,
>> >> instead of "controller.quorum.bootstrap.servers", even if it's in
>> >> kraft.version 1.
>> >
>> > Small clarification. In KIP-595, controller.quorum.voters was playing
>> > two roles: 1) the set of voters used by KRaft during HWM calculation
>> > and leader election, and 2) the set of endpoints used by observers
>> > (brokers) to discover the leader (active controller).
>> >
>> > In KIP-853, I split that functionality. The set of voters was moved
>> > to control records (VotersRecord) in the cluster metadata partition.
>> > The bootstrap endpoints used by observers/brokers to discover the
>> > leader were moved to the controller.quorum.bootstrap.servers
>> > property.
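>> >
>> > For illustration, with static voters (kraft.version 0) the voter set
>> > is defined entirely in the configuration, something like this (the
>> > node IDs, hostnames and ports below are only placeholders):
>> >
>> >   controller.quorum.voters=1@controller-1:9093,2@controller-2:9093,3@controller-3:9093
>> >
>> > whereas with KIP-853 the voter set lives in the cluster metadata
>> > partition and the configuration only needs the discovery endpoints:
>> >
>> >   controller.quorum.bootstrap.servers=controller-1:9093,controller-2:9093,controller-3:9093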
>> >
>> > I say that because Kafka supports using both configurations at the
>> > same time. This is useful when upgrading from kraft.version 0 to
>> > kraft.version 1. The upgrade process is roughly as follows:
>> > 1. Add controller.quorum.bootstrap.servers to all of the nodes
>> > (controllers and brokers).
>> > 2. Upgrade the kraft.version from 0 to 1.
>> > 3. Monitor the "ignored-static-voters" metric and remove
>> > controller.quorum.voters when the metric is 1.
>> >
>> > Thanks,
>> > --
>> > -José
>>
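
For reference, step 2 of José's outline above would normally be done with
the features tool. Roughly something like this (the bootstrap address is
just a placeholder):

  bin/kafka-features.sh --bootstrap-server localhost:9092 \
    upgrade --feature kraft.version=1

The same tool's "describe" subcommand can be used afterwards to confirm
that kraft.version has been finalized at 1.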