Hi, Jose,

Thanks for the reply.
20.1. It seems that the PreferredSuccessors field didn't get fixed. It's
still there together with PreferredCandidates.

+ { "name": "PreferredSuccessors", "type": "[]int32", "versions": "0",
+   "about": "A sorted list of preferred successors to start the election" },
+ { "name": "PreferredCandidates", "type": "[]ReplicaInfo", "versions": "1+",
+   "about": "A sorted list of preferred candidates to start the election", "fields": [

37. If we don't support batching in AddVoterResponse, RemoveVoterResponse
and UpdateVoterResponse, should we combine CurrentLeader and NodeEndpoint
into a single field?

42. We include VoterId and VoterUuid for the receiver in Vote and
BeginQuorumEpoch requests, but not in EndQuorumEpoch, Fetch and
FetchSnapshot. Could you explain how they are used?
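
To make 37 a bit more concrete, I had something like the sketch below in
mind. The struct name, field names and versions are only placeholders to
illustrate the idea, not a concrete schema proposal:

  { "name": "CurrentLeader", "type": "LeaderIdAndEndpoint", "versions": "0+",
    "about": "The current leader, if known, and its endpoint", "fields": [
    { "name": "LeaderId", "type": "int32", "versions": "0+",
      "about": "The replica id of the current leader, or -1 if unknown" },
    { "name": "LeaderEpoch", "type": "int32", "versions": "0+",
      "about": "The latest known leader epoch" },
    { "name": "Host", "type": "string", "versions": "0+",
      "about": "The hostname of the leader's endpoint" },
    { "name": "Port", "type": "uint16", "versions": "0+",
      "about": "The port of the leader's endpoint" }
  ]}

That way the leader id, epoch and endpoint that a client needs in order to
retry against the current leader are all in one place.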

Jun

On Wed, Mar 6, 2024 at 8:53 AM José Armando García Sancio
<jsan...@confluent.io.invalid> wrote:

> Hi Jun,
>
> See my comments below.
>
> On Tue, Mar 5, 2024 at 2:57 PM Jun Rao <j...@confluent.io.invalid> wrote:
> > 37. We don't batch multiple topic partitions in AddVoter, RemoveVoter and
> > UpdateVoter requests while other requests like Vote and BeginQuorumEpoch
> > support batching. Should we make them consistent?
>
> Originally I had them as batched RPCs but decided to make them only
> operate on one KRaft topic partition. I made this change primarily
> because batching them together would be false sharing. Topic partitions
> in KRaft are independent; those operations will be handled independently
> and committed independently, but because of the batching the KRaft node
> will be required to wait on all the batched operations before it can
> send the response.
>
> I know that this is inconsistent with the other RPCs but I am hesitant
> to propagate this incorrect batching to new RPCs.
>
> > 38. BeginQuorumEpochRequest: It seems that we need to replace the name
> > field with a nodeId field in LeaderEndpoints?
>
> The schema looks good to me. This struct has a different meaning from
> the struct in the response of other RPCs. The BeginQuorumEpoch request
> is sent by the leader, so the expectation is that the sending
> node/replica is the leader for all of the partitions sent. This also
> means that the endpoints sent in LeaderEndpoints are all for the same
> leader (or replica). The reason that BeginQuorumEpoch sends multiple
> endpoints is that the leader may be binding to multiple listeners.
> The leader sends a tuple (listener name, host name, host port) for
> each of its own advertised controller listeners.
>
> > 39. VoteRequest: Will the Voter ever be different from the Candidate? I
> > thought that in all the VoteRequests, the voter just votes for itself.
>
> Yes. This is a naming that always confuses me but makes sense to me
> after further analysis. The voter (VoterId and VoterUuid) is the
> replica receiving the Vote request and potentially voting for the
> sender (candidate). The candidate (CandidateId and CandidateUuid) is
> the replica sending the Vote request and asking for votes from the
> receivers (voters). I tried to better document it in their respective
> "about" schema fields.
>
> > 40. EndQuorumEpochRequest: Should we add a replicaUUID field to pair with
> > LeaderId?
>
> I don't think so. We can add it for consistency and to help debugging,
> but I don't think it is needed for correctness. A leader cannot
> (should not) change replica uuid and remain leader. In theory the only
> way for a replica to change uuid is to lose their disk. If this
> happens, the expectation is that they will also lose their
> QuorumStateData.
>
> > 41. Regarding including replica UUID to identify a voter: It adds a bit of
> > complexity. Could you explain whether it is truly needed? Before this KIP,
> > KRaft already supports replacing a disk on the voter node, right?
>
> Yes. This KIP attempts to solve two general problems. 1) How to
> proactively change the voters set by increasing or decreasing the
> replication factor, or replace voters in the voters set. 2) How to
> identify disk failures and recover from them safely. This is what I
> have in the Motivation section:
> "Consensus on the cluster metadata partition was achieved by the
> voters (controllers). If the operator of a KRaft cluster wanted to
> make changes to the set of voters, they would have to shut down all of
> the controller nodes and manually make changes to the on-disk state
> of the old controllers and new controllers. If the operator wanted to
> replace an existing voter because of a disk failure or general
> hardware failure, they would have to make sure that the new voter node
> has a superset of the previous voter's on-disk state. Both of these
> solutions are manual and error prone."
>
> directory.id (replica uuid) is needed to identify and resolve disk
> failures in a voter. The section "Proposed change / User explanation /
> Common scenarios / Disk failure recovery" documents this use case in
> more detail.
>
> Thanks,
> --
> -José
>
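
P.S. On 42: for the Vote request specifically, my reading of the fields,
based on your explanation of 39 in the quoted message, is roughly the
sketch below. The types, versions and "about" text are my own paraphrase
for illustration, not the exact schema in the KIP:

  { "name": "CandidateId", "type": "int32", "versions": "0+",
    "about": "The id of the replica sending the request and asking for votes" },
  { "name": "CandidateUuid", "type": "uuid", "versions": "1+",
    "about": "The directory id (replica uuid) of the sending replica" },
  { "name": "VoterId", "type": "int32", "versions": "1+",
    "about": "The id of the replica receiving the request and potentially voting" },
  { "name": "VoterUuid", "type": "uuid", "versions": "1+",
    "about": "The directory id (replica uuid) of the receiving replica" }

In other words, the candidate names the specific voter it is asking for a
vote, by both replica id and directory id.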