Hi Jun,

See my comments below.

On Tue, Mar 5, 2024 at 2:57 PM Jun Rao <j...@confluent.io.invalid> wrote:
> 37. We don't batch multiple topic partitions in AddVoter, RemoveVoter and
> UpdateVoter requests while other requests like Vote and BeginQuorumEpoch
> support batching. Should we make them consistent?

Originally I had them as batched RPCs but decided to make them operate
on only one KRaft topic partition. I made this change primarily
because batching here would be false sharing: topic partitions in
KRaft are independent, and those operations would be handled and
committed independently, but because of the batching the KRaft node
would be required to wait on all of the batched operations before it
could send the response.

I know that this is inconsistent with the other RPCs, but I am
hesitant to propagate this incorrect batching to new RPCs.
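
To make the difference concrete, here is a rough sketch of the two
request shapes (the field names are illustrative and not the exact
schemas in the KIP). A batched request like Vote carries arrays of
topics and partitions, so the node has to finish every entry before
it can send the one response:

  { "Topics": [
      { "TopicName": "__cluster_metadata",
        "Partitions": [ { "PartitionIndex": 0 } ] } ] }

An unbatched request like AddVoter names a single topic partition, so
it can be handled and committed on its own:

  { "TopicName": "__cluster_metadata", "PartitionIndex": 0 }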

> 38. BeginQuorumEpochRequest: It seems that we need to replace the name
> field with a nodeId field in LeaderEndpoints?

The schema looks good to me. This struct has a different meaning from
the similarly named struct in the responses of other RPCs. The
BeginQuorumEpoch request is sent by the leader, so the expectation is
that the sending node/replica is the leader for all of the partitions
in the request. That also means the endpoints in LeaderEndpoints all
belong to the same leader (the sending replica). BeginQuorumEpoch
sends multiple endpoints because the leader may be bound to multiple
listeners; it sends a (listener name, host name, host port) tuple for
each of its own advertised controller listeners.
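
As a rough sketch (the types and "about" text are paraphrased, not
the exact schema in the KIP), the entries are keyed by listener name
rather than by node id because they all describe the one sending
leader:

  { "name": "LeaderEndpoints", "type": "[]LeaderEndpoint",
    "about": "The endpoints of the sending leader, one per advertised controller listener.",
    "fields": [
      { "name": "Name", "type": "string", "mapKey": true,
        "about": "The name of the listener." },
      { "name": "HostName", "type": "string",
        "about": "The host name of the endpoint." },
      { "name": "Port", "type": "uint16",
        "about": "The port of the endpoint." }
    ] }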

> 39. VoteRequest: Will the Voter ever be different from the Candidate? I
> thought that in all the VoteRequests, the voter just votes for itself.

Yes. This naming always confuses me at first, but it makes sense
after a closer look. The voter (VoterId and VoterUuid) is the replica
receiving the Vote request and potentially voting for the sender (the
candidate). The candidate (CandidateId and CandidateUuid) is the
replica sending the Vote request and asking the receivers (the
voters) for their votes. I tried to document this better in their
respective "about" schema fields.

> 40. EndQuorumEpochRequest: Should we add a replicaUUID field to pair with
> LeaderId?

I don't think so. We could add it for consistency and to help with
debugging, but I don't think it is needed for correctness. A leader
cannot (should not) change its replica uuid and remain leader. In
theory, the only way for a replica to change its uuid is to lose its
disk, and if that happens the expectation is that it will also lose
its QuorumStateData.

> 41. Regarding including replica UUID to identify a voter: It adds a bit of
> complexity. Could you explain whether it is truly needed? Before this KIP,
> KRaft already supports replacing a disk on the voter node, right?

Yes. This KIP attempts to solve two general problems: 1) how to
proactively change the voters set, by increasing or decreasing the
replication factor or by replacing voters in the voters set, and
2) how to identify disk failures and recover from them safely. This
is what I have in the Motivation section:
"Consensus on the cluster metadata partition was achieved by the
voters (controllers). If the operator of a KRaft cluster wanted to
make changes to the set of voters, they would have to shutdown all of
the controllers nodes and manually make changes to the on-disk state
of the old controllers and new controllers. If the operator wanted to
replace an existing voter because of a disk failure or general
hardware failure, they would have to make sure that the new voter node
has a superset of the previous voter's on-disk state. Both of these
solutions are manual and error prone."

directory.id (replica uuid) is needed to identify and resolve disk
failures in a voter. The section "Proposed change / User explanation /
Common scenarios / Disk failure recovery" documents this use case in
more detail.

Thanks,
-- 
-José
