Thanks Niket for your feedback. I have made changes to the KIP and
replied to your comments below.


Niket Goel wrote:
> > This UUID will be generated once and persisted as part of the quorum state 
> > for the topic partition
> Do we mean that it will be generated every time the disk on the replica is 
> primed (so every disk incarnation will have UUID). I think you describe it in 
> a section further below. Best to pull it up to here — “the replica uuid is 
> automatically generated once by the replica when persisting the quorum state 
> for the first time.”

Updated the Replica UUID section to better describe when it will be
generated and how it will be persisted.

> > If there are any pending voter change operations the leader will wait for 
> > them to finish.
> Will new requests be rejected or queued up behind the pending operation. (I 
> am assuming rejected, but want to confirm)

Either solution is correct, but I think the administrator would
prefer the operation to be held until it can be processed or it times
out.

> > When this option is used the leader of the KRaft topic partition will not 
> > allow the AddVoter RPC to add replica IDs that are not describe in the 
> > configuration and it would not allow the RemoveVoter RPC to remove replica 
> > IDs that are described in the configuration.
> Bootstrapping is a little tricky I think. Is it safer/simpler to say that the 
> any add/remove RPC operations are blocked until all nodes in the config are 
> processed? The way it is worded above makes it seem like the leader will 
> accept adds of the same node from outside. Is that the case?

Updated the last sentence of that section to the following:

The KRaft leader will not perform writes from the state machine
(active controller) or clients until it has written to the log an
AddVoterRecord for every replica id in the controller.quorum.voters
configuration.

>
> > The KRaft leader will not perform writes from the state machine (active 
> > controller) until is has written to the log an AddVoterRecord for every 
> > replica id in the controller.quorum.voters  configuration.
> Just thinking through - One of the safety requirements for the protocol is 
> for a leader to commit at least one write in an epoch before doing config 
> changes, right? In this special case we should be ok because the quorum has 
> no one but the leader in the beginning. Is that the thought process?

This should be safe because in this configuration the log will be
empty until all of the AddVoterRecords are persisted in a single
RecordBatch, and RecordBatches are atomic.
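
To illustrate the idea (schematically, not the exact record format),
the first entry in the log would be a single batch along the lines of:

    offset 0, one record batch, applied atomically:
        AddVoterRecord(replicaId=1, uuid=...)
        AddVoterRecord(replicaId=2, uuid=...)
        AddVoterRecord(replicaId=3, uuid=...)

so a replica either sees the complete bootstrapped voter set or none
of it.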

>
> > controller.quorum.bootstrap.servers vs controller.quorum.voters
> I understand the use of quorum.voters, but the bootstrap.servers is not 
> entirely clear to me. So in the example of starting the cluster with one 
> voter, will that one voter be listed here? And when using this configuration, 
> is the expectation that quorum.voters is empty, or will it eventually get 
> populated with the new quorum members?

These two configurations are mutually exclusive. The Kafka cluster is
expected to use one or the other. Kafka configuration validation will
fail if both are set. Kafka doesn't automatically update
configurations.
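
For example (host names and ports made up, and assuming my reading of
the KIP that controller.quorum.bootstrap.servers is a plain host:port
list because the replica ids are learned dynamically), a cluster
would set exactly one of:

    # static voter set
    controller.quorum.voters=1@host1:9092,2@host2:9092,3@host3:9094

or

    # only used to discover the quorum; the voter set lives in the log
    controller.quorum.bootstrap.servers=host1:9092,host2:9092,host3:9092

but never both.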

> e.g. further in the kip we say — “Replica 3 will discover the partition 
> leader using controller.quorum.voters”; so I guess it will be populated?

That example assumes that the cluster is configured to use
controller.quorum.voters: "Let's assume that the cluster is configured
to use controller.quorum.voters and the value is
1@host1:9092,2@host2:9092,3@host3:9094."

>
> > This check will be removed and replicas will reply to votes request when 
> > the candidate is not in the voter set or the voting replica is not in the 
> > voter set.
> This is a major change IMO and I think it would be good if we could somehow 
> highlight it in the KIP to aid a future reader.

Hmm. All of the ideas and changes in the Proposed Changes section are
required and important for this feature to be correct and safe. The
Leader Election section highlights this change to the Vote RPC, and
the Vote section later in the document goes into more detail.

> > This also means that the KRaft implementation needs to handle this 
> > uncommitted state getting truncated and reverted.
> Do we need to talk about the specific behavior a little more here? I mean how 
> does this affect any inflight messages with quorums moving between different 
> values. (Just a brief except to why it works)

I think that this requires going into how KafkaRaftClient is
implemented, and I don't think we should do that in the KIP; it is
better discussed during the implementation and PR review process. The
KIP and this section highlight that the implementation needs to handle
the voter set changing either because a log record was read or because
a log record was truncated.

> > This state can be discovered by a client by using the DescribeQuorum RPC, 
> > the Admin client or the kafka-quorum.sh CLI.
> The describeQuorum RPC does not respond with a list of observers today. We 
> would need a section to fix that.

The DescribeQuorum RPC does respond with the list of observers. See
the Observers field in PartitionData. In KRaft an observer is any
replica that has a replica id and uuid that is not in the voter set.

The issue that you are highlighting is that currently brokers do not
set the replica id when configuring the KRaft peer, so the fetch
requests sent to the leader don't contain a replica id and the leader
is not able to keep replica state information for those requests.
Controllers, whether they are voters or to-be-voters (observers), will
send their replica ID and UUID to the leader, so the leader will be
able to keep track of them and return them in the DescribeQuorum
response.
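
For example, once controllers register with a replica id, a sketch
like the following (using the existing Java Admin API; method names
are as I recall them, so please double check) would list both voters
and observers:

    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.QuorumInfo;

    public class DescribeQuorumExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "host1:9092"); // a broker to bootstrap from
            try (Admin admin = Admin.create(props)) {
                QuorumInfo info = admin.describeMetadataQuorum().quorumInfo().get();
                System.out.println("leader: " + info.leaderId());
                // Voters are replicas in the voter set.
                info.voters().forEach(v -> System.out.println("voter " + v.replicaId()));
                // Observers are replicas that fetch with a replica id but are
                // not in the voter set, e.g. a controller waiting to be added.
                info.observers().forEach(o -> System.out.println("observer " + o.replicaId()));
            }
        }
    }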

> > The client can now decide to add replica (3, UUID3') to the set of voters 
> > using the AddVoter RPC, the Admin client or the kafka-quorum.sh CLI.
> Trying the understand the general thought process‚ the addition of this node 
> back into the quorum will be a manually triggered operation and not something 
> the node will attempt by itself?

This KIP supports AddVoter and RemoveVoter requests coming from any
client. If a request is sent to a broker it will be forwarded to the
active controller; it can also be sent directly to the active
controller. Voters will be added and removed by an external client,
for example the Java Admin client or the kafka-metadata-quorum CLI.
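
As a rough sketch of that flow (the Admin-side method for adding a
voter is something this KIP still has to define, so the call below is
only a placeholder and not an existing API):

    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.common.Uuid;

    public class AddVoterSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // A broker endpoint works because brokers forward the request to
            // the active controller; the active controller can also be
            // contacted directly.
            props.put("bootstrap.servers", "host1:9092");
            try (Admin admin = Admin.create(props)) {
                int replicaId = 3;
                Uuid replicaUuid = Uuid.randomUuid(); // stand-in for UUID3'
                // Placeholder for the Admin call this KIP will define, e.g.
                // admin.addRaftVoter(replicaId, replicaUuid, ...);
                System.out.println("would add voter " + replicaId + " / " + replicaUuid);
            }
        }
    }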

>
> This is a general wonderment, and feel free to ignore it:
> Might be good to list some scenarios demonstrating the safety , e.g. how do 
> we ensure that there is no loss of availability during an addVoter operation 
> when the leader goes down. Or how a failed operation is safely removed from 
> the log and reverted.

Yes. I updated the Proposed Changes section to point the reader to
other documents that describe these scenarios in general. I think it
is beneficial to keep this document as concise as possible for those
familiar with those documents.

-- 
-José
