[
https://issues.apache.org/jira/browse/KAFKA-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898694#comment-17898694
]
José Armando García Sancio commented on KAFKA-17966:
----------------------------------------------------
{quote}For example, to replace a controller in a three-controller cluster,
adding one controller and then removing the other allows the system to handle
one controller failure at all times throughout the whole process.
{quote}
Hi [~fvaleri], thanks for the feature request.
The premise of the feature request is not entirely correct. The example you used
has 3 voters and has already suffered a disk failure, so the KRaft cluster has
already experienced the one failure it can tolerate. Any further failure of
another voter would cause unavailability (and possibly even data loss). If you
want to configure a cluster that can tolerate more than one failure, you need at
least 5 voters no matter what.
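To make the arithmetic explicit, a majority quorum of N voters tolerates
floor((N - 1) / 2) voter failures; the snippet below is just that formula worked
out for a few cluster sizes:
{code}
# Majority-quorum fault tolerance: floor((N - 1) / 2) failures tolerated.
$ for n in 3 5 7; do echo "$n voters tolerate $(( (n - 1) / 2 )) failure(s)"; done
3 voters tolerate 1 failure(s)
5 voters tolerate 2 failure(s)
7 voters tolerate 3 failure(s)
{code}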
In the example provided, which can only tolerate one failure, you need to first
remove the replica with the failed disk and then add a new replica with a new
disk. That is the procedure if you want to reuse the replica id; alternatively,
you can create a new node with a different replica id.
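As a rough sketch of that order of operations, reusing the ids from the describe
output quoted below (this assumes the KIP-853 remove-controller and
add-controller options shipped in 3.9; substitute the endpoint, controller id,
and directory id from your own cluster):
{code}
# Remove the voter whose disk failed (old directory id), then add the
# reformatted replica back; add-controller reads the node id and the new
# directory id from the given controller config.
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
  remove-controller --controller-id 2 --controller-directory-id slcsM5ZAR0SMIF_u__MAeg
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
  --command-config /opt/kafka/server2/config/server.properties \
  add-controller
{code}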
There are a few reasons for this restriction. One is to allow us to easily and
safely implement [automatic controller
addition|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217391519#KIP853:KRaftControllerMembershipChanges-Automaticjoiningcontrollers].
Another is that Kafka's network layer and protocol resolve node endpoints using
the node id.
> Controller replacement does not support scaling up before scaling down
> ----------------------------------------------------------------------
>
> Key: KAFKA-17966
> URL: https://issues.apache.org/jira/browse/KAFKA-17966
> Project: Kafka
> Issue Type: New Feature
> Components: kraft
> Affects Versions: 3.9.0
> Reporter: Federico Valeri
> Priority: Major
>
> In KRaft, complex quorum changes are implemented as a series of
> single-controller changes. In this case, it is preferable to add controllers
> before removing controllers. For example, to replace a controller in a
> three-controller cluster, adding one controller and then removing the other
> allows the system to handle one controller failure at all times throughout
> the whole process. This is currently not possible, as it leads to
> DuplicateVoterException, so you are forced to do a scale down, followed by a
> scale up.
> Example:
> The operator can replace a failed disk with a new one. The new disk needs to be
> formatted, which assigns it a new directory ID.
> {code}
> $ CLUSTER_ID="$(bin/kafka-cluster.sh cluster-id --bootstrap-server localhost:9092 | awk -F': ' '{print $2}')"
> $ bin/kafka-storage.sh format \
> --config /opt/kafka/server2/config/server.properties \
> --cluster-id "$CLUSTER_ID" \
> --no-initial-controllers \
> --ignore-formatted
> Formatting metadata directory /opt/kafka/server2/metadata with metadata.version 3.9-IV0.
> {code}
> After restarting the controller, the quorum description will show two entries
> with ID 2: the original incarnation, with the failed disk, an ever growing lag,
> and Follower status, plus the new one, with a different directory ID and
> Observer status.
> {code}
> $ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 describe --replication --human-readable
> NodeId  DirectoryId             LogEndOffset  Lag  LastFetchTimestamp  LastCaughtUpTimestamp  Status
> 0       pbvuBlaTTwKRxS5NLJwRFQ  535           0    6 ms ago            6 ms ago               Leader
> 1       QjRpFtVDTtCa8OLXiSbmmA  535           0    283 ms ago          283 ms ago             Follower
> 2       slcsM5ZAR0SMIF_u__MAeg  407           128  63307 ms ago        63802 ms ago           Follower
> 2       wrqMDI1WDsqaooVSOtlgYw  535           0    281 ms ago          281 ms ago             Observer
> 8       aXLz3ixjqzXhCYqKHRD4WQ  535           0    284 ms ago          284 ms ago             Observer
> 7       KCriHQZm3TlxvEVNgyWKJw  535           0    284 ms ago          284 ms ago             Observer
> 9       v5nnIwK8r0XqjyqlIPW-aw  535           0    284 ms ago          284 ms ago             Observer
> {code}
> Once the new controller is in sync with the leader, we try to do a scale up.
> {code}
> $ bin/kafka-metadata-quorum.sh \
> --bootstrap-controller localhost:8000 \
> --command-config /opt/kafka/server2/config/server.properties \
> add-controller
> org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
>     at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
>     at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
>     at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
>     at org.apache.kafka.tools.MetadataQuorumCommand.handleAddController(MetadataQuorumCommand.java:431)
>     at org.apache.kafka.tools.MetadataQuorumCommand.execute(MetadataQuorumCommand.java:147)
>     at org.apache.kafka.tools.MetadataQuorumCommand.mainNoExit(MetadataQuorumCommand.java:81)
>     at org.apache.kafka.tools.MetadataQuorumCommand.main(MetadataQuorumCommand.java:76)
> Caused by: org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)