[ https://issues.apache.org/jira/browse/KAFKA-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Federico Valeri updated KAFKA-17966:
------------------------------------
Description:

In KRaft, complex quorum changes are implemented as a series of single-controller changes, and it is preferable to add controllers before removing controllers. For example, to replace a controller in a three-controller cluster, adding the new controller first and then removing the failed one keeps the quorum able to tolerate one controller failure throughout the whole process. This is currently not possible, because the addition fails with a DuplicateVoterException, so you are forced to scale down first and then scale up (see the sketch below).
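For concreteness, the preferred order would look roughly like this with the kafka-metadata-quorum.sh subcommands introduced by KIP-853. This is a sketch rather than output from a real run; the controller ID and directory ID are placeholders taken from the example below, and the remove-controller flags are the ones KIP-853 documents:

{code}
# Preferred add-then-remove order. The add step is what currently fails
# with DuplicateVoterException, as shown at the end of this report.
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
    --command-config /opt/kafka/server2/config/server.properties \
    add-controller
# Only after the replacement is a voter, remove the failed incarnation
# by its old directory ID (placeholder value from the output below):
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
    remove-controller --controller-id 2 \
    --controller-directory-id slcsM5ZAR0SMIF_u__MAeg
{code}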
Example:

The operator replaces a failed disk with a new one. The replacement disk must be formatted with a new directory ID:

{code}
$ CLUSTER_ID="$(bin/kafka-cluster.sh cluster-id --bootstrap-server localhost:9092 | awk -F': ' '{print $2}')"
$ bin/kafka-storage.sh format \
    --config /opt/kafka/server2/config/server.properties \
    --cluster-id "$CLUSTER_ID" \
    --no-initial-controllers \
    --ignore-formatted
Formatting metadata directory /opt/kafka/server2/metadata with metadata.version 3.9-IV0.
{code}
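The directory ID assigned by the format step can be read back from meta.properties in the metadata directory shown above. A sketch, assuming the directory.id field that KIP-853 formatting writes; the value shown is the new incarnation's ID from the quorum output below:

{code}
# The formatted log directory records its ID in meta.properties; this value
# should match the new incarnation's DirectoryId in the quorum output below.
$ grep directory.id /opt/kafka/server2/metadata/meta.properties
directory.id=wrqMDI1WDsqaooVSOtlgYw
{code}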
After restarting the controller, the quorum has two nodes with ID 2: the original incarnation, with the failed disk, an ever-growing lag, and Follower status, plus the new incarnation, with a different directory ID and Observer status:

{code}
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 describe --replication --human-readable
NodeId  DirectoryId             LogEndOffset  Lag  LastFetchTimestamp  LastCaughtUpTimestamp  Status
0       pbvuBlaTTwKRxS5NLJwRFQ  535           0    6 ms ago            6 ms ago               Leader
1       QjRpFtVDTtCa8OLXiSbmmA  535           0    283 ms ago          283 ms ago             Follower
2       slcsM5ZAR0SMIF_u__MAeg  407           128  63307 ms ago        63802 ms ago           Follower
2       wrqMDI1WDsqaooVSOtlgYw  535           0    281 ms ago          281 ms ago             Observer
8       aXLz3ixjqzXhCYqKHRD4WQ  535           0    284 ms ago          284 ms ago             Observer
7       KCriHQZm3TlxvEVNgyWKJw  535           0    284 ms ago          284 ms ago             Observer
9       v5nnIwK8r0XqjyqlIPW-aw  535           0    284 ms ago          284 ms ago             Observer
{code}

Once the new controller is in sync with the leader, we try to scale up by adding it as a voter:

{code}
$ bin/kafka-metadata-quorum.sh \
    --bootstrap-controller localhost:8000 \
    --command-config /opt/kafka/server2/config/server.properties \
    add-controller
org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
	at org.apache.kafka.tools.MetadataQuorumCommand.handleAddController(MetadataQuorumCommand.java:431)
	at org.apache.kafka.tools.MetadataQuorumCommand.execute(MetadataQuorumCommand.java:147)
	at org.apache.kafka.tools.MetadataQuorumCommand.mainNoExit(MetadataQuorumCommand.java:81)
	at org.apache.kafka.tools.MetadataQuorumCommand.main(MetadataQuorumCommand.java:76)
Caused by: org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
{code}
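Until this is fixed, the only sequence that works is the reverse one, which shrinks the quorum to two voters in between and so cannot tolerate any controller failure during the change. A sketch under the same assumptions as above, using the failed incarnation's directory ID from the quorum output:

{code}
# Forced workaround: scale down first, then scale up. Between the two steps
# a three-controller quorum has only two voters and no failure tolerance.
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
    remove-controller --controller-id 2 \
    --controller-directory-id slcsM5ZAR0SMIF_u__MAeg
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
    --command-config /opt/kafka/server2/config/server.properties \
    add-controller
{code}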
> Controller replacement does not support scaling up before scaling down
> ----------------------------------------------------------------------
>
>                 Key: KAFKA-17966
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17966
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.9.0
>            Reporter: Federico Valeri
>            Priority: Major