[
https://issues.apache.org/jira/browse/KAFKA-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898694#comment-17898694
]
José Armando García Sancio commented on KAFKA-17966:
----------------------------------------------------
{quote}For example, to replace a controller in a three-controller cluster,
adding one controller and then removing the other allows the system to handle
one controller failure at all times throughout the whole process.
{quote}
Hi [~fvaleri], thanks for the feature request.
The premise of the feature request is not entirely correct. The example you used
has 3 voters and has already suffered a disk failure, so the KRaft cluster has
already experienced the one failure it can tolerate. Any further failure of
another voter would cause unavailability (and possibly even data loss). If you
want to configure a cluster that can tolerate more than one failure, you need at
least 5 voters no matter what.
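To make the arithmetic explicit, a majority quorum of N voters tolerates
floor((N - 1) / 2) voter failures; the snippet below is just that formula worked
out for a few cluster sizes:
{code}
# Majority-quorum fault tolerance: floor((N - 1) / 2) failures tolerated.
$ for n in 3 5 7; do echo "$n voters tolerate $(( (n - 1) / 2 )) failure(s)"; done
3 voters tolerate 1 failure(s)
5 voters tolerate 2 failure(s)
7 voters tolerate 3 failure(s)
{code}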
In the example provided, which can only tolerate one failure, you need to first
remove the replica with the failed disk and then add a new replica with a new
disk. That is the procedure if you want to reuse the replica id; alternatively,
you can create a new node with a different replica id.
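As a rough sketch of that order of operations, reusing the ids from the describe
output quoted below (this assumes the KIP-853 remove-controller and
add-controller options shipped in 3.9; substitute the endpoint, controller id,
and directory id from your own cluster):
{code}
# Remove the voter whose disk failed (old directory id), then add the
# reformatted replica back; add-controller reads the node id and the new
# directory id from the given controller config.
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
  remove-controller --controller-id 2 --controller-directory-id slcsM5ZAR0SMIF_u__MAeg
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
  --command-config /opt/kafka/server2/config/server.properties \
  add-controller
{code}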
There are a few reasons for this restriction. One is to allow us to easily and
safely implement [automatic controller
addition|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217391519#KIP853:KRaftControllerMembershipChanges-Automaticjoiningcontrollers].
Another is that Kafka's network layer and protocol resolve node endpoints using
the node id.
> Controller replacement does not support scaling up before scaling down
> ----------------------------------------------------------------------
>
> Key: KAFKA-17966
> URL: https://issues.apache.org/jira/browse/KAFKA-17966
> Project: Kafka
> Issue Type: New Feature
> Components: kraft
> Affects Versions: 3.9.0
> Reporter: Federico Valeri
> Priority: Major
>
> In KRaft, complex quorum changes are implemented as a series of
> single-controller changes. In this case, it is preferable to add controllers
> before removing controllers. For example, to replace a controller in a
> three-controller cluster, adding one controller and then removing the other
> allows the system to handle one controller failure at all times throughout
> the whole process. This is currently not possible, as it leads to
> DuplicateVoterException, so you are forced to do a scale down, followed by a
> scale up.
> Example:
> The operator can replace a failed disk with a new one. The new disk needs to be
> formatted, which assigns it a new directory ID.
> {code}
> $ CLUSTER_ID="$(bin/kafka-cluster.sh cluster-id --bootstrap-server localhost:9092 | awk -F': ' '{print $2}')"
> $ bin/kafka-storage.sh format \
> --config /opt/kafka/server2/config/server.properties \
> --cluster-id "$CLUSTER_ID" \
> --no-initial-controllers \
> --ignore-formatted
> Formatting metadata directory /opt/kafka/server2/metadata with metadata.version 3.9-IV0.
> {code}
> After restarting the controller, the quorum description will show two entries
> with ID 2: the original incarnation, with the failed disk, an ever growing lag,
> and Follower status, plus the new one, with a different directory ID and
> Observer status.
> {code}
> $ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 describe --replication --human-readable
> NodeId  DirectoryId             LogEndOffset  Lag  LastFetchTimestamp  LastCaughtUpTimestamp  Status
> 0       pbvuBlaTTwKRxS5NLJwRFQ  535           0    6 ms ago            6 ms ago               Leader
> 1       QjRpFtVDTtCa8OLXiSbmmA  535           0    283 ms ago          283 ms ago             Follower
> 2       slcsM5ZAR0SMIF_u__MAeg  407           128  63307 ms ago        63802 ms ago           Follower
> 2       wrqMDI1WDsqaooVSOtlgYw  535           0    281 ms ago          281 ms ago             Observer
> 8       aXLz3ixjqzXhCYqKHRD4WQ  535           0    284 ms ago          284 ms ago             Observer
> 7       KCriHQZm3TlxvEVNgyWKJw  535           0    284 ms ago          284 ms ago             Observer
> 9       v5nnIwK8r0XqjyqlIPW-aw  535           0    284 ms ago          284 ms ago             Observer
> {code}
> Once the new controller is in sync with the leader, we try to do a scale up.
> {code}
> $ bin/kafka-metadata-quorum.sh \
> --bootstrap-controller localhost:8000 \
> --command-config /opt/kafka/server2/config/server.properties \
> add-controller
> org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
>     at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
>     at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
>     at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
>     at org.apache.kafka.tools.MetadataQuorumCommand.handleAddController(MetadataQuorumCommand.java:431)
>     at org.apache.kafka.tools.MetadataQuorumCommand.execute(MetadataQuorumCommand.java:147)
>     at org.apache.kafka.tools.MetadataQuorumCommand.mainNoExit(MetadataQuorumCommand.java:81)
>     at org.apache.kafka.tools.MetadataQuorumCommand.main(MetadataQuorumCommand.java:76)
> Caused by: org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)