[ https://issues.apache.org/jira/browse/KAFKA-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Federico Valeri updated KAFKA-17966:
------------------------------------
Description:

In KRaft, complex quorum changes are implemented as a series of single-controller changes, and it is preferable to add controllers before removing controllers. For example, to replace a controller in a three-controller cluster, adding the new controller first and then removing the failed one keeps the quorum able to tolerate one controller failure throughout the whole process. This is currently not possible, because the addition fails with a DuplicateVoterException, so you are forced to scale down first and then scale up (see the sketch below).
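For concreteness, the preferred order would look roughly like this with the kafka-metadata-quorum.sh subcommands introduced by KIP-853. This is a sketch rather than output from a real run; the controller ID and directory ID are placeholders taken from the example below, and the remove-controller flags are the ones KIP-853 documents:

{code}
# Preferred add-then-remove order. The add step is what currently fails
# with DuplicateVoterException, as shown at the end of this report.
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
    --command-config /opt/kafka/server2/config/server.properties \
    add-controller
# Only after the replacement is a voter, remove the failed incarnation
# by its old directory ID (placeholder value from the output below):
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
    remove-controller --controller-id 2 \
    --controller-directory-id slcsM5ZAR0SMIF_u__MAeg
{code}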
Example:

The operator replaces a failed disk with a new one. The replacement disk must be formatted with a new directory ID:

{code}
$ CLUSTER_ID="$(bin/kafka-cluster.sh cluster-id --bootstrap-server localhost:9092 | awk -F': ' '{print $2}')"
$ bin/kafka-storage.sh format \
    --config /opt/kafka/server2/config/server.properties \
    --cluster-id "$CLUSTER_ID" \
    --no-initial-controllers \
    --ignore-formatted
Formatting metadata directory /opt/kafka/server2/metadata with metadata.version 3.9-IV0.
{code}
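The directory ID assigned by the format step can be read back from meta.properties in the metadata directory shown above. A sketch, assuming the directory.id field that KIP-853 formatting writes; the value shown is the new incarnation's ID from the quorum output below:

{code}
# The formatted log directory records its ID in meta.properties; this value
# should match the new incarnation's DirectoryId in the quorum output below.
$ grep directory.id /opt/kafka/server2/metadata/meta.properties
directory.id=wrqMDI1WDsqaooVSOtlgYw
{code}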
After restarting the controller, the quorum has two nodes with ID 2: the original incarnation, with the failed disk, an ever-growing lag, and Follower status, plus the new incarnation, with a different directory ID and Observer status:

{code}
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 describe --replication --human-readable
NodeId  DirectoryId             LogEndOffset  Lag  LastFetchTimestamp  LastCaughtUpTimestamp  Status
0       pbvuBlaTTwKRxS5NLJwRFQ  535           0    6 ms ago            6 ms ago               Leader
1       QjRpFtVDTtCa8OLXiSbmmA  535           0    283 ms ago          283 ms ago             Follower
2       slcsM5ZAR0SMIF_u__MAeg  407           128  63307 ms ago        63802 ms ago           Follower
2       wrqMDI1WDsqaooVSOtlgYw  535           0    281 ms ago          281 ms ago             Observer
8       aXLz3ixjqzXhCYqKHRD4WQ  535           0    284 ms ago          284 ms ago             Observer
7       KCriHQZm3TlxvEVNgyWKJw  535           0    284 ms ago          284 ms ago             Observer
9       v5nnIwK8r0XqjyqlIPW-aw  535           0    284 ms ago          284 ms ago             Observer
{code}

Once the new controller is in sync with the leader, we try to scale up by adding it as a voter:

{code}
$ bin/kafka-metadata-quorum.sh \
    --bootstrap-controller localhost:8000 \
    --command-config /opt/kafka/server2/config/server.properties \
    add-controller
org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
	at org.apache.kafka.tools.MetadataQuorumCommand.handleAddController(MetadataQuorumCommand.java:431)
	at org.apache.kafka.tools.MetadataQuorumCommand.execute(MetadataQuorumCommand.java:147)
	at org.apache.kafka.tools.MetadataQuorumCommand.mainNoExit(MetadataQuorumCommand.java:81)
	at org.apache.kafka.tools.MetadataQuorumCommand.main(MetadataQuorumCommand.java:76)
Caused by: org.apache.kafka.common.errors.DuplicateVoterException: The voter id for ReplicaKey(id=2, directoryId=Optional[u7e_mCmg0VAIz0zuAOcraA]) is already part of the set of voters [ReplicaKey(id=0, directoryId=Optional[PbEthh6mR8iVNizvUTUVFw]), ReplicaKey(id=1, directoryId=Optional[kIpbbU79QaCIIiOLOyCjJg]), ReplicaKey(id=2, directoryId=Optional[2ab0gajpS5aUf5d-2Jw02w])].
{code}
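Until this is fixed, the only sequence that works is the reverse one, which shrinks the quorum to two voters in between and so cannot tolerate any controller failure during the change. A sketch under the same assumptions as above, using the failed incarnation's directory ID from the quorum output:

{code}
# Forced workaround: scale down first, then scale up. Between the two steps
# a three-controller quorum has only two voters and no failure tolerance.
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
    remove-controller --controller-id 2 \
    --controller-directory-id slcsM5ZAR0SMIF_u__MAeg
$ bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:8000 \
    --command-config /opt/kafka/server2/config/server.properties \
    add-controller
{code}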
> Controller replacement does not support scaling up before scaling down
> ----------------------------------------------------------------------
>
>                 Key: KAFKA-17966
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17966
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.9.0
>            Reporter: Federico Valeri
>            Priority: Major