[ 
https://issues.apache.org/jira/browse/KAFKA-19867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035826#comment-18035826
 ] 

Paolo Patierno edited comment on KAFKA-19867 at 11/6/25 1:09 PM:
-----------------------------------------------------------------

[~mimaison] [~showuon] 

After our offline discussion I was trying to apply the check on the 
{{canBecomeVoter}} flag within the {{KafkaRaftClient.pollFollower()}} method to 
avoid calling the {{pollFollowerAsVoter}} at all, so something like this:


{code:java}
private long pollFollower(long currentTimeMs) {
    FollowerState state = quorum.followerStateOrThrow();
    if (quorum.isVoter() && canBecomeVoter) {
        return pollFollowerAsVoter(state, currentTimeMs);
    } else {
        return pollFollowerAsObserver(state, currentTimeMs);
    }
}
{code}

In this case, this change breaks a bunch of tests where a voter is expecting to 
move in the {{Prospective}} state in order to take part at a new leader 
election.
The tests are the following:


{code:java}
testHandleEndQuorumRequest
testVoterBecomeProspectiveAfterFetchTimeout
testHandleVoteRequestAsProspective
testFetchResponseIgnoredAfterBecomingProspective
testHandleEndQuorumRequestWithLowerPriorityToBecomeLeader
{code}


Taking the first one as an example. It starts with the local node being part of 
voters and another node being the leader. The leader sends an 
{{EndQuorumRequest}} to step back being the leader and a new election should 
start so the local node is expected to move into {{Prospective}} state. But it 
happens because the voter sends a fetch request, doesn't get an answer (from 
leader) and based on the timeout it moves to {{Prospective}}. This happens 
within the {{pollFollowerAsVoter}} that is not called at all given that 
{{canBecomeVoter}} is false (by default in the tests).
Now, one aspect of this failure is that with the change of checking the 
{{canBecomeVoter}} flag, we should update the above tests by building the 
{{RaftClientTestContext}} instance with a {{withCanBecomeVoter(true)}} because 
the context has that flag set to false by default. It would allow recreating 
the expected conditions (the node can become a voter, because it's a 
controller, and makes sense to take part at the leader election) and the tests 
pass.
Despite that, I had a doubt.

In the main use case I described in this JIRA, the node was a combined one, and 
it's now restarting as broker only, so the {{canBecomeVoter}} is false (that 
flag indicates the presence of "controller" role in the node, when the 
{{KafkaRaftClient}} is created). It's still part of the voters list (because 
the remove-controller was not called yet) but it's not a controller anymore so 
it can't take part to a leader election.
With the above change, instead of polling as voter, it's going to poll as an 
observer (the {{pollFollowerAsObserver}} method is called) and I was wondering 
if it's really correct taking into account that we are in a sort of "hybrid" 
state. The overall cluster sees the node being in the list of the voters but 
it's not a controller anymore and can't vote. Polling as an observer means that 
it's going to call {{shouldSendAddOrRemoveVoterRequest}} which anyway avoid the 
node to make this request because the {{canBecomeVoter}} is false (and it's 
checked there). 

My question is ... is it really an observer? If in this timeframe you request 
the quorum status, you will see it as voter and not in the observers list until 
making the remove-controller. Is it acceptable?

Anyway, I tested that change by building a custom Apache Kafka 4.2 container 
image for Strimzi and it works fine.

In conclusion I think the change makes sense, together with adapting the tests 
(to recreate the expected conditions) but I wanted insights from you experts 
first.


was (Author: ppatierno):
[~mimaison] [~showuon] 

After our offline discussion I was trying to apply the check on the 
{{canBecomeVoter}} flag within the {{KafkaRaftClient.pollFollower()}} method to 
avoid calling the {{pollFollowerAsVoter}} at all, so something like this:


{code:java}
private long pollFollower(long currentTimeMs) {
    FollowerState state = quorum.followerStateOrThrow();
    if (quorum.isVoter() && canBecomeVoter) {
        return pollFollowerAsVoter(state, currentTimeMs);
    } else {
        return pollFollowerAsObserver(state, currentTimeMs);
    }
}
{code}

In this case, this change breaks a bunch of tests where a voter is expecting to 
move in the {{Prospective}} state in order to take part at a new leader 
election.
The tests are the following:


{code:java}
testHandleEndQuorumRequest
testVoterBecomeProspectiveAfterFetchTimeout
testHandleVoteRequestAsProspective
testFetchResponseIgnoredAfterBecomingProspective
testHandleEndQuorumRequestWithLowerPriorityToBecomeLeader
{code}


Taking the first one as an example. It starts with the local node being part of 
voters and another node being the leader. The leader sends an 
{{EndQuorumRequest}} to step back being the leader and a new election should 
start so the local node is expected to move into {{Prospective}} state. But it 
happens because the voter sends a fetch request, doesn't get an answer (from 
leader) and based on the timeout it moves to {{Prospective}}. This happens 
within the {{pollFollowerAsVoter}} that is not called at all given that 
{{canBecomeVoter}} is false (by default in the tests).
Now, one aspect of this failure is that with the change of checking the 
{{canBecomeVoter}} flag, we should update the above tests by building the 
{{RaftClientTestContext}} instance with a {{withCanBecomeVoter(true)}} because 
the context has that flag set to false by default. It would allow recreating 
the expected conditions (the node can become a voter, because it's a 
controller, and makes sense to take part at the leader election) and the tests 
pass.
Despite that, I had a doubt.

In the main use case I described in this JIRA, the node was a combined one, and 
it's now restarting as broker only, so the {{canBecomeVoter}} is false (that 
flag indicates the presence of "controller" role in the node, when the 
{{KafkaRaftClient}} is created). It's still part of the voters list (because 
the remove-controller was not called yet) but it's not a controller anymore so 
it can't take part to a leader election.
With the above change, instead of polling as voter, it's going to poll as an 
observer (the {{pollFollowerAsObserver}} method is called) and I was wondering 
if it's really correct taking into account that we are in a sort of "hybrid" 
state. The overall cluster sees the node being in the list of the voters but 
it's not a controller anymore and can't vote. Polling as an observer means that 
it's going to call {{shouldSendAddOrRemoveVoterRequest}} which anyway avoid the 
node to make this request because the {{canBecomeVoter}} is false (and it's 
checked there). 

My question is ... is it really an observer? If in this timeframe you request 
the quorum status, you will see it as voter and not in the observers list until 
making the remove-controller. Is it acceptable?

Anyway, I tested that change by building a custom Apache Kafka 4.2 container 
image for Strimzi and it works fine.

> Broker only node sending UpdateVoteRequest when it can't really become a voter
> ------------------------------------------------------------------------------
>
>                 Key: KAFKA-19867
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19867
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Paolo Patierno
>            Priority: Major
>
> I am in the process of implementing support for controllers scaling within 
> the Strimzi project (running Apache Kafka on Kubernetes) by also using the 
> Apache Kafka code in the current "trunk" branch so the future 4.2.0 release 
> because I want to leverage the auto-join feature.
> When scaling down controllers, the auto-join related documentation mentions 
> that you should first shutdown the controller and later running the 
> remove-controller (via the kafka-metadata-quorum tool, or programmatically in 
> a Kubernetes operator case by using the RemoveRaftVoter via the Admin Client 
> API), otherwise it's pretty clear the node enters in a loop where you remove 
> it but it rejoins automatically again. 
> When managing a Kafka cluster running on bare metal/VMs, this approach works 
> fine even in case the controller scale-down is happening by removing the 
> controller role from a mixed node (shutdown the node, run 
> kafka-metadata-quorum tool to remove-controller, restart the node as broker 
> only). But in a cloud-native environment like Kubernetes, the pod rolling is 
> driven by the platform so there is no way to run a RemoveRaftVoter admin call 
> in between the shutdown and restart. For this reason, the remove-controller 
> is done when the node restarts as broker only.
> The issue I am facing is that when such a node restarts as broker only, but 
> it's still in the quorum voter (because the remove-controller isn't happened 
> yet), I get the following exception:
>  
> {code:java}
> 2025-10-31 08:01:21 TRACE [kafka-1-raft-io-thread] KafkaRaftClient:2899 - 
> [RaftManager id=1] Sent outbound request: OutboundRequest(correlationId=13, 
> data=UpdateRaftVoterRequestData(clusterId='zsn8QaOzTICYZBhUYQpJBg', 
> currentLeaderEpoch=2, voterId=1, voterDirectoryId=ceZ1jCL9DirrUuCxwsv-jw, 
> listeners=[], kRaftVersionFeature=KRaftVersionFeature(minSupportedVersion=0, 
> maxSupportedVersion=1)), createdTimeMs=1761897681990, 
> destination=my-cluster-broker-0.my-cluster-kafka-brokers.myproject.svc:9090 
> (id: 0 rack: null isFenced: false))2025-10-31 08:01:21 TRACE 
> [kafka-1-raft-io-thread] KafkaRaftClient:2830 - [RaftManager id=1] Received 
> inbound message InboundResponse(correlationId=13, 
> data=UpdateRaftVoterResponseData(throttleTimeMs=0, errorCode=42, 
> currentLeader=CurrentLeader(leaderId=0, leaderEpoch=2, 
> host='my-cluster-broker-0.my-cluster-kafka-brokers.myproject.svc', 
> port=9090)), 
> source=my-cluster-broker-0.my-cluster-kafka-brokers.myproject.svc:9090 (id: 0 
> rack: null isFenced: false))2025-10-31 08:01:21 ERROR 
> [kafka-1-raft-io-thread] ProcessTerminatingFaultHandler:46 - Encountered 
> fatal fault: Unexpected error in raft IO 
> threadjava.lang.IllegalStateException: Received unexpected invalid request 
> error  at 
> org.apache.kafka.raft.KafkaRaftClient.maybeHandleCommonResponse(KafkaRaftClient.java:2679)
>    at 
> org.apache.kafka.raft.KafkaRaftClient.handleUpdateVoterResponse(KafkaRaftClient.java:2569)
>    at 
> org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2737)
>       at 
> org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2836)
>         at 
> org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3680)        
> at 
> org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
>     at 
> org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
>  {code}
> It's happening because the node (which is now broker only) is sending a 
> UpdateRaftVoter request (because it sees itself still in the voters list) 
> even if it's not actually a controller and, of course, it's not able to 
> handle the response which is unexpected because it's a broker only node.
> I think, despite the remove-controller was not done yet, the broker-only node 
> should not send such a request even because in any case it's not able to 
> handle the response so it ends in a "broken" code path.
> The code where it's happening is within the 
> {{KafkaRaftClient.shouldSendUpdateVoteRequest}} where it's not checking the 
> {{canBecomeVoter}} flag before sending the request (here 
> https://github.com/apache/kafka/blob/trunk/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L3299).
>  Such a check is available in the {{shouldSendAddOrRemoveVoterRequest}} 
> method instead (here 
> https://github.com/apache/kafka/blob/trunk/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L3355).
> I think that adding the check would fix the issue because actually the node 
> is not a controller anymore and it can't really become a voter and that flag 
> is, of course, false avoiding the node to send the UpdateRaftVoter request.
> If accepted, I would be willing to open a PR to fix this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to