[
https://issues.apache.org/jira/browse/KAFKA-19867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035761#comment-18035761
]
Luke Chen commented on KAFKA-19867:
-----------------------------------
Adding this canBecomeVoter check will not impact or break anything, because a
node must already be a voter to enter the `pollFollowerAsVoter` method. If the
problem stems from the auto.join voter feature's design not being cloud-native
friendly, adding this extra check to work around the combined-node issue might
be an option, IMO. Meanwhile, we should still find a better solution for the
root cause of KAFKA-19850.
> Broker only node sending UpdateVoteRequest when it can't really become a voter
> ------------------------------------------------------------------------------
>
> Key: KAFKA-19867
> URL: https://issues.apache.org/jira/browse/KAFKA-19867
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 4.2.0
> Reporter: Paolo Patierno
> Priority: Major
>
> I am implementing support for controller scaling in the Strimzi project
> (running Apache Kafka on Kubernetes), using the Apache Kafka code in the
> current "trunk" branch (the future 4.2.0 release) because I want to leverage
> the auto-join feature.
> When scaling down controllers, the auto-join documentation says that you
> should first shut down the controller and then run remove-controller (via
> the kafka-metadata-quorum tool or, in the case of a Kubernetes operator,
> programmatically via RemoveRaftVoter in the Admin Client API). Otherwise the
> node enters a loop where you remove it but it automatically rejoins.
> When managing a Kafka cluster running on bare metal/VMs, this approach works
> fine even when the controller scale-down happens by removing the controller
> role from a mixed node (shut down the node, run the kafka-metadata-quorum
> tool to remove-controller, restart the node as broker only). But in a
> cloud-native environment like Kubernetes, pod rolling is driven by the
> platform, so there is no way to run a RemoveRaftVoter admin call between the
> shutdown and the restart. For this reason, the remove-controller is done
> after the node restarts as broker only.
> The issue I am facing is that when such a node restarts as broker only but
> is still in the quorum voter set (because the remove-controller hasn't
> happened yet), I get the following exception:
>
> {code:java}
> 2025-10-31 08:01:21 TRACE [kafka-1-raft-io-thread] KafkaRaftClient:2899 - [RaftManager id=1] Sent outbound request: OutboundRequest(correlationId=13, data=UpdateRaftVoterRequestData(clusterId='zsn8QaOzTICYZBhUYQpJBg', currentLeaderEpoch=2, voterId=1, voterDirectoryId=ceZ1jCL9DirrUuCxwsv-jw, listeners=[], kRaftVersionFeature=KRaftVersionFeature(minSupportedVersion=0, maxSupportedVersion=1)), createdTimeMs=1761897681990, destination=my-cluster-broker-0.my-cluster-kafka-brokers.myproject.svc:9090 (id: 0 rack: null isFenced: false))
> 2025-10-31 08:01:21 TRACE [kafka-1-raft-io-thread] KafkaRaftClient:2830 - [RaftManager id=1] Received inbound message InboundResponse(correlationId=13, data=UpdateRaftVoterResponseData(throttleTimeMs=0, errorCode=42, currentLeader=CurrentLeader(leaderId=0, leaderEpoch=2, host='my-cluster-broker-0.my-cluster-kafka-brokers.myproject.svc', port=9090)), source=my-cluster-broker-0.my-cluster-kafka-brokers.myproject.svc:9090 (id: 0 rack: null isFenced: false))
> 2025-10-31 08:01:21 ERROR [kafka-1-raft-io-thread] ProcessTerminatingFaultHandler:46 - Encountered fatal fault: Unexpected error in raft IO thread
> java.lang.IllegalStateException: Received unexpected invalid request error
>     at org.apache.kafka.raft.KafkaRaftClient.maybeHandleCommonResponse(KafkaRaftClient.java:2679)
>     at org.apache.kafka.raft.KafkaRaftClient.handleUpdateVoterResponse(KafkaRaftClient.java:2569)
>     at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2737)
>     at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2836)
>     at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3680)
>     at org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
>     at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
> {code}
> This happens because the node (which is now broker only) sends an
> UpdateRaftVoter request (because it still sees itself in the voters list)
> even though it's not actually a controller and, of course, it cannot handle
> the response, which is unexpected for a broker-only node.
> I think that, even though the remove-controller has not been done yet, the
> broker-only node should not send such a request, because in any case it
> cannot handle the response and ends up in a "broken" code path.
> The code where this happens is
> {{KafkaRaftClient.shouldSendUpdateVoteRequest}}, which does not check the
> {{canBecomeVoter}} flag before sending the request (here
> https://github.com/apache/kafka/blob/trunk/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L3299).
> Such a check is present in the {{shouldSendAddOrRemoveVoterRequest}} method
> (here
> https://github.com/apache/kafka/blob/trunk/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L3355).
> I think adding the check would fix the issue: the node is no longer a
> controller and can't really become a voter, so the flag is false, which
> prevents the node from sending the UpdateRaftVoter request.
> If accepted, I would be willing to open a PR to fix this.
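
The guard proposed in the description can be illustrated with a minimal,
self-contained sketch. It is not the KafkaRaftClient code: the class name, the
method names, and the boolean parameters are all hypothetical, and the real
method presumably evaluates further conditions that are omitted here. The
sketch only shows how requiring canBecomeVoter changes the outcome for a node
that restarted as broker only while still being listed as a voter.

```java
// Hypothetical, simplified model of the guard discussed in this issue.
// It does not reproduce KafkaRaftClient; all names here are illustrative.
final class UpdateVoterGuardSketch {

    // Behavior as described in the report: the decision only considers
    // whether the node still sees itself in the voter set.
    static boolean shouldSendWithoutCheck(boolean listedAsVoter) {
        return listedAsVoter;
    }

    // Proposed behavior: additionally require that the node can become a
    // voter, i.e. that it actually runs the controller role.
    static boolean shouldSendWithCheck(boolean listedAsVoter, boolean canBecomeVoter) {
        return canBecomeVoter && listedAsVoter;
    }

    public static void main(String[] args) {
        boolean listedAsVoter = true;   // remove-controller has not happened yet
        boolean canBecomeVoter = false; // node restarted as broker only

        System.out.println("without check: " + shouldSendWithoutCheck(listedAsVoter));
        System.out.println("with check: " + shouldSendWithCheck(listedAsVoter, canBecomeVoter));
    }
}
```

In this simplified model, a broker-only node that is still in the voter set
would send the request without the check and would skip it with the check,
while a real controller (canBecomeVoter true) is unaffected.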