Hello,

In our Kafka 4.0 cluster (dynamic quorum, 5 controller nodes), we have observed that KRaft observers (process.roles=broker), which normally send FETCH requests to the quorum leader, can end up (re-)bootstrapping to a voter (follower) node and staying stuck on it indefinitely, likely after some kind of request failure or timeout. Once this happens, the observer's high watermark/metadata offset stops advancing, causing issues such as out-of-sync replicas during partition reassignments. We also observe a high rate of NOT_LEADER_OR_FOLLOWER errors (kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=FETCH,error=NOT_LEADER_OR_FOLLOWER) on the voter node that is receiving those FETCH requests.
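For what it's worth, here is a minimal sketch of how the stuck observer can be confirmed from the Admin API's describeMetadataQuorum() (KIP-836); the bootstrap address and class name are placeholders, and this is just one way to surface the lag, not necessarily how the issue was first detected:

```
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.QuorumInfo;

public class QuorumLagCheck {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // Placeholder address; any reachable broker in the cluster works.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1.kafka.example.com:9092");

        try (Admin admin = Admin.create(props)) {
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
            System.out.println("leaderId=" + quorum.leaderId() + " leaderEpoch=" + quorum.leaderEpoch());

            // Use the leader's log end offset as the reference point for lag.
            long leaderEndOffset = quorum.voters().stream()
                    .filter(r -> r.replicaId() == quorum.leaderId())
                    .mapToLong(QuorumInfo.ReplicaState::logEndOffset)
                    .findFirst().orElse(-1L);

            // Observers (process.roles=broker) whose log end offset stops advancing
            // relative to the leader are the nodes stuck in the state described above.
            for (QuorumInfo.ReplicaState observer : quorum.observers()) {
                System.out.printf("observer %d: logEndOffset=%d lag=%d lastFetchTimestamp=%s%n",
                        observer.replicaId(),
                        observer.logEndOffset(),
                        leaderEndOffset - observer.logEndOffset(),
                        observer.lastFetchTimestamp());
            }
        }
    }
}
```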
The observer only recovers after we restart the voter it (re-)bootstrapped to, which triggers another re-bootstrap to a random voter node. If by chance the observer connects to the actual leader, metadata replication recovers and the errors stop.

With DEBUG logs enabled on the KRaft controllers, we repeatedly see the following log on the voter node that is incorrectly receiving the FETCH requests:

```
Completed request:{"isForwarded":false,"requestHeader":{"requestApiKey":1,"requestApiVersion":17,"correlationId":11131463,"clientId":"raft-client-81","requestApiKeyName":"FETCH"},"request":{"clusterId":"vD8YJMbtQMyOnzTfZ5RK4g","replicaState":{"replicaId":81,"replicaEpoch":-1},"maxWaitMs":500,"minBytes":0,"maxBytes":8388608,"isolationLevel":0,"sessionId":0,"sessionEpoch":-1,"topics":[{"topicId":"AAAAAAAAAAAAAAAAAAAAAQ","partitions":[{"partition":0,"currentLeaderEpoch":69,"fetchOffset":4793550,"lastFetchedEpoch":69,"logStartOffset":-1,"partitionMaxBytes":0,"replicaDirectoryId":"sJjyY5zzN1XLxEmDoWwVig"}]}],"forgottenTopicsData":[],"rackId":""},"response":{"throttleTimeMs":0,"errorCode":0,"sessionId":0,"responses":[{"topicId":"AAAAAAAAAAAAAAAAAAAAAQ","partitions":[{"partitionIndex":0,"errorCode":6,"highWatermark":-1,"lastStableOffset":-1,"logStartOffset":4782613,"currentLeader":{"leaderId":7002,"leaderEpoch":69},"abortedTransactions":[],"preferredReadReplica":-1,"recordsSizeInBytes":0}]}],"nodeEndpoints":[{"nodeId":7002,"host":"kraftcontroller-7002.kafka.<redacted>.com.","port":9095,"rack":null}]},"connection":"<redacted_serverIp>:<serverPort>-<redacted_clientIp>:<clientPort>","totalTimeMs":0.458,"requestQueueTimeMs":0.113,"localTimeMs":0.086,"remoteTimeMs":0.203,"throttleTimeMs":0,"responseQueueTimeMs":0.021,"sendTimeMs":0.032,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"KRAFT_CONTROLLER","clientInformation":{"softwareName":"apache-kafka-java","softwareVersion":"4.0.0"}}
```

Note that the top-level error code is 0 (success), but `response.partitions[0].errorCode` is 6 (NOT_LEADER_OR_FOLLOWER). Tracing through the FETCH logic of the KafkaRaftClient, the response appears to be handled "successfully" by the `maybeHandleCommonResponse` method (https://github.com/apache/kafka/blob/4.0.0/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L1707-L1717), yet the correct leader returned in the response (node 7002) is not used for subsequent requests. The Raft client keeps sending to the incorrect voter node and never re-bootstraps or backs off (https://github.com/apache/kafka/blob/4.0.0/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L3224-L3225) until a voter node is restarted.

We configure each node's `controller.quorum.bootstrap.servers` to a single host that load balances across the 5 KRaft controllers, but I do not believe that explicitly listing all 5 host:port pairs would prevent this issue (a sketch of both configurations is included below for reference).

I wanted to confirm whether this is indeed a bug within KRaft, or a potential misconfiguration on our end.

Thank you!
Justin C
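For reference, a minimal sketch of the two controller.quorum.bootstrap.servers configurations mentioned above; hostnames, ports, and node IDs are placeholders for our redacted values:

```
# What we run today: a single DNS name that load balances across the 5 KRaft controllers
# (hostnames, ports, and node IDs below are placeholders).
process.roles=broker
controller.quorum.bootstrap.servers=kraftcontroller.kafka.example.com:9095

# Alternative considered: list all 5 controllers explicitly, e.g.
# controller.quorum.bootstrap.servers=kraftcontroller-7000.kafka.example.com:9095,kraftcontroller-7001.kafka.example.com:9095,kraftcontroller-7002.kafka.example.com:9095,kraftcontroller-7003.kafka.example.com:9095,kraftcontroller-7004.kafka.example.com:9095
```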