[
https://issues.apache.org/jira/browse/KAFKA-19785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081723#comment-18081723
]
Jerome Waibel commented on KAFKA-19785:
---------------------------------------
[~kevinwu2412]
Yes, it's still a small system, so we're running in combined mode.
I do have full logs from both nodes from the event, but I'm not sure whether they
contain sensitive data, so I won't upload them unfiltered here. I can, however,
give you some lines about the quorum-state you mentioned.
What initially started the problem was the shutdown and restart of nodepool-1
(probably Kubernetes decided to move the pod to a new node).
The shutdown appears to have been clean and lasted from
{{2026-05-02 00:54:13 INFO [SIGTERM handler] LoggingSignalHandler:93 -
Terminating process due to signal SIGTERM}}
to
{{2026-05-02 00:54:19 INFO [kafka-shutdown-hook] AppInfoParser:89 - App info
kafka.server for 1 unregistered}}
At
{{2026-05-02T00:55:35.367Z Starting Kafka with configuration:}}
the new instance of nodepool-1 started, which took until
{{2026-05-02 00:56:05 INFO [main] KafkaRaftServer:69 - [KafkaRaftServer
nodeId=1] Kafka Server started}}
During startup, nodepool-1 logs
{{2026-05-02 00:56:00 WARN [main] QuorumState:158 - [RaftManager id=1] Epoch
from quorum store file
(/var/lib/kafka/data-0/kafka-log1/__cluster_metadata-0/quorum-state) is 0,
which is smaller than last written epoch 66 in the log}}
At 00:59:31 the first exception at nodepool-0 occurs, crashing the server.
{{2026-05-02 00:59:31 ERROR [kafka-0-raft-io-thread]
ProcessTerminatingFaultHandler:46 - Encountered fatal fault: Unexpected error
in raft IO thread}}
{{ java.lang.IllegalStateException: Received request or response with leader
OptionalInt[0] and epoch 69 which is inconsistent with current leader
OptionalInt.empty and epoch 69}}
{{ at
org.apache.kafka.raft.KafkaRaftClient.maybeTransition(KafkaRaftClient.java:2565)
~[kafka-raft-4.1.1.jar:?]}}
{{ at
org.apache.kafka.raft.KafkaRaftClient.maybeHandleCommonResponse(KafkaRaftClient.java:2521)
~[kafka-raft-4.1.1.jar:?]}}
{{ at
org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1723)
~[kafka-raft-4.1.1.jar:?]}}
{{ at
org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2605)
~[kafka-raft-4.1.1.jar:?]}}
{{ at
org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2759)
~[kafka-raft-4.1.1.jar:?]}}
{{ at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3535)
~[kafka-raft-4.1.1.jar:?]}}
{{ at
org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
[kafka-raft-4.1.1.jar:?]}}
{{ at
org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
[kafka-server-common-4.1.1.jar:?]}}
On one of the later restarts, nodepool-0 logs this once:
{{2026-05-02 01:13:04 WARN [main] QuorumState:158 - [RaftManager id=0] Epoch
from quorum store file
(/var/lib/kafka/data-0/kafka-log0/__cluster_metadata-0/quorum-state) is 0,
which is smaller than last written epoch 69 in the log}}
After that crash, nodepool-0 was in a crash loop with that exception, while
nodepool-1 no longer worked, logging
{{2026-05-02 01:12:32 WARN [broker-1-to-controller-forwarding-channel-manager]
NetworkClient:899 - [NodeToControllerChannelManager id=1 name=forwarding]
Connection to node 0
(kafka-cluster-prod-kafka-cluster-prod-nodepool-0.kafka-cluster-prod-kafka-brokers.kafka.svc/10.76.10.22:9090)
could not be established. Node may not be available.}}
{{2026-05-02 01:12:32 INFO [broker-1-to-controller-forwarding-channel-manager]
NodeToControllerRequestThread:69 -
[broker-1-to-controller-forwarding-channel-manager]: Recorded new KRaft
controller, from now on will use node
kafka-cluster-prod-kafka-cluster-prod-nodepool-0.kafka-cluster-prod-kafka-brokers.kafka.svc:9090
(id: 0 rack: null isFenced: false)}}
{{2026-05-02 01:12:32 INFO [broker-1-to-controller-forwarding-channel-manager]
NodeToControllerRequestThread:69 -
[broker-1-to-controller-forwarding-channel-manager]: Recorded new KRaft
controller, from now on will use node
kafka-cluster-prod-kafka-cluster-prod-nodepool-0.kafka-cluster-prod-kafka-brokers.kafka.svc:9090
(id: 0 rack: null isFenced: false)}}
{{2026-05-02 01:12:32 INFO [broker-1-to-controller-forwarding-channel-manager]
NetworkClient:1072 - [NodeToControllerChannelManager id=1 name=forwarding] Node
0 disconnected.}}
and I no longer had a working Kafka cluster.
> Two Kafka brokers were not active in 3 node cluster setup
> ---------------------------------------------------------
>
> Key: KAFKA-19785
> URL: https://issues.apache.org/jira/browse/KAFKA-19785
> Project: Kafka
> Issue Type: Bug
> Components: core, kraft
> Affects Versions: 4.0.0
> Reporter: Sravani
> Priority: Major
> Labels: kraft
>
> Hi Team,
> We faced a Kafka issue where two of the Kafka brokers were fenced and
> Kafka was not able to process messages. We are using Kafka version 4.0.0.
> The errors are below.
>
> Sep 22 09:41:42 host kafka[42245]: [2025-09-22 07:41:42,419] ERROR
> Encountered fatal fault: Unexpected error in raft IO thread
> (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
> Sep 22 09:41:42 host kafka[42245]: java.lang.IllegalStateException: Received
> request or response with leader OptionalInt[3] and epoch 55 which is
> inconsistent with current leader OptionalInt.empty and epoch 55
> Sep 22 09:41:42 host kafka[42245]: #011at
> org.apache.kafka.raft.KafkaRaftClient.maybeTransition(KafkaRaftClient.java:2528)
> ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at
> org.apache.kafka.raft.KafkaRaftClient.maybeHandleCommonResponse(KafkaRaftClient.java:2484)
> ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at
> org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1707)
> ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at
> org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2568)
> ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at
> org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2724)
> ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at
> org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3460)
> ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at
> org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
> [kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at
> org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
> [kafka-server-common-4.0.0.jar:?]
> The metrics below show a FencedBrokerCount of 2.0:
> kafka_controller_KafkaController_Value\{name="ActiveBrokerCount",} 1.0
> kafka_controller_KafkaController_Value\{name="GlobalTopicCount",} 23.0
> kafka_controller_KafkaController_Value\{name="FencedBrokerCount",} 2.0
> Please help us to resolve this issue.