[ 
https://issues.apache.org/jira/browse/KAFKA-19785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081723#comment-18081723
 ] 

Jerome Waibel commented on KAFKA-19785:
---------------------------------------

[~kevinwu2412] 

Yes, it's still a small system, so we're running in combined mode.

I have full logs from both nodes for the event, but I'm not sure whether they 
contain sensitive data, so I won't upload them unfiltered here. I can, however, 
give you some of the quorum-state lines you mentioned.
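
For anyone who needs to share similar logs, here is a minimal sketch of a filter that keeps only raft-related lines and masks hostnames and IP addresses. The keyword list, the placeholder strings, and the function names ({{redact}}, {{filter_log}}) are assumptions made for illustration; they are not part of Kafka or of any tooling used on this cluster.

```python
import re

# Assumption: IPv4 addresses (optionally with a port) count as sensitive.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?\b")
# Assumption: Kubernetes service hostnames end in ".svc" (optionally with a port).
K8S_HOST = re.compile(r"\b[\w.-]+\.svc\b(?::\d+)?")
# Assumption: these keywords are enough to select the raft-related lines.
KEEP = re.compile(r"QuorumState|raft|IllegalStateException", re.IGNORECASE)


def redact(line):
    """Replace service hostnames and IPv4 addresses with placeholders."""
    line = K8S_HOST.sub("<host>", line)
    return IPV4.sub("<ip>", line)


def filter_log(lines):
    """Return the redacted lines that mention one of the KEEP keywords."""
    return [redact(line.rstrip("\n")) for line in lines if KEEP.search(line)]
```

Calling {{filter_log(open(path))}} on a broker log returns the lines to review by hand before posting. The patterns are deliberately narrow, so anything else that might be sensitive (topic names, client IDs) still needs a manual check.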

 

What initially started the problem was the shutdown and restart of nodepool-1 
(probably Kubernetes decided to move the pod to a new node).

 

The shutdown seems to have been clean and lasted from

{{2026-05-02 00:54:13 INFO  [SIGTERM handler] LoggingSignalHandler:93 - 
Terminating process due to signal SIGTERM}}

to

{{2026-05-02 00:54:19 INFO  [kafka-shutdown-hook] AppInfoParser:89 - App info 
kafka.server for 1 unregistered}}

 

At

{{2026-05-02T00:55:35.367Z    Starting Kafka with configuration:}}

the new instance of nodepool-1 started, which took until

{{2026-05-02 00:56:05 INFO  [main] KafkaRaftServer:69 - [KafkaRaftServer 
nodeId=1] Kafka Server started}}

 

During startup, nodepool-1 logs:

{{2026-05-02 00:56:00 WARN [main] QuorumState:158 - [RaftManager id=1] Epoch 
from quorum store file 
(/var/lib/kafka/data-0/kafka-log1/__cluster_metadata-0/quorum-state) is 0, 
which is smaller than last written epoch 66 in the log}}
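
To compare this warning across nodes, the two epochs can be pulled out of the message text. The following is a minimal sketch that assumes the message wording shown above; {{parse_epoch_warning}} and the returned keys are hypothetical names, not a Kafka API.

```python
import re

# Pattern built from the warning text quoted above; assumes that wording is stable.
EPOCH_WARNING = re.compile(
    r"\[RaftManager id=(?P<node>\d+)\] Epoch from quorum store file "
    r"\((?P<path>[^)]+)\) is (?P<file_epoch>\d+), "
    r"which is smaller than last written epoch (?P<log_epoch>\d+) in the log"
)


def parse_epoch_warning(text):
    """Extract node id, quorum-state path and both epochs, or return None."""
    # Collapse wrapped lines and repeated spaces into single spaces first.
    match = EPOCH_WARNING.search(" ".join(text.split()))
    if match is None:
        return None
    file_epoch = int(match.group("file_epoch"))
    log_epoch = int(match.group("log_epoch"))
    return {
        "node_id": int(match.group("node")),
        "path": match.group("path"),
        "file_epoch": file_epoch,
        "log_epoch": log_epoch,
        "gap": log_epoch - file_epoch,
    }
```

For the warning above this yields node 1 with a file epoch of 0 and a log epoch of 66.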

 

At 00:59:31 the first exception occurred on nodepool-0, crashing the server:

 
{{2026-05-02 00:59:31 ERROR [kafka-0-raft-io-thread] 
ProcessTerminatingFaultHandler:46 - Encountered fatal fault: Unexpected error 
in raft IO thread}}
{{ java.lang.IllegalStateException: Received request or response with leader 
OptionalInt[0] and epoch 69 which is inconsistent with current leader 
OptionalInt.empty and epoch 69}}
{{  at 
org.apache.kafka.raft.KafkaRaftClient.maybeTransition(KafkaRaftClient.java:2565)
 ~[kafka-raft-4.1.1.jar:?]}}
{{  at 
org.apache.kafka.raft.KafkaRaftClient.maybeHandleCommonResponse(KafkaRaftClient.java:2521)
 ~[kafka-raft-4.1.1.jar:?]}}
{{  at 
org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1723)
 ~[kafka-raft-4.1.1.jar:?]}}
{{  at 
org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2605) 
~[kafka-raft-4.1.1.jar:?]}}
{{  at 
org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2759)
 ~[kafka-raft-4.1.1.jar:?]}}
{{  at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3535) 
~[kafka-raft-4.1.1.jar:?]}}
{{  at 
org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
 [kafka-raft-4.1.1.jar:?]}}
{{  at 
org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
 [kafka-server-common-4.1.1.jar:?]}}
 

On one of the later restarts, nodepool-0 logs this once:

{{2026-05-02 01:13:04 WARN [main] QuorumState:158 - [RaftManager id=0] Epoch 
from quorum store file 
(/var/lib/kafka/data-0/kafka-log0/__cluster_metadata-0/quorum-state) is 0, 
which is smaller than last written epoch 69 in the log}}

 

After that crash, nodepool-0 was stuck in a crash loop with that exception, 
while nodepool-1 stopped working, logging:

{{2026-05-02 01:12:32 WARN  [broker-1-to-controller-forwarding-channel-manager] 
NetworkClient:899 - [NodeToControllerChannelManager id=1 name=forwarding] 
Connection to node 0 
(kafka-cluster-prod-kafka-cluster-prod-nodepool-0.kafka-cluster-prod-kafka-brokers.kafka.svc/10.76.10.22:9090)
 could not be established. Node may not be available.}}
{{2026-05-02 01:12:32 INFO  [broker-1-to-controller-forwarding-channel-manager] 
NodeToControllerRequestThread:69 - 
[broker-1-to-controller-forwarding-channel-manager]: Recorded new KRaft 
controller, from now on will use node 
kafka-cluster-prod-kafka-cluster-prod-nodepool-0.kafka-cluster-prod-kafka-brokers.kafka.svc:9090
 (id: 0 rack: null isFenced: false)}}
{{2026-05-02 01:12:32 INFO  [broker-1-to-controller-forwarding-channel-manager] 
NodeToControllerRequestThread:69 - 
[broker-1-to-controller-forwarding-channel-manager]: Recorded new KRaft 
controller, from now on will use node 
kafka-cluster-prod-kafka-cluster-prod-nodepool-0.kafka-cluster-prod-kafka-brokers.kafka.svc:9090
 (id: 0 rack: null isFenced: false)}}
{{2026-05-02 01:12:32 INFO  [broker-1-to-controller-forwarding-channel-manager] 
NetworkClient:1072 - [NodeToControllerChannelManager id=1 name=forwarding] Node 
0 disconnected.}}

and I no longer had a working Kafka cluster.

> Two Kafka brokers were not active in 3 node cluster setup
> ---------------------------------------------------------
>
>                 Key: KAFKA-19785
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19785
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, kraft
>    Affects Versions: 4.0.0
>            Reporter: Sravani
>            Priority: Major
>              Labels: kraft
>
> Hi Team,
> We faced a Kafka issue where two of the Kafka brokers were fenced and 
> Kafka was not able to process messages. We are using Kafka version 4.0.0. 
> The errors are below.
>  
> Sep 22 09:41:42 host kafka[42245]: [2025-09-22 07:41:42,419] ERROR 
> Encountered fatal fault: Unexpected error in raft IO thread 
> (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
> Sep 22 09:41:42 host kafka[42245]: java.lang.IllegalStateException: Received 
> request or response with leader OptionalInt[3] and epoch 55 which is 
> inconsistent with current leader OptionalInt.empty and epoch 55
> Sep 22 09:41:42 host kafka[42245]: #011at 
> org.apache.kafka.raft.KafkaRaftClient.maybeTransition(KafkaRaftClient.java:2528)
>  ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at 
> org.apache.kafka.raft.KafkaRaftClient.maybeHandleCommonResponse(KafkaRaftClient.java:2484)
>  ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at 
> org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1707)
>  ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at 
> org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:2568)
>  ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at 
> org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:2724)
>  ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at 
> org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:3460) 
> ~[kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at 
> org.apache.kafka.raft.KafkaRaftClientDriver.doWork(KafkaRaftClientDriver.java:64)
>  [kafka-raft-4.0.0.jar:?]
> Sep 22 09:41:42 host kafka[42245]: #011at 
> org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:136)
>  [kafka-server-common-4.0.0.jar:?]
> The metrics below show a FencedBrokerCount of 2.0:
> kafka_controller_KafkaController_Value{name="ActiveBrokerCount",} 1.0
> kafka_controller_KafkaController_Value{name="GlobalTopicCount",} 23.0
> kafka_controller_KafkaController_Value{name="FencedBrokerCount",} 2.0
> Please help us to resolve this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
