Matej Pucihar created KAFKA-19593:
-------------------------------------

             Summary: Stuck __consumer_offsets partition (kafka streams app)
                 Key: KAFKA-19593
                 URL: https://issues.apache.org/jira/browse/KAFKA-19593
             Project: Kafka
          Issue Type: Bug
          Components: consumer, streams
    Affects Versions: 4.0.0
            Reporter: Matej Pucihar


h3. Problem Summary

My Kafka Streams application cannot move its {{state_store}} from {{STARTING}} 
to {{RUNNING}}.

I'm using a *Strimzi Kafka cluster* with:
 * 3 *controller nodes*

 * 4 *broker nodes*

h3. Observations
h4. Partition {{__consumer_offsets-35}} is *stuck*.

From AKHQ, partition details:
 * *Broker 10* is the *leader* of {{__consumer_offsets-35}}

 * There are *no interesting logs* on broker 10

 * However, broker 11 (a *replica*) logs the following message *every 10 ms*:

2025-08-11 04:05:50 INFO  [TxnMarkerSenderThread-11] 
TransactionMarkerRequestCompletionHandler:66 
[Transaction Marker Request Completion Handler 10]: Sending 
irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-4's 
transaction marker for partition __consumer_offsets-35 has failed with error 
org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with 
current coordinator epoch 38
h4. Brokers 20 and 21 (neither leaders nor replicas) are also logging the same error:

*Broker 20:*
2025-08-11 04:39:45 INFO  [TxnMarkerSenderThread-20] 
TransactionMarkerRequestCompletionHandler:66 
Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-3's 
transaction marker for partition __consumer_offsets-35 has failed with error 
org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with 
current coordinator epoch 54
 
*Broker 21:*
2025-08-11 04:39:58 INFO  [TxnMarkerSenderThread-21] 
TransactionMarkerRequestCompletionHandler:66 
Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-2's 
transaction marker for partition __consumer_offsets-35 has failed with error 
org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with 
current coordinator epoch 28
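
To check which consumer groups depend on the stuck partition, it may help to recall how a group id is mapped to a {{__consumer_offsets}} partition. As far as I know, the broker takes the non-negative hash of the group id modulo {{offsets.topic.num.partitions}} (50 by default). The sketch below reproduces that calculation in plain Java. The class name and the sample group id are hypothetical; for a Kafka Streams app the group id is the {{application.id}}.

```java
// Sketch: which __consumer_offsets partition holds a given group's offsets.
// Assumption: mirrors the broker's mapping, abs(groupId.hashCode()) % partition count.
class OffsetsPartitionSketch {

    // Masking the sign bit (instead of Math.abs) keeps Integer.MIN_VALUE non-negative.
    static int partitionFor(String groupId, int offsetsTopicPartitions) {
        return (groupId.hashCode() & 0x7fffffff) % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        // Hypothetical group id; pass the real application.id as the first argument.
        String groupId = args.length > 0 ? args[0] : "my-streams-app";
        int partitions = 50; // default offsets.topic.num.partitions; check the cluster's value
        System.out.println(groupId + " -> __consumer_offsets-" + partitionFor(groupId, partitions));
    }
}
```

If the affected application's id maps to partition 35, that would be consistent with its offset commits and transaction markers all landing on the stuck partition.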
 
----
h3. Kafka Streams App Behavior

Logs from the Kafka Streams app (at debug level) repeat continuously. The 
{{state_store}} *never transitions* from {{STARTING}} to {{RUNNING}}.

Key repeated logs (debug log level):
 * Polling main consumer repeatedly

 * SASL/SCRAM authentication succeeds

 * 0 records fetched

 * 0 records processed

 * Punctuators run, but nothing gets committed

 * Fails to commit due to *rebalance in progress*, retrying…

----
h3. Workarounds Considered

The *only thing that temporarily resolves the issue* is:
 * Physically deleting the partition files for {{__consumer_offsets-35}} from 
both the leader and replica brokers

Other drastic options:
 * Deleting the entire {{__consumer_offsets}} topic

 * Re-creating the entire Kafka cluster

----
h3. Additional Info
 * I cannot reproduce this in a *clean git project*

 * The issue is isolated to a *"corrupt" cluster*, which is still available 
for inspection

 * This problem has occurred *4 times* in the *past month*

 * It *started happening after upgrading from Strimzi 3.9 to 4.0*

 * I'm using Quarkus (kafka-streams version 4.0.0) with the default 
configuration. The only setting worth mentioning is the 
{{exactly_once_v2}} processing guarantee.
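
For reference, the setup described in the last bullet corresponds roughly to the Kafka Streams properties below. This is only a sketch using plain {{java.util.Properties}}. The application id and bootstrap address are placeholders, not values from the affected cluster, and in a Quarkus app these would normally be set in the application's configuration file rather than in code.

```java
import java.util.Properties;

// Sketch of the relevant Kafka Streams settings; everything else stays at its default.
class EosV2ConfigSketch {

    static Properties streamsProperties() {
        Properties props = new Properties();
        props.put("application.id", "my-streams-app");     // placeholder
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("processing.guarantee", "exactly_once_v2");
        return props;
    }

    public static void main(String[] args) {
        // Print the properties so they can be compared with the running app's config.
        streamsProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With {{exactly_once_v2}}, offsets are committed as part of producer transactions, which is why transaction markers are written to {{__consumer_offsets}} partitions at all.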

----
h3. Help Needed

I'm hoping someone can *make sense of this issue*.

Please feel free to *reach out.*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
