Matej Pucihar created KAFKA-19593:
-------------------------------------

             Summary: Stuck __consumer_offsets partition (kafka streams app)
                 Key: KAFKA-19593
                 URL: https://issues.apache.org/jira/browse/KAFKA-19593
             Project: Kafka
          Issue Type: Bug
          Components: consumer, streams
    Affects Versions: 4.0.0
            Reporter: Matej Pucihar

h3. Problem Summary

My Kafka Streams application cannot move its {{state_store}} from {{STARTING}} to {{RUNNING}}.

I'm using a *Strimzi Kafka cluster* with:
* 3 *controller nodes*
* 4 *broker nodes*

h3. Observations

h4. Partition {{__consumer_offsets-35}} is *stuck*.

From AKHQ, partition details:
* *Broker 10* is the *leader* of {{__consumer_offsets-35}}
* There are *no interesting logs* on broker 10
* However, broker 11 (a *replica*) is *spamming the following log every 10ms*:

2025-08-11 04:05:50 INFO [TxnMarkerSenderThread-11] TransactionMarkerRequestCompletionHandler:66 [Transaction Marker Request Completion Handler 10]: Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-4's transaction marker for partition __consumer_offsets-35 has failed with error org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with current coordinator epoch 38

h4. Brokers 20 and 21, which are neither leaders nor replicas of this partition, are also spamming the same error:

*Broker 20:*

2025-08-11 04:39:45 INFO [TxnMarkerSenderThread-20] TransactionMarkerRequestCompletionHandler:66 Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-3's transaction marker for partition __consumer_offsets-35 has failed with error org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with current coordinator epoch 54

*Broker 21:*

2025-08-11 04:39:58 INFO [TxnMarkerSenderThread-21] TransactionMarkerRequestCompletionHandler:66 Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-2's transaction marker for partition __consumer_offsets-35 has failed with error org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with current coordinator epoch 28

----

h3. Kafka Streams App Behavior

Logs from the Kafka Streams app (at debug level) repeat continuously. The {{state_store}} *never transitions* from {{STARTING}} to {{RUNNING}}.

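A Streams application commits its offsets under a consumer group whose id is its {{application.id}}, so it may help to confirm that this group really maps to the stuck partition. The following is a minimal sketch, not taken from the broker code: it assumes the coordinator selects the {{__consumer_offsets}} partition as the non-negative hash of the group id modulo {{offsets.topic.num.partitions}} (default 50), which should be verified against the broker version in use. The class name {{OffsetsPartition}}, the method {{partitionFor}}, and the placeholder group id are hypothetical.

```java
public class OffsetsPartition {

    // Assumed mapping: non-negative String.hashCode() of the group id,
    // modulo the number of __consumer_offsets partitions.
    // Integer.MIN_VALUE is mapped to 0 because Math.abs would keep it negative.
    public static int partitionFor(String groupId, int numPartitions) {
        int hash = groupId.hashCode();
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % numPartitions;
    }

    public static void main(String[] args) {
        // "my-streams-app" is a placeholder; pass the real application.id as the first argument.
        String groupId = args.length > 0 ? args[0] : "my-streams-app";
        // 50 is the default for offsets.topic.num.partitions; override it if the cluster differs.
        int numPartitions = args.length > 1 ? Integer.parseInt(args[1]) : 50;
        System.out.println(groupId + " -> __consumer_offsets-" + partitionFor(groupId, numPartitions));
    }
}
```

If the application's group id maps to 35 under this assumption, that would be consistent with the transaction markers above all targeting {{__consumer_offsets-35}}.
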
Key repeated logs (debug log level):
* Polling main consumer repeatedly
* SASL/SCRAM authentication succeeds
* 0 records fetched
* 0 records processed
* Punctuators run, but nothing gets committed
* Fails to commit due to *rebalance in progress*, retrying…

----

h3. Workarounds Considered

The *only thing that temporarily resolves the issue* is:
* Physically deleting the partition files for {{__consumer_offsets-35}} from both the leader and replica brokers

Other drastic options:
* Deleting the entire {{__consumer_offsets}} topic
* Re-creating the entire Kafka cluster

----

h3. Additional Info

* I cannot reproduce this in a *clean git project*
* The issue is isolated to a *"corrupt" cluster*, which is still available for inspection
* This problem has occurred *4 times* in the *past month*
* It *started happening after upgrading from Strimzi 3.9 to 4.0*
* I'm using Quarkus (kafka-streams version 4.0.0) with the default configuration; the only setting worth mentioning is that I'm using the {{exactly_once_v2}} processing guarantee.

----

h3. Help Needed

I'm hoping someone can *make sense of this issue*.

Please feel free to *reach out.*

--
This message was sent by Atlassian Jira
(v8.20.10#820010)