[
https://issues.apache.org/jira/browse/KAFKA-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18041389#comment-18041389
]
Stef Noten commented on KAFKA-19593:
------------------------------------
After debugging through Kafka, I found that my issue was caused by
KAFKA-19716: "OOM when loading large uncompacted __consumer_offsets partitions
with transactional workload".
* My problematic __consumer_offsets partition took a very long time to load
* While it was loading, the NotLeaderOrFollowerException was triggered in a
loop as the broker kept retrying to resend transaction markers
* The OOM was actually logged, but it was _buried_ among the
NotLeaderOrFollowerExceptions:
** Before the OOM: more than 200,000 repeats in 1.5 minutes → 6/ms !
** After the OOM, the exceptions kept coming (the coordinator kept reloading
and running out of memory again)
** With a limited number of rotating log files and this massive exception
spam, the error could no longer be found
*Actionable insights:*
* I would've expected an OOM to crash the process
* Log flooding could be prevented with e.g. exponential backoff (even a 10 ms
cap would help)
* The error could be clearer, e.g. indicating that the coordinator for the
partition is still loading. NotLeaderOrFollowerException makes sense from a
client's perspective, I guess, but here the broker was initializing itself, so
it seemed to indicate some kind of corrupted state (e.g. conflicts on broker
epoch after recovery?)
> Stuck __consumer_offsets partition (kafka streams app)
> ------------------------------------------------------
>
> Key: KAFKA-19593
> URL: https://issues.apache.org/jira/browse/KAFKA-19593
> Project: Kafka
> Issue Type: Bug
> Components: consumer, streams
> Affects Versions: 4.0.0
> Reporter: Matej Pucihar
> Priority: Major
> Labels: kafka-streams
>
> h3. Problem Summary
> My Kafka Streams application cannot move its {{state_store}} from
> {{STARTING}} to {{RUNNING}}.
> I'm using a *Strimzi Kafka cluster* with:
> * 3 *controller nodes*
> * 4 *broker nodes*
> h3. Observations
> h4. Partition {{__consumer_offsets-35}} is {*}stuck{*}.
> From AKHQ, partition details:
> * *Broker 10* is the *leader* of {{__consumer_offsets-35}}
> * There are *no interesting logs* on broker 10
> * However, logs are *spamming every 10ms* from broker 11 (a {*}replica{*}):
> 2025-08-11 04:05:50 INFO [TxnMarkerSenderThread-11]
> TransactionMarkerRequestCompletionHandler:66
> [Transaction Marker Request Completion Handler 10]: Sending
> irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-4's
> transaction marker for partition __consumer_offsets-35 has failed with error
> org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with
> current coordinator epoch 38
> h4. Brokers 20 and 21, which are neither leaders nor replicas, are also
> spamming the same error:
> *Broker 20:*
> 2025-08-11 04:39:45 INFO [TxnMarkerSenderThread-20]
> TransactionMarkerRequestCompletionHandler:66
> Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-3's
> transaction marker for partition __consumer_offsets-35 has failed with error
> org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with
> current coordinator epoch 54
>
> *Broker 21:*
> 2025-08-11 04:39:58 INFO [TxnMarkerSenderThread-21]
> TransactionMarkerRequestCompletionHandler:66
> Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-2's
> transaction marker for partition __consumer_offsets-35 has failed with error
> org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with
> current coordinator epoch 28
>
> ----
> h3. Kafka Streams App Behavior
> Logs from the Kafka Streams app (at debug level) repeat continuously. The
> {{state_store}} *never transitions* from {{STARTING}} to {{RUNNING}}.
> Key repeated logs (debug log level):
> * Polling main consumer repeatedly
> * SASL/SCRAM authentication succeeds
> * 0 records fetched
> * 0 records processed
> * Punctuators run, but nothing gets committed
> * Fails to commit due to {*}rebalance in progress{*}, retrying…
> ----
> h3. Workarounds Considered
> The *only thing that temporarily resolves the issue* is:
> * Physically deleting the partition files for {{__consumer_offsets-35}} from
> both the leader and replica brokers
> Other drastic options:
> * Deleting the entire {{__consumer_offsets}} topic
> * Re-creating the entire Kafka cluster
> ----
> h3. Additional Info
> * I cannot reproduce this in a *clean git project*
> * The issue is isolated to a {*}"corrupt" cluster{*}, which is still
> available for inspection
> * This problem has occurred *4 times* in the *past month*
> * It *started happening after upgrading from Strimzi 3.9 to 4.0*
> * I'm using Quarkus (kafka-streams version 4.0.0) with the default
> configuration; the only config worth mentioning is that I'm using the
> exactly_once_v2 processing guarantee.
> ----
> h3. Help Needed
> I'm hoping someone can {*}make sense of this issue{*}.
> Please feel free to *reach out.*
--
This message was sent by Atlassian Jira
(v8.20.10#820010)