[ https://issues.apache.org/jira/browse/KAFKA-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18041389#comment-18041389 ]

Stef Noten commented on KAFKA-19593:
------------------------------------

After debugging through Kafka, I've found that my issue was caused by 
KAFKA-19716: "OOM when loading large uncompacted __consumer_offsets partitions 
with transactional workload".
 * My problematic __consumer_offsets partition took very long to load
 * While it was loading, NotLeaderOrFollowerException was triggered in a loop 
as transaction markers were repeatedly resent
 * The OOM was actually logged, but it was _buried_ among the 
NotLeaderOrFollowerExceptions:
 ** Before the OOM: more than 200,000 repeats in 1.5 minutes → 6/ms!
 ** After the OOM, the exceptions keep coming (the coordinator keeps reloading 
and hitting OOM again)
 ** With a limited number of rotating log files and this massive exception 
spam, the error could no longer be found

*Actionable insights:*
 * I would have expected an OOM to crash the process
 * Log flooding could be prevented with e.g. exponential backoff (even a 10 ms 
cap would help)
 * The error could be clearer, e.g. by indicating that the coordinator for the 
partition is still loading. NotLeaderOrFollowerException makes sense from a 
client's perspective, but here the broker was initializing itself, so the error 
seemed to indicate some kind of corrupted state (e.g. conflicts on broker epoch 
after recovery?)
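
As a rough sketch of the backoff suggestion above (hypothetical helper, not Kafka's actual retry code; the class and method names are illustrative), a capped exponential backoff for the marker-resend loop could look like:

```java
// Hypothetical sketch: capped exponential backoff for a retry loop.
// BackoffPolicy / nextDelayMs are illustrative names, not Kafka APIs.
public class BackoffPolicy {
    private final long baseMs;
    private final long maxMs;

    public BackoffPolicy(long baseMs, long maxMs) {
        this.baseMs = baseMs;
        this.maxMs = maxMs;
    }

    /** Delay before the given retry attempt (0-based): doubles each attempt, capped at maxMs. */
    public long nextDelayMs(int attempt) {
        long delay = baseMs << Math.min(attempt, 30); // clamp shift to avoid overflow
        return Math.min(delay, maxMs);
    }

    public static void main(String[] args) {
        // Even a 10 ms cap would turn ~6 log lines per millisecond into at most ~100/s.
        BackoffPolicy policy = new BackoffPolicy(1, 10);
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt + " -> wait " + policy.nextDelayMs(attempt) + " ms");
        }
    }
}
```

The point is not the exact policy: any bounded delay between retries (or rate-limited logging of the retry failure) would have kept the OOM visible in the rotated logs.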

> Stuck __consumer_offsets partition (kafka streams app)
> ------------------------------------------------------
>
>                 Key: KAFKA-19593
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19593
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, streams
>    Affects Versions: 4.0.0
>            Reporter: Matej Pucihar
>            Priority: Major
>              Labels: kafka-streams
>
> h3. Problem Summary
> My Kafka Streams application cannot move its {{state_store}} from 
> {{STARTING}} to {{RUNNING}}.
> I'm using a *Strimzi Kafka cluster* with:
>  * 3 *controller nodes*
>  * 4 *broker nodes*
> h3. Observations
> h4. Partition {{__consumer_offsets-35}} is {*}stuck{*}.
> From AKHQ, partition details:
>  * *Broker 10* is the *leader* of {{__consumer_offsets-35}}
>  * There are *no interesting logs* on broker 10
>  * However, logs are *spamming every 10ms* from broker 11 (a {*}replica{*}):
> 2025-08-11 04:05:50 INFO  [TxnMarkerSenderThread-11] 
> TransactionMarkerRequestCompletionHandler:66 
> [Transaction Marker Request Completion Handler 10]: Sending 
> irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-4's 
> transaction marker for partition __consumer_offsets-35 has failed with error 
> org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with 
> current coordinator epoch 38
> h4. Brokers 20 and 21 — neither leaders nor replicas — also spamming the same 
> error:
> *Broker 20:*
> 2025-08-11 04:39:45 INFO  [TxnMarkerSenderThread-20] 
> TransactionMarkerRequestCompletionHandler:66 
> Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-3's 
> transaction marker for partition __consumer_offsets-35 has failed with error 
> org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with 
> current coordinator epoch 54
>  
> *Broker 21:*
> 2025-08-11 04:39:58 INFO  [TxnMarkerSenderThread-21] 
> TransactionMarkerRequestCompletionHandler:66 
> Sending irm_r_sbm2_web-backend-web-1-6cad35c7-2be9-4ed2-9849-9c059cc8c409-2's 
> transaction marker for partition __consumer_offsets-35 has failed with error 
> org.apache.kafka.common.errors.NotLeaderOrFollowerException, retrying with 
> current coordinator epoch 28
>  
> ----
> h3. Kafka Streams App Behavior
> Logs from the Kafka Streams app (at debug level) repeat continuously. The 
> {{state_store}} *never transitions* from {{STARTING}} to {{RUNNING}}.
> Key repeated logs (debug log level):
>  * Polling main consumer repeatedly
>  * SASL/SCRAM authentication succeeds
>  * 0 records fetched
>  * 0 records processed
>  * Punctuators run, but nothing gets committed
>  * Fails to commit due to {*}rebalance in progress{*}, retrying…
> ----
> h3. Workarounds Considered
> The *only thing that temporarily resolves the issue* is:
>  * Physically deleting the partition files for {{__consumer_offsets-35}} from 
> both the leader and replica brokers
> Other drastic options:
>  * Deleting the entire {{__consumer_offsets}} topic
>  * Re-creating the entire Kafka cluster
> ----
> h3. Additional Info
>  * I cannot reproduce this in a *clean git project*
>  * The issue is isolated to a {*}"corrupt" cluster{*}, which is still 
> available for inspection
>  * This problem has occurred *4 times* in the *past month*
>  * It *started happening after upgrading from Strimzi 3.9 to 4.0*
> I'm using Quarkus (kafka-streams version 4.0.0) with the default 
> configuration; the only config worth mentioning is that I'm using the 
> exactly_once_v2 processing guarantee.
> ----
> h3. Help Needed
> I'm hoping someone can {*}make sense of this issue{*}.
> Please feel free to *reach out.*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
