[ https://issues.apache.org/jira/browse/KAFKA-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643642#comment-17643642 ]

Matthias J. Sax commented on KAFKA-14440:
-----------------------------------------

What you observe is by-design behavior (though the design is not ideal...). Note that 
the local checkpoint files only contain metadata. For EOS they are not 
updated regularly, but are read on startup, deleted afterwards, and 
only written again on a clean stop.
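
To illustrate the "clean stop" part: because the checkpoint file is only written during a 
clean shutdown, making sure KafkaStreams#close() runs before the JVM exits is what allows 
the next startup to reuse the local state. A minimal sketch (application id, broker address, 
and topic names are placeholders, not taken from this ticket):

{code:java}
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class CleanShutdownSketch {

    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-eos-app");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.ByteArraySerde.class);
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArraySerde.class);
        // exactly_once_v2 used here for illustration; the older exactly_once
        // behaves the same with respect to checkpointing
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // placeholder topology

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Under EOS the checkpoint file is written as part of a clean close().
        // If the JVM is SIGKILLed before close() completes, no checkpoint exists
        // on restart and the task state is wiped and re-bootstrapped.
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> streams.close(Duration.ofSeconds(30))));

        streams.start();
    }
}
{code}

In a k8s deployment the pod's termination grace period must be long enough for close() to 
finish, otherwise the SIGKILL case described in the report below applies.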

It's a known issue, but not a bug, and it's tracked via 
https://issues.apache.org/jira/browse/KAFKA-12549, so I am closing this ticket as 
a duplicate. Note: there is already WIP to address KAFKA-12549.

> Local state wipeout with EOS
> ----------------------------
>
>                 Key: KAFKA-14440
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14440
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.2.3
>            Reporter: Abdullah alkhawatrah
>            Priority: Major
>         Attachments: Screenshot 2022-12-02 at 09.26.27.png
>
>
> Hey,
> I have a Kafka Streams service, running in a k8s cluster, that aggregates events 
> from multiple input topics. The topology has multiple FKJs (foreign-key joins). The input 
> topics had around 7 billion events when the service was started from 
> `earliest`.
> The service has EOS enabled and 
> {code:java}
> transaction.timeout.ms: 600000{code}
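> In Streams configuration terms this corresponds roughly to the following (a sketch, not the 
> actual service config; the EOS variant is assumed to be exactly_once_v2):
> {code:java}
> import java.util.Properties;
> 
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.streams.StreamsConfig;
> 
> public class EosConfigSketch {
>     static Properties eosProperties() {
>         final Properties props = new Properties();
>         // exactly-once processing
>         props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
>         // transaction.timeout.ms is a producer config, forwarded via the producer prefix
>         props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), 600000);
>         return props;
>     }
> }
> {code}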
> The problem I am having is with frequent local state wipe-outs, which lead to 
> very long rebalances. As can be seen from the attached images, local disk 
> sizes go to ~0 very often. These wipe-outs appear to be part of the EOS guarantee, 
> based on this log message: 
> {code:java}
> State store transfer-store did not find checkpoint offsets while stores are 
> not empty, since under EOS it has the risk of getting uncommitted data in 
> stores we have to treat it as a task corruption error and wipe out the local 
> state of task 1_8 before re-bootstrapping{code}
>  
> I noticed that this happens as a result of one of the following:
>  * The process gets a SIGKILL when it runs out of memory or fails to shut down 
> gracefully on pod rotation, for example. This explains the missing local 
> checkpoint file, but I thought local checkpoint updates were frequent, so I 
> expected only part of the state to be reset, not the whole local state.
>  * Although we have a long transaction timeout configured, the message below appears 
> many times in the logs, after which Kafka Streams gets into an error state (see the 
> handler sketch below). On startup, the local checkpoint file is not found:
> {code:java}
> Transiting to abortable error state due to 
> org.apache.kafka.common.errors.InvalidProducerEpochException: Producer 
> attempted to produce with an old epoch.{code}
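> (The handler sketch mentioned above: since Kafka Streams 2.8 an uncaught exception handler 
> can be registered so that a failed stream thread is replaced instead of the whole client 
> transitioning to ERROR. Shown only as an illustration; it does not by itself prevent the 
> state wipe-out. `streams` is assumed to be the KafkaStreams instance.)
> {code:java}
> import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse;
> 
> // Register before streams.start(): replace the failed thread instead of
> // letting the whole client go to ERROR state.
> streams.setUncaughtExceptionHandler(exception -> StreamThreadExceptionResponse.REPLACE_THREAD);
> {code}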
> The service has 10 instances, all showing the same behaviour. The issue 
> disappears when EOS is disabled.
> The Kafka cluster runs Kafka 2.6, with a minimum ISR of 3.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
