[
https://issues.apache.org/jira/browse/KAFKA-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias J. Sax resolved KAFKA-14440.
-------------------------------------
Resolution: Duplicate
> Local state wipeout with EOS
> ----------------------------
>
> Key: KAFKA-14440
> URL: https://issues.apache.org/jira/browse/KAFKA-14440
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 3.2.3
> Reporter: Abdullah alkhawatrah
> Priority: Major
> Attachments: Screenshot 2022-12-02 at 09.26.27.png
>
>
> Hey,
> I have a Kafka Streams service (running in a k8s cluster) that aggregates
> events from multiple input topics. The topology has multiple foreign-key
> joins (FKJs). The input topics held around 7 billion events when the
> service was started from `earliest`.
> The service has EOS enabled and
> {code:java}
> transaction.timeout.ms: 600000{code}
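For context, a minimal sketch of how such a configuration might look in a Kafka Streams application. The property names are real Kafka configs, but the application id, bootstrap address, and the choice of `exactly_once_v2` are assumptions — the report only says "EOS enabled" without naming the guarantee value:

```java
import java.util.Properties;

public class EosConfigSketch {
    public static Properties streamsConfig() {
        Properties props = new Properties();
        props.put("application.id", "transfer-aggregator"); // hypothetical name
        props.put("bootstrap.servers", "kafka:9092");       // hypothetical address
        // EOS: "exactly_once_v2" requires brokers >= 2.5; the reporter's
        // cluster runs Kafka 2.6, so it would qualify.
        props.put("processing.guarantee", "exactly_once_v2");
        // The 10-minute transaction timeout from the report, forwarded to the
        // embedded producer via Kafka Streams' "producer." config prefix.
        props.put("producer.transaction.timeout.ms", "600000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsConfig().getProperty("processing.guarantee"));
    }
}
```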
> The problem I am having is frequent local state wipe-outs, which lead to
> very long rebalances. As can be seen from the attached images, local disk
> sizes drop to ~0 very often. These wipe-outs are part of the EOS guarantee,
> based on this log message:
> {code:java}
> State store transfer-store did not find checkpoint offsets while stores are
> not empty, since under EOS it has the risk of getting uncommitted data in
> stores we have to treat it as a task corruption error and wipe out the local
> state of task 1_8 before re-bootstrapping{code}
>
> I noticed that this happens as a result of one of the following:
> * The process gets a SIGKILL when running out of memory, or on failure to
> shut down gracefully (on pod rotation, for example). This explains the
> missing local checkpoint file, but I thought local checkpoint updates were
> frequent, so I expected only part of the state to be reset, not the whole
> local state.
> * Although we have a long transaction timeout configured, the following
> appears many times in the logs, after which Kafka Streams enters an error
> state. On startup, the local checkpoint file is not found:
> {code:java}
> Transiting to abortable error state due to
> org.apache.kafka.common.errors.InvalidProducerEpochException: Producer
> attempted to produce with an old epoch.{code}
> The service has 10 instances, all showing the same behaviour. The issue
> disappears when EOS is disabled.
> The Kafka cluster runs Kafka 2.6 with a minimum ISR of 3.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)