[jira] [Updated] (KAFKA-16017) Checkpointed offset is incorrect when task is revived and restoring

Bruno Cadonna (Jira) Fri, 15 Dec 2023 06:39:22 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bruno Cadonna updated KAFKA-16017:
----------------------------------
    Description: 
Streams checkpoints the wrong offset when a task is revived after a 
{{TaskCorruptedException}} and the task is then migrated to another stream 
thread during restoration.

This might happen in a situation like the following if the Streams application 
runs under EOS:

1. Streams encounters a Network error which triggers a 
{{TaskCorruptedException}}
2. The task that encountered the exception is closed dirty and revived. The 
state store directory is wiped out and a rebalance is triggered.
3. Until the sync of the rebalance is received the revived task is restoring.
4. When the sync is received the revived task is revoked and a new rebalance is 
triggered. During the revocation the task is closed cleanly and a checkpoint 
file is written.
5. With the next rebalance the task moves back to stream thread from which it 
was revoked, read the checkpoint and starts restoring. (I might be enough if 
the task moves to a stream thread on the same Streams client that shares the 
same state directory).
6. The state of the task misses some records

To mitigate the issue one can restart the the stream thread and delete of the 
state on disk. After that the state restores completely from the changelog 
topic and the state does not miss any records anymore.

The root cause is that the checkpoint that is written in step 4 contains the 
offset of the 

https://github.com/cadonna/kafka/tree/KAFKA-16017

  was:Streams checkpoints the wrong offset when a task is revived after a 
{{TaskCorruptedException}} and the task is then migrated to another stream 
thread during restoration.


> Checkpointed offset is incorrect when task is revived and restoring 
> --------------------------------------------------------------------
>
>                 Key: KAFKA-16017
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16017
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.3.1
>            Reporter: Bruno Cadonna
>            Priority: Major
>
> Streams checkpoints the wrong offset when a task is revived after a 
> {{TaskCorruptedException}} and the task is then migrated to another stream 
> thread during restoration.
> This might happen in a situation like the following if the Streams 
> application runs under EOS:
> 1. Streams encounters a Network error which triggers a 
> {{TaskCorruptedException}}
> 2. The task that encountered the exception is closed dirty and revived. The 
> state store directory is wiped out and a rebalance is triggered.
> 3. Until the sync of the rebalance is received the revived task is restoring.
> 4. When the sync is received the revived task is revoked and a new rebalance 
> is triggered. During the revocation the task is closed cleanly and a 
> checkpoint file is written.
> 5. With the next rebalance the task moves back to stream thread from which it 
> was revoked, read the checkpoint and starts restoring. (I might be enough if 
> the task moves to a stream thread on the same Streams client that shares the 
> same state directory).
> 6. The state of the task misses some records
> To mitigate the issue one can restart the the stream thread and delete of the 
> state on disk. After that the state restores completely from the changelog 
> topic and the state does not miss any records anymore.
> The root cause is that the checkpoint that is written in step 4 contains the 
> offset of the 
> https://github.com/cadonna/kafka/tree/KAFKA-16017



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (KAFKA-16017) Checkpointed offset is incorrect when task is revived and restoring

Reply via email to