[jira] [Comment Edited] (FLINK-21351) Incremental checkpoint data would be lost once a non-stop savepoint completed

Yun Tang (Jira) Wed, 10 Feb 2021 21:57:07 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282867#comment-17282867
 ]


Yun Tang edited comment on FLINK-21351 at 2/11/21, 5:56 AM:
------------------------------------------------------------

[~dwysakowicz] Yes. Once a savepoint completes, the previously added 
incremental checkpoint is subsumed. If the job is cancelled and the savepoint 
is used to restore the next run, everything is fine. However, if the job keeps 
running and still depends on the checkpoint mechanism for failover, this 
causes unrecoverable data loss.



> Incremental checkpoint data would be lost once a non-stop savepoint completed
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-21351
>                 URL: https://issues.apache.org/jira/browse/FLINK-21351
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.3, 1.12.1, 1.13.0
>            Reporter: Yun Tang
>            Priority: Blocker
>             Fix For: 1.11.4, 1.12.2, 1.13.0
>
>
> FLINK-10354 counts a savepoint as a retained checkpoint so that a job can 
> fail over from the latest position. I think this is reasonable; however, the 
> current implementation causes incremental checkpoint data to be lost 
> immediately once a non-stop savepoint completes.
> The current general flow for incremental checkpoints: once a newer checkpoint 
> completes, it is added to the checkpoint store. If the number of completed 
> checkpoints exceeds the max retained limit, the oldest one is subsumed. This 
> decrements the reference count of its incremental data by one, and the data 
> is deleted once the reference count reaches zero. Because we always register 
> the newer checkpoint before unregistering the older one, this flow works as 
> expected.
> However, if a non-stop savepoint (an intermediate, manually triggered 
> savepoint) completes, it is also added to the checkpoint store and subsumes 
> the previously added checkpoint (in the default case of retaining one 
> checkpoint). This unregisters the older checkpoint without a newer checkpoint 
> having been registered, leading to data loss.
> Thanks to [~banmoy] for reporting this problem first.
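
The register-then-unregister ordering described in the issue can be sketched 
with a toy reference-counting model. This is only an illustration under stated 
assumptions, not Flink's actual code: the class CheckpointStoreSketch, its 
methods, and the file names "sst-1" and "sst-2" are all hypothetical, and the 
savepoint is modelled as referencing no shared incremental files.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of a checkpoint store with ref-counted shared files (hypothetical, not Flink's classes). */
class CheckpointStoreSketch {
    private final Map<String, Integer> refCounts = new HashMap<>(); // shared file -> reference count
    private final Set<String> deleted = new HashSet<>();            // files whose count reached zero
    private final Deque<List<String>> retained = new ArrayDeque<>(); // retained snapshots, oldest first
    private final int maxRetained;

    CheckpointStoreSketch(int maxRetained) {
        this.maxRetained = maxRetained;
    }

    /**
     * Adds a completed snapshot. An incremental checkpoint passes the shared
     * files it references; a savepoint is modelled (assumption) as an empty list.
     */
    void add(List<String> sharedFiles) {
        // Register the newer snapshot's references first.
        for (String file : sharedFiles) {
            refCounts.merge(file, 1, Integer::sum);
        }
        retained.addLast(sharedFiles);

        // Then subsume the oldest snapshots beyond the retention limit.
        while (retained.size() > maxRetained) {
            for (String file : retained.removeFirst()) {
                if (refCounts.merge(file, -1, Integer::sum) == 0) {
                    refCounts.remove(file);
                    deleted.add(file); // stands in for deleting the file from storage
                }
            }
        }
    }

    boolean isDeleted(String file) {
        return deleted.contains(file);
    }

    /** Replays the scenario from the issue with one retained checkpoint. */
    static boolean savepointDropsSharedFiles() {
        CheckpointStoreSketch store = new CheckpointStoreSketch(1);
        store.add(Arrays.asList("sst-1"));          // chk-1
        store.add(Arrays.asList("sst-1", "sst-2")); // chk-2 re-registers sst-1, then chk-1 is subsumed
        boolean safeBeforeSavepoint = !store.isDeleted("sst-1");
        store.add(Collections.emptyList());         // non-stop savepoint subsumes chk-2
        return safeBeforeSavepoint && store.isDeleted("sst-1") && store.isDeleted("sst-2");
    }
}
```

In this model, a later incremental checkpoint that still expects "sst-1" and 
"sst-2" to exist would point at deleted files, which corresponds to the 
unrecoverable loss described above for a job that keeps running after the 
savepoint.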



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
