[jira] [Created] (FLINK-21351) Incremental checkpoint data would be lost once a non-stop savepoint completed

Yun Tang (Jira) Wed, 10 Feb 2021 07:17:05 -0800

Yun Tang created FLINK-21351:
--------------------------------

             Summary: Incremental checkpoint data would be lost once a non-stop 
savepoint completed
                 Key: FLINK-21351
                 URL: https://issues.apache.org/jira/browse/FLINK-21351
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.12.1, 1.11.3, 1.13.0
            Reporter: Yun Tang
             Fix For: 1.11.4, 1.12.2, 1.13.0



FLINK-10354 made savepoints count as retained checkpoints so that a job can 
fail over from the latest position. I think that change is reasonable; 
however, the current implementation causes incremental checkpoint data to be 
lost immediately once a non-stop savepoint completes.

The current general flow for incremental checkpoints: once a newer checkpoint 
completes, it is added to the checkpoint store. If the number of completed 
checkpoints then exceeds the max retained limit, the oldest one is subsumed. 
Subsuming decrements the reference count of its incremental data by one, and 
the data is deleted once the count reaches zero. Because we always register 
the newer checkpoint before unregistering the older one, this flow works as 
expected.
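
A minimal sketch of the register-then-unregister ordering described above, 
as a toy model only. The class names, method names and file names here are 
hypothetical and are not Flink's actual classes or API; the only behavior 
taken from the description is "register the newer checkpoint, then subsume 
the oldest one beyond the retained limit".

```python
from collections import Counter, deque


class SharedStateRegistry:
    """Toy reference counter for shared incremental files (hypothetical)."""

    def __init__(self):
        self.ref_counts = Counter()  # file name -> number of referencing checkpoints
        self.deleted = []            # files whose count dropped to zero

    def register(self, files):
        for f in files:
            self.ref_counts[f] += 1

    def unregister(self, files):
        for f in files:
            self.ref_counts[f] -= 1
            if self.ref_counts[f] == 0:
                del self.ref_counts[f]
                self.deleted.append(f)


class CheckpointStore:
    """Toy completed-checkpoint store with a max retained limit (hypothetical)."""

    def __init__(self, registry, max_retained=1):
        self.registry = registry
        self.max_retained = max_retained
        self.retained = deque()  # oldest first

    def add(self, checkpoint_id, shared_files):
        # Register the newer checkpoint first ...
        self.registry.register(shared_files)
        self.retained.append((checkpoint_id, shared_files))
        # ... then subsume the oldest checkpoints beyond the limit.
        while len(self.retained) > self.max_retained:
            _, old_files = self.retained.popleft()
            self.registry.unregister(old_files)


registry = SharedStateRegistry()
store = CheckpointStore(registry, max_retained=1)
store.add(1, ["sst-a", "sst-b"])
store.add(2, ["sst-a", "sst-b", "sst-c"])  # reuses sst-a and sst-b
store.add(3, ["sst-b", "sst-c", "sst-d"])  # no longer needs sst-a

print(registry.deleted)             # ['sst-a']
print(sorted(registry.ref_counts))  # ['sst-b', 'sst-c', 'sst-d']
```

In this model a file is only deleted after the newest retained checkpoint 
has stopped referencing it, which is why the ordering is safe when every 
entry in the store is an incremental checkpoint.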

However, if a non-stop savepoint (an intermediate, manually triggered 
savepoint) completes, it is also added to the checkpoint store and subsumes 
the previously added checkpoint (in the default case of retaining one 
checkpoint). This unregisters the older checkpoint without any newer 
checkpoint having been registered, leading to data loss.
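
The failure can be illustrated with a small self-contained toy model. It is 
only a sketch: it assumes that a savepoint contributes no references to the 
shared incremental files (my reading of the report, not verified against the 
code), and all names, including add_to_store and the sst-* file names, are 
hypothetical rather than Flink APIs.

```python
from collections import Counter

ref_counts = Counter()  # shared file -> number of retained entries using it
deleted = set()         # files removed because their count reached zero
retained = []           # contents of the checkpoint store, oldest first
MAX_RETAINED = 1        # default case from the report: retain one checkpoint


def add_to_store(name, shared_files):
    """Register the new entry, then subsume the oldest ones beyond the limit."""
    for f in shared_files:
        ref_counts[f] += 1
    retained.append((name, shared_files))
    while len(retained) > MAX_RETAINED:
        _, old_files = retained.pop(0)
        for f in old_files:
            ref_counts[f] -= 1
            if ref_counts[f] == 0:
                deleted.add(f)


add_to_store("chk-1", ["sst-a", "sst-b"])
# Assumption: the savepoint registers no shared incremental files.
add_to_store("savepoint-2", [])

# The next incremental checkpoint would still want to reuse chk-1's files.
wanted_by_chk_3 = ["sst-a", "sst-b", "sst-c"]
missing = [f for f in wanted_by_chk_3 if f in deleted]

print(sorted(deleted))  # ['sst-a', 'sst-b']
print(missing)          # ['sst-a', 'sst-b']
```

Here the savepoint subsumes chk-1 before any newer incremental checkpoint 
has registered the same files, so their counts drop to zero and they are 
deleted while a later checkpoint still depends on them.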



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
