[jira] [Updated] (FLINK-9661) TTL state should support to do time shift after restoring from checkpoint (savepoint).

Flink Jira Bot (Jira) Thu, 22 Apr 2021 07:03:22 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Flink Jira Bot updated FLINK-9661:
----------------------------------
    Labels: stale-major  (was: )

> TTL state should support to do time shift after restoring from checkpoint 
> (savepoint).
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-9661
>                 URL: https://issues.apache.org/jira/browse/FLINK-9661
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.6.0
>            Reporter: Sihua Zhou
>            Priority: Major
>              Labels: stale-major
>
> The initial version of TTL state appends the expiration timestamp to the 
> state record and, when the state is accessed, checks the condition 
> {{expired_timestamp <= current_time}}. If it is true, the record is expired; 
> otherwise it is still alive. This works fine in most cases, but in some cases 
> we need to do a time shift, otherwise it may cause unexpected results when 
> using ProcessingTime. I roughly describe two cases as follows.
> - When restoring the job from a savepoint
> For example, the user sets the TTL of the state to 2h. If they trigger a 
> savepoint and restore the job from it more than 2h later (perhaps something 
> prevented them from restoring the job quickly), then all of the restored 
> job's previous state data is expired.
> - When the job spends a long time recovering from a failure
> For example, many jobs run on a YARN session cluster that is configured to 
> store checkpoint data on the DFS. Unfortunately, the DFS hits a strange 
> problem that makes the jobs on the cluster loop in 
> recovery-fail-recovery-fail... The devs spend some time fixing the DFS issue 
> and the jobs start working properly again, but if 
> "{{system down time >= TTL}}", then the jobs' previous state data will be 
> expired.
> To avoid the problems above, we need to do a time shift after the job 
> recovers from a checkpoint or savepoint. A possible approach is outlined in 
> [6186|https://github.com/apache/flink/pull/6186].
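
The expiry check and the proposed time shift can be illustrated with a minimal, self-contained sketch. It is not Flink's TTL implementation and not necessarily the approach taken in the linked pull request. The class and method names (TtlTimeShiftSketch, computeShift, shiftExpiry) are hypothetical, and the sketch assumes that the time at which the snapshot was taken is available on restore.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not Flink's TTL state implementation. */
public class TtlTimeShiftSketch {

    /** Expiry check from the issue: a record is expired when expired_timestamp <= current_time. */
    public static boolean isExpired(long expiredTimestamp, long currentTime) {
        return expiredTimestamp <= currentTime;
    }

    /**
     * Assumed definition of the shift: the time the job was down, i.e. the gap
     * between taking the snapshot and restoring from it (never negative).
     */
    public static long computeShift(long snapshotTime, long restoreTime) {
        return Math.max(0L, restoreTime - snapshotTime);
    }

    /** Returns a copy of the expiry timestamps with every entry moved forward by the shift. */
    public static Map<String, Long> shiftExpiry(Map<String, Long> expiryByKey, long shift) {
        Map<String, Long> shifted = new HashMap<>();
        for (Map.Entry<String, Long> entry : expiryByKey.entrySet()) {
            shifted.put(entry.getKey(), entry.getValue() + shift);
        }
        return shifted;
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;
        long ttl = 2 * hour;
        long snapshotTime = 10 * hour;
        long restoreTime = snapshotTime + 3 * hour; // down time (3h) >= TTL (2h)

        // A record written 1h before the snapshot, so it had 1h left to live at snapshot time.
        Map<String, Long> expiry = new HashMap<>();
        expiry.put("user-1", snapshotTime - hour + ttl);

        // Without a shift the record is already expired at restore time.
        System.out.println(isExpired(expiry.get("user-1"), restoreTime)); // prints "true"

        // With the shift the record keeps the 1h of lifetime it had at snapshot time.
        Map<String, Long> shifted = shiftExpiry(expiry, computeShift(snapshotTime, restoreTime));
        System.out.println(isExpired(shifted.get("user-1"), restoreTime)); // prints "false"
    }
}
```

In this sketch the shift equals the down time, so each record resumes with exactly the remaining lifetime it had when the snapshot was taken. Whether the shift should be applied eagerly to every record on restore or lazily at access time is a design choice this sketch does not address.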



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
