[
https://issues.apache.org/jira/browse/FLINK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-9661:
----------------------------------
Labels: stale-major (was: )
> TTL state should support to do time shift after restoring from checkpoint
> (savepoint).
> --------------------------------------------------------------------------------------
>
> Key: FLINK-9661
> URL: https://issues.apache.org/jira/browse/FLINK-9661
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.6.0
> Reporter: Sihua Zhou
> Priority: Major
> Labels: stale-major
>
> The initial version of the TTL-state appends the expired timestamp along the
> state record, and check the expired timestamp with the condition
> {{expired_timestamp <= current_time}} when accessing the state, if it is true
> then the record is expired, otherwise it is still alive. This could works
> pretty fine in the most cases, but in some case, we need to do time shift,
> otherwise it may cause some unexpected result when using the ProccessTime, I
> roughly describe two case as follow.
> - when restoring the job from the savepoint
> For example, the user set the TTL to 2h for the state, if he trigger a
> savepoint and restore the job from the savepoint after 2h(maybe some reason
> that delay he to restore the job quickly), then the restored job's previous
> state data are all expired.
> - when the job spend a long time to recover from a failure
> For example, there are many jobs running on a yarn session cluster, and the
> cluster configured to use the DFS to store the checkpoint data, but
> unfortunately, the DFS meet a strange problem which makes the jobs on the
> cluster begin to loop in recovery-fail-recovery-fail... the devs spend some
> time to address the issue of DFS and the jobs start working properly, but if
> the "{{system down time >= TTL}}" then the job's previous state data will be
> expired in this case.
> To avoid the problems as above, we need to do time shift after the job
> recovering from checkpoint & savepoint. A possible approach is outlined in
> [6186|https://github.com/apache/flink/pull/6186].
--
This message was sent by Atlassian Jira
(v8.3.4#803005)