[
https://issues.apache.org/jira/browse/FLINK-20872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yun Tang closed FLINK-20872.
----------------------------
Resolution: Won't Do
> Job resume from history savepoint when failover if checkpoint is disabled
> -------------------------------------------------------------------------
>
> Key: FLINK-20872
> URL: https://issues.apache.org/jira/browse/FLINK-20872
> Project: Flink
> Issue Type: Improvement
> Affects Versions: 1.10.0, 1.12.0
> Reporter: Liu
> Priority: Minor
>
> I have a long-running job with checkpointing disabled and a restartStrategy
> configured. I once upgraded the job through a savepoint. A day later, the job
> failed and restarted automatically, but it resumed from that earlier
> savepoint, so the job lagged heavily.
>
> I checked the code and found that the job first tries to resume from a
> checkpoint and then falls back to the savepoint.
> {code:java}
> if (checkpointCoordinator != null) {
>     // check whether we find a valid checkpoint
>     if (!checkpointCoordinator.restoreInitialCheckpointIfPresent(
>             new HashSet<>(newExecutionGraph.getAllVertices().values()))) {
>         // check whether we can restore from a savepoint
>         tryRestoreExecutionGraphFromSavepoint(
>                 newExecutionGraph, jobGraph.getSavepointRestoreSettings());
>     }
> }
> {code}
> For a job with checkpointing disabled, internal failover should not resume
> from a previous savepoint, especially one taken long ago. In this
> situation, state loss is acceptable but lag is not.
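
The difference between the reported and the requested behavior can be sketched as a small decision function. This is only an illustration of the idea in the description, not Flink code: the class RestorePolicySketch, the enum RestoreSource, both method names, and the initialSubmission flag are hypothetical and do not exist in Flink. The sketch also models only the inner branch of the quoted snippet and ignores the checkpointCoordinator null check.

```java
// Hypothetical sketch: none of these names are part of Flink's API.
final class RestorePolicySketch {

    /** Where the job state would be restored from. */
    enum RestoreSource { CHECKPOINT, SAVEPOINT, NONE }

    /**
     * Behavior described in the report: whenever no checkpoint is found,
     * fall back to the configured savepoint, even on an internal failover.
     */
    static RestoreSource reportedBehavior(
            boolean checkpointPresent, boolean savepointConfigured) {
        if (checkpointPresent) {
            return RestoreSource.CHECKPOINT;
        }
        return savepointConfigured ? RestoreSource.SAVEPOINT : RestoreSource.NONE;
    }

    /**
     * Behavior requested in the report: the savepoint is only used for the
     * initial submission; a later failover without a checkpoint starts with
     * empty state instead of replaying from an old savepoint.
     */
    static RestoreSource requestedBehavior(
            boolean checkpointPresent,
            boolean savepointConfigured,
            boolean initialSubmission) {
        if (checkpointPresent) {
            return RestoreSource.CHECKPOINT;
        }
        if (savepointConfigured && initialSubmission) {
            return RestoreSource.SAVEPOINT;
        }
        return RestoreSource.NONE;
    }

    public static void main(String[] args) {
        // Failover of a job that was started from a savepoint and has no checkpoint.
        System.out.println("reported:  " + reportedBehavior(false, true));
        System.out.println("requested: " + requestedBehavior(false, true, false));
    }
}
```

Under these assumptions, the failover case from the description (no checkpoint, savepoint configured, not the initial submission) yields SAVEPOINT with the reported logic and NONE with the requested one, which is the trade of state loss for no lag that the reporter asks for.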