Gyula Fora created FLINK-12144:
----------------------------------
Summary: Option to prefer checkpoints on recovery
Key: FLINK-12144
URL: https://issues.apache.org/jira/browse/FLINK-12144
Project: Flink
Issue Type: New Feature
Components: Runtime / Checkpointing
Reporter: Gyula Fora
When a streaming job fails the getLatestCheckpoint() of the CheckpointStore is
used to determine which checkpoint or savepoint is going to be used for
recovery.
This behaviour is perfectly fine for jobs with relatively small states or where
there are no strong SLAs but it some cases it can be problematic.
For jobs with a very large state size, the difference between recovery times
from savepoints and checkpoints can be substantial to the point where it might
break a use-case. So we would like to avoid ever recovering from a savepoint if
a not too old checkpoint is also readily available.
This cannot be avoided right now if a job fails after we took a savepoint maybe
for backup purposes (maybe it is scheduled multiple times a day).
I suggest we add a configuration option to allow the job to fall back to an
earlier checkpoint (within maybe a certain age limit) even if there is a newer
savepoint available to avoid lengthy downtimes.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)