[jira] [Commented] (FLINK-12619) Support TERMINATE/SUSPEND Job with Checkpoint

Aljoscha Krettek (JIRA) Thu, 06 Jun 2019 04:45:14 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857562#comment-16857562
 ]


Aljoscha Krettek commented on FLINK-12619:
------------------------------------------

Hi,

I had a brief discussion with Stephan that helped me sort my thoughts on the 
broader topics of checkpoints, savepoints, binary formats, user-triggered 
checkpoints, and periodic savepoints. I’ll try to summarise my stance on this 
and also comment with the same message on the other relevant Jira Issues and 
threads.

For reference, the relevant FLIP and Jira issues are these:

- 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-41%3A+Unify+Keyed+State+Snapshot+Binary+Format+for+Savepoints:
 Unified Savepoint Format
- FLINK-12619: Add support for stop-with-checkpoint
- FLINK-6755: User-triggered checkpoints
- FLINK-4620: Automatically creating savepoints
- FLINK-4511: Schedule periodic savepoints

There are roughly two different dimensions in the topic of 
savepoints/checkpoints (I’ll use snapshot as the generic term for both):
 1) who controls the snapshot
 2) what’s the (binary) format of the snapshot

For 1), we currently have checkpoints and savepoints. Checkpoints are created 
by the system for fault tolerance. They are managed by the system and the 
system is free to discard them when it sees fit. Savepoints are in the control 
of the user. A user can choose to create a save point, they can delete them, 
they can restore from them at will. The system will not clean up savepoints. We 
should try and keep this separation and not muddle the two concepts.

For 2), we currently have various different formats between the different state 
backends and also for the same backend. I.e. RocksDB can do full or incremental 
snapshots, local snapshots, and probably more.

FLIP-41 aims at introducing a unified “savepoint" format that is 
interchangeable between the different state backends. In light of the above 
points, we should say that FLIP-41 aims to introduce a canonical format that is 
interchangeable between different backends. This doesn’t mean that we should 
tie this format strictly to savepoints, though. For performance reasons, users 
might choose to do savepoints that use one of the optimised formats that the 
backends offer, for example incremental snapshots. Or they might choose to use 
the canonical format for regular checkpoints so that they can always switch 
between backends using periodically created externalised checkpoints.

The motivation behind FLINK-12619 is to have a more lightweight alternative for 
stop-with-savepoint, for example using the incremental snapshot format that 
RocksDB has. With the above in mind, however, this becomes “Add support for 
choosing the snapshot format for stop-with-savepoint”. It should not be 
stop-with-checkpoint, because checkpoints are something that the system manages 
and not something that the user should trigger. The same is true for 
FLINK-6755, the motivation is the same I think. The change should be called 
“Add support for choosing the snapshot format for savepoints”, however.

For the last two Jira issues mentioned above it should be quite clear what I 
think. I do, however, see a need for potentially different overlapping 
checkpoint periods or intervals. Users might want to have their regular 
checkpoints use an optimised format but they also want to have a “canonical 
format” checkpoint every no and then so that the lineage of incremental 
checkpoints does not become too unwieldy.

Please let me know what you think!

Aljoscha

> Support TERMINATE/SUSPEND Job with Checkpoint
> ---------------------------------------------
>
>                 Key: FLINK-12619
>                 URL: https://issues.apache.org/jira/browse/FLINK-12619
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / State Backends
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Congxian Qiu(klion26)
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Inspired by the idea of FLINK-11458, we propose to support terminate/suspend 
> a job with checkpoint. This improvement cooperates with incremental and 
> external checkpoint features, that if checkpoint is retained and this feature 
> is configured, we will trigger a checkpoint before the job stops. It could 
> accelarate job recovery a lot since:
> 1. No source rewinding required any more.
> 2. It's much faster than taking a savepoint since incremental checkpoint is 
> enabled.
> Please note that conceptually savepoints is different from checkpoint in a 
> similar way that backups are different from recovery logs in traditional 
> database systems. So we suggest using this feature only for job recovery, 
> while stick with FLINK-11458 for the 
> upgrading/cross-cluster-job-migration/state-backend-switch cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (FLINK-12619) Support TERMINATE/SUSPEND Job with Checkpoint

Reply via email to