[jira] [Commented] (FLINK-12619) Support TERMINATE/SUSPEND Job with Checkpoint

Yu Li (JIRA) Thu, 06 Jun 2019 09:58:13 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857891#comment-16857891
 ]


Yu Li commented on FLINK-12619:
-------------------------------

bq.  But users can choose to do a stop-with-savepoint using the optimized 
incremental format because they know that they don't want to switch to a 
different state backend and would like the speed and size benefit of the faster 
format.
Let me confirm: do you mean a) generating the savepoint in the checkpoint 
format, or b) supporting incremental savepoints?

If the former, I don't think it is possible for the time being, since the 
checkpoint format is currently completely different from the savepoint format 
in both the RocksDB and Heap backends. I agree that unifying the checkpoint and 
savepoint formats might be a good direction in the long run, much like 
supporting log-based backup in relational database systems, but I'm not sure 
whether we are discussing such a big plan now.

If the latter, on one hand there is no incremental savepoint format yet and we 
would need to implement one, which might be pretty hard since we plan to unify 
the savepoint format but HeapKeyedStateBackend doesn't support incremental 
checkpoints. On the other hand, incremental savepoints require the user to 
trigger savepoints frequently (otherwise, over time, the "incremental" 
savepoint effectively becomes a "full" one once every key-value pair has been 
updated), which will interfere with the normal checkpoint process.
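
To illustrate the "incremental becomes full" point, here is a toy sketch. It 
is not Flink code: the state is modelled as a plain dict, deletions are 
ignored, and `incremental_delta` is a hypothetical helper that just keeps the 
entries that are new or changed since the previous snapshot.

```python
_MISSING = object()  # sentinel, so a stored None is still compared correctly


def incremental_delta(previous, current):
    """Return the entries of `current` that are new or changed since `previous`.

    Toy model only: deleted keys are not tracked.
    """
    return {k: v for k, v in current.items() if previous.get(k, _MISSING) != v}


base = {"k1": 1, "k2": 2, "k3": 3, "k4": 4}

# Savepoint taken soon after the base one: only one key has changed.
one_update = dict(base, k1=10)
small_delta = incremental_delta(base, one_update)

# Savepoint taken much later: every key has been updated in the meantime.
all_updated = {k: v + 100 for k, v in base.items()}
full_delta = incremental_delta(base, all_updated)
```

In this model `small_delta` holds a single entry, while `full_delta` is the 
same size as the whole state, so the "incremental" savepoint saves nothing 
unless savepoints are triggered often enough.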

bq. The "canonical format" that (in my opinion) FLIP-41 will introduce can be 
used to create savepoints that are compatible between backends. What I'm 
saying, however, is that we should not strictly tie this to only savepoints.
IMHO the current design of FLIP-41 says the opposite. The motivation section 
of the document states: "while for checkpoints it is completely reasonable to 
have state backend specific formats for more efficient snapshots and restores, 
savepoints should be designed with interoperability in mind and allow for 
operational flexibilities such as swapping state backends across restores". 
This indicates that only the savepoint format is to be unified across 
backends, while checkpoints stay flexible and different backends may use 
different formats.

bq. That's why I think efforts such as allowing user-triggered checkpoints 
(which includes a user-triggered stop-with-checkpoint) break that distinction 
between user control and system control
FWIW, I don't think supporting stop-with-checkpoint means the user is 
*controlling* the checkpoint. It doesn't necessarily trigger a checkpoint: if 
there is an ongoing checkpoint, we could have the job wait for it before 
stopping. Even if it does trigger a checkpoint, that only happens as part of a 
job suspend/stop, so what the user really controls is suspending/stopping the 
job, not the checkpoint.

Or, if we really hate this "intrusion", we could make it a configuration 
option: if the user sets it to true, every job stop/suspend action is 
accompanied by a checkpoint, unless it is triggered by the stop-with-savepoint 
command.
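
A rough sketch of the decision logic described in the two paragraphs above. 
Everything here is hypothetical and only for illustration: `StopAction`, 
`plan_stop` and the `checkpoint_on_stop` flag are made-up names, not existing 
Flink APIs or configuration keys.

```python
from enum import Enum


class StopAction(Enum):
    TAKE_SAVEPOINT = "take savepoint, then stop"
    WAIT_FOR_ONGOING_CHECKPOINT = "wait for ongoing checkpoint, then stop"
    TRIGGER_CHECKPOINT = "trigger checkpoint, then stop"
    STOP_ONLY = "stop without a snapshot"


def plan_stop(checkpoint_on_stop, stop_with_savepoint, checkpoint_in_progress):
    """Decide what accompanies a job stop/suspend request.

    checkpoint_on_stop:     hypothetical config flag proposed in this comment
    stop_with_savepoint:    request came from the stop-with-savepoint command
    checkpoint_in_progress: a periodic checkpoint is already running
    """
    if stop_with_savepoint:
        # stop-with-savepoint keeps its own behaviour; the flag does not apply.
        return StopAction.TAKE_SAVEPOINT
    if not checkpoint_on_stop:
        return StopAction.STOP_ONLY
    if checkpoint_in_progress:
        # No new checkpoint is triggered; the job waits for the ongoing one.
        return StopAction.WAIT_FOR_ONGOING_CHECKPOINT
    return StopAction.TRIGGER_CHECKPOINT
```

In every branch the user only asks for a stop/suspend; whether a checkpoint is 
awaited or triggered follows from the configuration and the current 
checkpoint state.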

Please let me know your thoughts sir [~aljoscha], thanks!

> Support TERMINATE/SUSPEND Job with Checkpoint
> ---------------------------------------------
>
>                 Key: FLINK-12619
>                 URL: https://issues.apache.org/jira/browse/FLINK-12619
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / State Backends
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Congxian Qiu(klion26)
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Inspired by the idea of FLINK-11458, we propose to support terminating or 
> suspending a job with a checkpoint. This improvement works together with the 
> incremental and external checkpoint features: if checkpoints are retained and 
> this feature is configured, we will trigger a checkpoint before the job 
> stops. It could accelerate job recovery a lot since:
> 1. No source rewinding is required any more.
> 2. It's much faster than taking a savepoint since incremental checkpointing 
> is enabled.
> Please note that conceptually savepoints are different from checkpoints in a 
> similar way that backups are different from recovery logs in traditional 
> database systems. So we suggest using this feature only for job recovery, 
> while sticking with FLINK-11458 for the 
> upgrading/cross-cluster-job-migration/state-backend-switch cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
