[jira] [Commented] (FLINK-12619) Support TERMINATE/SUSPEND Job with Checkpoint

Yu Li (JIRA) Fri, 07 Jun 2019 08:42:15 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858747#comment-16858747
 ]


Yu Li commented on FLINK-12619:
-------------------------------

Thanks for the clarification about your thoughts [~aljoscha], but I still have 
some questions.

First of all, if we stick to the savepoint solution (regardless of how we 
implement the "optimized" format), how do we resolve the following issue? It 
requires the user to trigger savepoints frequently (otherwise, over time, the 
"incremental" savepoint effectively becomes a "full" one once every key-value 
pair has been updated), which will interfere with the normal system-triggered 
checkpoint process.
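
To illustrate the concern, here is a minimal sketch, not Flink code. It 
assumes a simplified model in which the incremental delta is the set of 
distinct keys updated since the previous snapshot (the RocksDB backend actually 
tracks changed SST files, not individual keys). The class and method names are 
hypothetical.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class IncrementalDeltaSketch {

    /**
     * Returns the number of distinct keys touched between two consecutive
     * snapshots, assuming uniformly random updates over totalKeys keys.
     * In this simplified model, that count is the size of the delta.
     */
    static int deltaSize(int totalKeys, int updatesBetweenSnapshots, long seed) {
        Random random = new Random(seed);
        Set<Integer> dirtyKeys = new HashSet<>();
        for (int i = 0; i < updatesBetweenSnapshots; i++) {
            dirtyKeys.add(random.nextInt(totalKeys));
        }
        return dirtyKeys.size();
    }

    public static void main(String[] args) {
        int totalKeys = 1_000;
        // The longer the gap between snapshots, the closer the delta gets to
        // the full state size.
        for (int updates : new int[] {100, 1_000, 10_000, 100_000}) {
            System.out.printf("updates=%d -> delta=%d of %d keys%n",
                    updates, deltaSize(totalKeys, updates, 42L), totalKeys);
        }
    }
}
```

With few updates between snapshots the delta stays small; once the gap is long 
enough for nearly every key to be touched, the "incremental" snapshot carries 
almost the whole state.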

bq. I mentioned incremental savepoints only as a possible future development... 
I think the solution for that is to allow savepoints to be in various different 
formats... which keeps the clear distinction between checkpoints and savepoints 
but allows an optimized format for the savepoint which is what users want in 
some cases.
Sorry, but I'm a little confused here: if it is not a unified "incremental 
savepoint format", what would this "optimized" or "canonical/unified" format 
be?

bq. My main point is that the distinction between checkpoints and savepoints is 
that the former are system controlled while the latter are user controlled and 
that we should keep that distinction.
I think we could resolve the concern in the following way, wdyt? Introduce a 
configuration option like {{job.stop.with.checkpoint}}; if the user sets it to 
true, every job stop/suspend action will be accompanied by a checkpoint, unless 
it is triggered by the stop-with-savepoint command.
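
A rough sketch of the decision logic I have in mind. This is not existing Flink 
code: the enum and method names are hypothetical, and the key 
{{job.stop.with.checkpoint}} is only the proposal above, assumed here to 
default to false.

```java
import java.util.Map;

public class StopWithCheckpointSketch {

    // Proposed (not yet existing) configuration key.
    static final String STOP_WITH_CHECKPOINT_KEY = "job.stop.with.checkpoint";

    // Hypothetical ways a job can be stopped.
    enum StopTrigger { STOP, SUSPEND, STOP_WITH_SAVEPOINT }

    // Which kind of snapshot, if any, accompanies the stop action.
    enum Snapshot { NONE, CHECKPOINT, SAVEPOINT }

    static Snapshot snapshotOnStop(Map<String, String> conf, StopTrigger trigger) {
        // An explicit stop-with-savepoint stays user controlled and always wins.
        if (trigger == StopTrigger.STOP_WITH_SAVEPOINT) {
            return Snapshot.SAVEPOINT;
        }
        // Otherwise the system takes a checkpoint only if the option is enabled.
        boolean enabled =
                Boolean.parseBoolean(conf.getOrDefault(STOP_WITH_CHECKPOINT_KEY, "false"));
        return enabled ? Snapshot.CHECKPOINT : Snapshot.NONE;
    }

    public static void main(String[] args) {
        Map<String, String> conf =
                java.util.Collections.singletonMap(STOP_WITH_CHECKPOINT_KEY, "true");
        for (StopTrigger trigger : StopTrigger.values()) {
            System.out.println(trigger + " -> " + snapshotOnStop(conf, trigger));
        }
    }
}
```

This way savepoints remain purely user triggered, while the checkpoint taken on 
stop/suspend is still decided by the system based on configuration.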

Thanks.

> Support TERMINATE/SUSPEND Job with Checkpoint
> ---------------------------------------------
>
>                 Key: FLINK-12619
>                 URL: https://issues.apache.org/jira/browse/FLINK-12619
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / State Backends
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Congxian Qiu(klion26)
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Inspired by the idea of FLINK-11458, we propose to support terminating/suspending 
> a job with a checkpoint. This improvement works together with the incremental and 
> external checkpoint features: if checkpoints are retained and this feature 
> is configured, we will trigger a checkpoint before the job stops. It could 
> accelerate job recovery a lot since:
> 1. No source rewinding is required any more.
> 2. It's much faster than taking a savepoint since incremental checkpointing is 
> enabled.
> Please note that conceptually savepoints are different from checkpoints in a 
> similar way that backups are different from recovery logs in traditional 
> database systems. So we suggest using this feature only for job recovery, 
> while sticking with FLINK-11458 for the 
> upgrading/cross-cluster-job-migration/state-backend-switch cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
