[jira] [Commented] (FLINK-12619) Support TERMINATE/SUSPEND Job with Checkpoint

Aljoscha Krettek (JIRA) Fri, 07 Jun 2019 04:56:07 -0700


    [ https://issues.apache.org/jira/browse/FLINK-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858550#comment-16858550 ]


Aljoscha Krettek commented on FLINK-12619:
------------------------------------------

I think there might be some misunderstanding. For the short term, my suggestion 
only means that we slightly adjust how we think about new feature proposals 
like FLIP-41, this feature, and FLINK-6755. I mentioned incremental savepoints 
only as a possible future development.

My main point is that the distinction between checkpoints and savepoints is 
that the former are system controlled while the latter are user controlled, and 
that we should keep that distinction. The motivation for this issue and for 
FLINK-6755 is to have a more lightweight alternative to savepoints. I think 
the solution for that is to allow savepoints to be taken in different formats, 
for example the format that is currently used by checkpoints, which includes 
incremental checkpoints on the RocksDB backend.

For the user, the difference is really just in the command they use. Previously 
they did
{code}
bin/flink stop --withSavepoint hdfs:///path/to/savepoint
{code}

This issue proposes to introduce
{code}
bin/flink stop --withCheckpoint 
{code}

With my suggestion it would be
{code}
bin/flink stop --withSavepoint hdfs:///path/to/savepoint --snapshotFormat canonical|optimized|incremental|whatever
{code}
which keeps the clear distinction between checkpoints and savepoints but allows 
an optimized format for the savepoint, which is what users want in some cases.

For the FLIP-41 effort, this means that the new format is not a "savepoint 
format" but rather a canonical (or unified) format. You could almost do a 
search-and-replace in the FLIP but there are some other changes like specific 
class hierarchies that are suggested in the doc. Savepoints would by default 
use this format so that they are compatible between backends but users can 
choose to do a savepoint in a different format.
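
To make the idea a bit more concrete, here is a minimal sketch of how the 
format could be modeled as an option on the savepoint rather than as a separate 
kind of snapshot. This is only an illustration under assumptions: the 
{{SnapshotFormat}} enum, the {{StopCommandSketch}} class and the 
{{--snapshotFormat}} flag are hypothetical names taken from the suggestion 
above, not existing Flink APIs or CLI options, and the format names are just 
the placeholders from the command line example.

```java
import java.util.Locale;

// Hypothetical: the formats a user-triggered savepoint could be written in.
// None of these types exist in Flink; the names mirror the suggested CLI flag.
enum SnapshotFormat {
    CANONICAL,   // backend-independent format, assumed to be the default
    OPTIMIZED,   // backend-specific format
    INCREMENTAL; // e.g. the incremental checkpoint format of the RocksDB backend

    // Falls back to CANONICAL when no format was given.
    static SnapshotFormat parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return CANONICAL;
        }
        return valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }
}

// Hypothetical: the parsed form of "stop --withSavepoint <path> [--snapshotFormat <format>]".
final class StopCommandSketch {
    final String savepointPath;
    final SnapshotFormat format;

    private StopCommandSketch(String savepointPath, SnapshotFormat format) {
        this.savepointPath = savepointPath;
        this.format = format;
    }

    static StopCommandSketch fromArgs(String[] args) {
        String path = null;
        String format = null;
        for (int i = 0; i < args.length; i++) {
            if ("--withSavepoint".equals(args[i]) && i + 1 < args.length) {
                path = args[++i];
            } else if ("--snapshotFormat".equals(args[i]) && i + 1 < args.length) {
                format = args[++i];
            }
        }
        if (path == null) {
            // The snapshot stays user controlled: the user must say where it goes.
            throw new IllegalArgumentException("--withSavepoint <path> is required");
        }
        return new StopCommandSketch(path, SnapshotFormat.parse(format));
    }

    public static void main(String[] args) {
        StopCommandSketch cmd = fromArgs(new String[] {
            "--withSavepoint", "hdfs:///path/to/savepoint",
            "--snapshotFormat", "incremental"
        });
        System.out.println(cmd.savepointPath + " in format " + cmd.format);
    }
}
```

The point of the sketch is that the snapshot is still a savepoint in every 
case (the user triggers it and picks the target path); only the format 
varies, and leaving the format out gives the canonical one.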

Does this description help?

> Support TERMINATE/SUSPEND Job with Checkpoint
> ---------------------------------------------
>
>                 Key: FLINK-12619
>                 URL: https://issues.apache.org/jira/browse/FLINK-12619
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / State Backends
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Congxian Qiu(klion26)
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Inspired by the idea of FLINK-11458, we propose to support terminating or 
> suspending a job with a checkpoint. This improvement works together with the 
> incremental and external checkpoint features: if checkpoints are retained and 
> this feature is configured, we will trigger a checkpoint before the job stops. 
> It could accelerate job recovery a lot since:
> 1. No source rewinding is required any more.
> 2. It is much faster than taking a savepoint, since incremental checkpointing 
> is enabled.
> Please note that conceptually savepoints are different from checkpoints in a 
> similar way that backups are different from recovery logs in traditional 
> database systems. So we suggest using this feature only for job recovery, 
> while sticking with FLINK-11458 for the 
> upgrading/cross-cluster-job-migration/state-backend-switch cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
