[
https://issues.apache.org/jira/browse/FLINK-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857891#comment-16857891
]
Yu Li commented on FLINK-12619:
-------------------------------
bq. But users can choose to do a stop-with-savepoint using the optimized
incremental format because they know that they don't want to switch to a
different state backend and would like the speed and size benefit of the faster
format.
Let me confirm: do you mean a) generating a savepoint in the checkpoint
format, or b) supporting incremental savepoints?
If the former, I don't think it is possible for the time being, since the
checkpoint format is currently completely different from the savepoint format
in both RocksDB and Heap. I agree it might be a good direction in the long run
to unify the checkpoint and savepoint formats, much like supporting log-based
backup in a relational database system, but I'm not sure we are discussing
such a big plan now.
If the latter, on the one hand there is no incremental savepoint format yet
and we would need to implement one, which might be pretty hard since we plan
to unify the savepoint format but HeapKeyedStateBackend doesn't support
incremental checkpoints. On the other hand, incremental savepoints require the
user to trigger savepoints frequently (otherwise, over time, the "incremental"
savepoint effectively becomes a "full" one once every key-value pair has been
updated), which will interfere with the normal checkpoint process.
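To illustrate the second point, here is a minimal sketch (not Flink code) of
how the delta of an incremental snapshot grows with the number of updates
between two snapshots. The function name and the round-robin update pattern
are assumptions made purely for illustration; real workloads will differ.

```python
def delta_fraction(num_keys, updates_between_snapshots):
    """Fraction of keys an incremental snapshot would have to rewrite.

    Hypothetical model: update i touches key (i % num_keys), i.e. keys are
    updated round-robin. A key is part of the delta if it was updated at
    least once since the previous snapshot.
    """
    touched = set()
    for i in range(updates_between_snapshots):
        touched.add(i % num_keys)
    return len(touched) / num_keys


# Frequent snapshots: only a small part of the state is in the delta.
frequent = delta_fraction(num_keys=1000, updates_between_snapshots=100)

# Rare snapshots: every key was updated, so the delta is the full state.
rare = delta_fraction(num_keys=1000, updates_between_snapshots=5000)
```

Under this model the delta stays small only while the snapshot interval is
short relative to the update rate, which is why infrequent "incremental"
savepoints bring little benefit.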
bq. The "canonical format" that (in my opinion) FLIP-41 will introduce can be
used to create savepoints that are compatible between backends. What I'm
saying, however, is that we should not strictly tie this to only savepoints.
IMHO the current design of FLIP-41 says the opposite. The motivation section
of the document states: "while for checkpoints
it is completely reasonable to have state backend specific formats for more
efficient snapshots and restores, savepoints should be designed with
interoperability in mind and allow for operational flexibilities such as
swapping state backends across restores", which indicates unifying only the
savepoint format across backends while leaving checkpoints flexible and
allowing different backends to have different formats.
bq. That's why I think efforts such as allowing user-triggered checkpoints
(which includes a user-triggered stop-with-checkpoint) break that distinction
between user control and system control
FWIW, I don't think supporting stop-with-checkpoint means the user is
*controlling* the checkpoint. It doesn't necessarily trigger a checkpoint: if
there is an ongoing checkpoint, we could have the job wait for it before
stopping. Even if it does trigger a checkpoint, that only happens together
with a job suspend/stop, so what the user really controls is suspending or
stopping the job, not the checkpoint.
Or, if we really hate this "intrusion", we could make it a configuration
option: if the user sets it to true, every job stop/suspend action will be
accompanied by a checkpoint, unless it was triggered by the
stop-with-savepoint command.
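The decision logic described above could be sketched roughly as follows. This
is only an illustration of the proposal, not existing Flink code; the names
`StopAction`, `plan_stop` and the `checkpoint_on_stop` flag are hypothetical.

```python
from enum import Enum


class StopAction(Enum):
    """Hypothetical outcomes when a job is asked to stop or suspend."""
    NO_SNAPSHOT = "stop without taking any snapshot"
    TAKE_SAVEPOINT = "take a savepoint, then stop"
    WAIT_FOR_PENDING = "wait for the ongoing checkpoint, then stop"
    TRIGGER_CHECKPOINT = "trigger a new checkpoint, then stop"


def plan_stop(checkpoint_on_stop, stop_with_savepoint, checkpoint_in_progress):
    """Pick what to do before stopping the job.

    checkpoint_on_stop: the assumed configuration switch from the proposal.
    stop_with_savepoint: True if the stop came from stop-with-savepoint.
    checkpoint_in_progress: True if a checkpoint is already running.
    """
    if stop_with_savepoint:
        # The savepoint command takes precedence over the checkpoint option.
        return StopAction.TAKE_SAVEPOINT
    if not checkpoint_on_stop:
        return StopAction.NO_SNAPSHOT
    if checkpoint_in_progress:
        # No new checkpoint is triggered; the job just waits for this one.
        return StopAction.WAIT_FOR_PENDING
    return StopAction.TRIGGER_CHECKPOINT
```

In this sketch the user only ever decides to stop the job; whether a
checkpoint is triggered or merely awaited follows from the configuration and
the current checkpoint state.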
Please let me know your thoughts sir [~aljoscha], thanks!
> Support TERMINATE/SUSPEND Job with Checkpoint
> ---------------------------------------------
>
> Key: FLINK-12619
> URL: https://issues.apache.org/jira/browse/FLINK-12619
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / State Backends
> Reporter: Congxian Qiu(klion26)
> Assignee: Congxian Qiu(klion26)
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Inspired by the idea of FLINK-11458, we propose to support terminating or
> suspending a job with a checkpoint. This improvement works together with the
> incremental and external checkpoint features: if the checkpoint is retained
> and this feature is configured, we will trigger a checkpoint before the job
> stops. It could accelerate job recovery a lot since:
> 1. No source rewinding required any more.
> 2. It's much faster than taking a savepoint since incremental checkpoint is
> enabled.
> Please note that conceptually savepoints are different from checkpoints in
> a similar way that backups are different from recovery logs in traditional
> database systems. So we suggest using this feature only for job recovery,
> while sticking with FLINK-11458 for the
> upgrading/cross-cluster-job-migration/state-backend-switch cases.