[
https://issues.apache.org/jira/browse/FLINK-6755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856887#comment-16856887
]
Gyula Fora commented on FLINK-6755:
-----------------------------------
Imagine you are running a production streaming job with a few TB of KV state.
With incremental checkpoints your average checkpoint takes 1-5 minutes. A
savepoint can take anywhere from 10 to 30 minutes or more, and might cost you
a lot of money depending on where you are running the job.
You try to keep a strong SLA and have to redeploy the job (maybe you hit a bug).
At the moment you have 2 options:
- Trigger a savepoint, wait for it to complete, stop the job then restore ->
might take an hour total (actual production numbers)
- Look at the Flink UI and wait for the next checkpoint, hoping that the
checkpoint interval is short enough. Then stop the job, search for the
latest checkpoint and recover the job from it
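The "search for the latest checkpoint" step of the second option can be
sketched roughly as follows. This is only an illustration. It assumes retained
checkpoints on a locally mounted filesystem, laid out as
<checkpoint-dir>/<job-id>/chk-<n>/_metadata as in recent Flink versions. The
function name latest_checkpoint is made up for this example, and HDFS or S3
would need their own client instead of pathlib.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed layout: each retained checkpoint is a directory named chk-<id>,
# and a completed checkpoint contains a _metadata file.
_CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_checkpoint(job_checkpoint_dir: str) -> Optional[Path]:
    """Return the completed chk-<id> directory with the highest id, or None."""
    best_id = -1
    best_path = None
    for entry in Path(job_checkpoint_dir).iterdir():
        match = _CHK_DIR.match(entry.name)
        if match is None or not entry.is_dir():
            continue  # e.g. the shared/ and taskowned/ directories
        # Skip checkpoints that never completed (no metadata written).
        if not (entry / "_metadata").is_file():
            continue
        # Compare ids numerically so that chk-12 ranks above chk-3.
        chk_id = int(match.group(1))
        if chk_id > best_id:
            best_id, best_path = chk_id, entry
    return best_path
```

The returned path is what you would then hand to ./bin/flink run -s when
restoring the job.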
If you have done either of these under pressure in production, I guarantee
that you broke a sweat :D
I think we can risk leaking a bit more of an already user-controlled mechanism.
The user controls the interval, the number of concurrent checkpoints, etc., all
of which are part of the public API. Triggering one manually is not going to
change this by much, in my opinion.
> Allow triggering Checkpoints through command line client
> --------------------------------------------------------
>
> Key: FLINK-6755
> URL: https://issues.apache.org/jira/browse/FLINK-6755
> Project: Flink
> Issue Type: New Feature
> Components: Command Line Client, Runtime / Checkpointing
> Affects Versions: 1.3.0
> Reporter: Gyula Fora
> Assignee: vinoyang
> Priority: Major
>
> The command line client currently only allows triggering (and canceling with)
> Savepoints.
> While this is good if we want to fork or modify the pipelines in a
> non-checkpoint compatible way, now with incremental checkpoints this becomes
> wasteful for simple job restarts/pipeline updates.
> I suggest we add a new command:
> ./bin/flink checkpoint <jobID> [checkpointDirectory]
> and a new flag -c for the cancel command to indicate we want to trigger a
> checkpoint:
> ./bin/flink cancel -c [targetDirectory] <jobID>
> Otherwise this can work similarly to the current savepoint-taking logic; we
> could probably even piggyback on the current messages by adding a boolean
> flag indicating whether it should be a savepoint or a checkpoint.