[
https://issues.apache.org/jira/browse/FLINK-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565708#comment-15565708
]
ASF GitHub Bot commented on FLINK-4717:
---------------------------------------
Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/2609#discussion_r82814608
--- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
@@ -581,6 +581,62 @@ class JobManager(
)
}
+ case CancelJobWithSavepoint(jobId, savepointDirectory) =>
+ try {
+ val targetDirectory = if (savepointDirectory != null) {
+ savepointDirectory
+ } else {
+ defaultSavepointDir
+ }
+
+ log.info(s"Trying to cancel job $jobId with savepoint to $targetDirectory")
+
+ currentJobs.get(jobId) match {
+ case Some((executionGraph, _)) =>
+ // We don't want any checkpoint between the savepoint and cancellation
+ val coord = executionGraph.getCheckpointCoordinator
+ coord.stopCheckpointScheduler()
--- End diff --
I think it's not enough to simply call `stopCheckpointScheduler`. If I'm
not mistaken, the following could happen: `stopCheckpointScheduler` tries to
`cancel` the `currentPeriodicTrigger`. Now assume that the `TimerTask` for the
next checkpoint has already been picked up by the timer thread but has not
yet executed at the moment it is cancelled. `stopCheckpointScheduler` then
returns without that `TimerTask` having completed, so the `TimerTask` can
still trigger a checkpoint even though we've stopped the checkpoint scheduler.
The way to fix this (admittedly academic) corner case is to filter out
outdated `TimerTask` calls in the `CheckpointCoordinator` by attaching a kind
of fencing token to the trigger checkpoint calls.
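A rough sketch of what I mean, not the actual `CheckpointCoordinator` code:
the class `FencedCheckpointScheduler`, the `currentToken` field and the
`triggerCheckpoint(long)` signature are all made up for illustration. Each
periodic `TimerTask` captures the token that was valid when it was scheduled,
and every start or stop bumps the token, so a task that slips past `cancel`
is rejected when it finally runs.

```java
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical stand-in for the coordinator; names are assumptions, not Flink's API. */
final class FencedCheckpointScheduler {

    private final Object lock = new Object();
    private final Timer timer = new Timer("checkpoint-timer", true);

    // Fencing token: incremented on every start and stop of the scheduler.
    private long currentToken = 0L;
    private boolean running = false;
    private TimerTask currentPeriodicTrigger;
    private int triggeredCheckpoints = 0;

    void startCheckpointScheduler(long intervalMillis) {
        synchronized (lock) {
            stopLocked();
            running = true;
            // The task remembers the token that was valid when it was scheduled.
            final long token = ++currentToken;
            currentPeriodicTrigger = new TimerTask() {
                @Override
                public void run() {
                    triggerCheckpoint(token);
                }
            };
            timer.scheduleAtFixedRate(currentPeriodicTrigger, intervalMillis, intervalMillis);
        }
    }

    void stopCheckpointScheduler() {
        synchronized (lock) {
            stopLocked();
        }
    }

    // Must be called while holding the lock.
    private void stopLocked() {
        running = false;
        currentToken++; // invalidates every task scheduled before this point
        if (currentPeriodicTrigger != null) {
            currentPeriodicTrigger.cancel();
            currentPeriodicTrigger = null;
        }
    }

    /** Returns true only if the caller's token is still the current one. */
    boolean triggerCheckpoint(long token) {
        synchronized (lock) {
            if (!running || token != currentToken) {
                return false; // outdated periodic trigger, ignore it
            }
            triggeredCheckpoints++; // a real coordinator would start the checkpoint here
            return true;
        }
    }

    long currentToken() {
        synchronized (lock) {
            return currentToken;
        }
    }

    int triggeredCheckpoints() {
        synchronized (lock) {
            return triggeredCheckpoints;
        }
    }
}
```

The token check and the token bump happen under the same lock, so once
`stopCheckpointScheduler` has returned, no periodic trigger scheduled earlier
can get through. The savepoint taken before cancellation would go through a
separate, unfenced path, since it is triggered explicitly and not by the timer.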
> Naive version of atomic stop signal with savepoint
> --------------------------------------------------
>
> Key: FLINK-4717
> URL: https://issues.apache.org/jira/browse/FLINK-4717
> Project: Flink
> Issue Type: New Feature
> Components: State Backends, Checkpointing
> Affects Versions: 1.2.0
> Reporter: Till Rohrmann
> Priority: Minor
> Fix For: 1.2.0
>
>
> As a first step towards atomic stopping with savepoints we should implement a
> cancel command which prior to cancelling takes a savepoint. Additionally, it
> should turn off the periodic checkpointing so that there won't be checkpoints
> executed between the savepoint and the cancel command.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)