StephanEwen commented on a change in pull request #14186:
URL: https://github.com/apache/flink/pull/14186#discussion_r530326270
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
##########
@@ -1304,12 +1312,34 @@ public boolean restoreLatestCheckpointedStateToAll(
final Set<ExecutionJobVertex> tasks,
final boolean allowNonRestoredState) throws Exception {
- return restoreLatestCheckpointedStateInternal(tasks, true, false, allowNonRestoredState);
+ return restoreLatestCheckpointedStateInternal(
+ tasks,
+ CoordinatorRestore.ALWAYS, // global recovery restores coordinators, or resets them to empty
+ false, // recovery might come before first successful checkpoint
+ allowNonRestoredState);
+ }
+
+ /**
+ * Restores the latest checkpointed state at the beginning of the job execution.
+ * If there is a checkpoint, this method acts like a "global restore"-style
+ * operation where all stateful tasks and coordinators from the given
+ * set of Job Vertices are restored.
+ *
+ * @param tasks Set of job vertices to restore. State for these vertices is
+ * restored via {@link Execution#setInitialState(JobManagerTaskRestore)}.
+ * @return True, if a checkpoint was found and its state was restored, false otherwise.
+ */
+ public boolean restoreInitialCheckpointIfPresent(final Set<ExecutionJobVertex> tasks) throws Exception {
+ return restoreLatestCheckpointedStateInternal(
+ tasks,
+ CoordinatorRestore.ONLY_FOR_EXISTING_CHECKPOINT,
Review comment:
This is an optimization. ExecutionGraph / Scheduler startup calls
"recoverCheckpoint" and is fine if no checkpoint exists. Tasks have not been
scheduled yet at that point, so the call has no effect on tasks.
Coordinators, however, have already been created at this point, so an
unconditional restore puts them through a "double creation": they are created
with the ExecutionGraph and then restored to empty state shortly after.
Some coordinators do quite a bit of work on restore, especially on an empty
restore (which is like a new instantiation), such as connecting to Kafka and
reading all relevant metadata. We want to avoid doing this twice on job
startup, because it leads to a weird experience.
Coordinator failovers are generally expected to be expensive, and they are
also rare: they happen only on JM failover or on an explicit global failover
(not on a regional failover that happens to include all tasks).
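To make the "double creation" concrete, here is a minimal, self-contained sketch. It is not the CheckpointCoordinator code. The enum only mirrors the two constants visible in the diff above (the real enum may have more), and `FakeCoordinator`, `shouldRestoreCoordinator`, and `initsDuringStartup` are hypothetical names introduced for illustration. It counts how often a coordinator does its expensive initialization during job startup under each mode.

```java
import java.util.Optional;

public class CoordinatorRestoreSketch {

    /** Mirrors only the two constants visible in the diff; an assumption, not the full enum. */
    enum CoordinatorRestore { ALWAYS, ONLY_FOR_EXISTING_CHECKPOINT }

    /** Hypothetical stand-in for a coordinator whose initialization is expensive. */
    static final class FakeCoordinator {
        int expensiveInits = 0;

        FakeCoordinator() {
            // First creation, together with the execution graph.
            expensiveInits++;
        }

        void resetToCheckpoint(Optional<byte[]> state) {
            // A restore (even an empty one) repeats the expensive work,
            // e.g. connecting to an external system and reading metadata.
            expensiveInits++;
        }
    }

    /** Assumed decision rule: restore always, or only when a checkpoint exists. */
    static boolean shouldRestoreCoordinator(CoordinatorRestore mode, boolean checkpointExists) {
        return mode == CoordinatorRestore.ALWAYS || checkpointExists;
    }

    /** Simulates job startup and returns the number of expensive initializations. */
    static int initsDuringStartup(CoordinatorRestore mode, Optional<byte[]> latestCheckpoint) {
        FakeCoordinator coordinator = new FakeCoordinator();
        if (shouldRestoreCoordinator(mode, latestCheckpoint.isPresent())) {
            coordinator.resetToCheckpoint(latestCheckpoint);
        }
        return coordinator.expensiveInits;
    }

    public static void main(String[] args) {
        Optional<byte[]> noCheckpoint = Optional.empty();
        Optional<byte[]> someCheckpoint = Optional.of(new byte[] {1, 2, 3});

        System.out.println("ALWAYS, no checkpoint: "
                + initsDuringStartup(CoordinatorRestore.ALWAYS, noCheckpoint));
        System.out.println("ONLY_FOR_EXISTING_CHECKPOINT, no checkpoint: "
                + initsDuringStartup(CoordinatorRestore.ONLY_FOR_EXISTING_CHECKPOINT, noCheckpoint));
        System.out.println("ONLY_FOR_EXISTING_CHECKPOINT, with checkpoint: "
                + initsDuringStartup(CoordinatorRestore.ONLY_FOR_EXISTING_CHECKPOINT, someCheckpoint));
    }
}
```

In this toy model, the fresh-start case (no checkpoint) initializes the coordinator once under ONLY_FOR_EXISTING_CHECKPOINT instead of twice under ALWAYS, while a start from an existing checkpoint still restores as expected.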