[GitHub] [flink] StephanEwen commented on a change in pull request #14186: [FLINK-20222][checkpointing] Operator Coordinators are reset with null state when no checkpoint or state available.

GitBox Wed, 25 Nov 2020 04:11:01 -0800


StephanEwen commented on a change in pull request #14186:
URL: https://github.com/apache/flink/pull/14186#discussion_r530326270




##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
##########
@@ -1304,12 +1312,34 @@ public boolean restoreLatestCheckpointedStateToAll(
                        final Set<ExecutionJobVertex> tasks,
                        final boolean allowNonRestoredState) throws Exception {
 
-               return restoreLatestCheckpointedStateInternal(tasks, true, false, allowNonRestoredState);
+               return restoreLatestCheckpointedStateInternal(
+                               tasks,
+                               CoordinatorRestore.ALWAYS, // global recovery restores coordinators, or resets them to empty
+                               false,   // recovery might come before first successful checkpoint
+                               allowNonRestoredState);
+       }
+
+       /**
+        * Restores the latest checkpointed at the beginning of the job execution.
+        * If there is a checkpoint, this method acts like a "global restore"-style
+        * operation where all stateful tasks and coordinators from the given
+        * set of Job Vertices are restored.
+        *
+        * @param tasks Set of job vertices to restore. State for these vertices is
+        *              restored via {@link Execution#setInitialState(JobManagerTaskRestore)}.
+        * @return True, if a checkpoint was found and its state was restored, false otherwise.
+        */
+       public boolean restoreInitialCheckpointIfPresent(final Set<ExecutionJobVertex> tasks) throws Exception {
+               return restoreLatestCheckpointedStateInternal(
+                       tasks,
+                       CoordinatorRestore.ONLY_FOR_EXISTING_CHECKPOINT,

Review comment:
   This is an optimization. ExecutionGraph / Scheduler startup calls
"recoverCheckpoint" and is fine if no checkpoint exists. Tasks have not been
scheduled at that point, so there is no effect on tasks.

   Coordinators, however, have already been created at this point, so they
would go through a "double creation": created with the ExecutionGraph, then
restored with empty state shortly after. Some coordinators do quite a bit of
work on restore, especially on an empty restore (which is like a new
instantiation), such as connecting to Kafka and reading all relevant metadata.
We want to avoid doing that work twice on job startup, because it leads to a
weird experience.

   Coordinator failovers are generally expected to be expensive, and they are
also rare: they happen only on JM failover, or on an explicit global failover
(not on a regional failover that happens to include all tasks).
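
   To illustrate the distinction, here is a minimal, self-contained sketch of
the two restore modes. It is not the code from the pull request: only the enum
constants ALWAYS and ONLY_FOR_EXISTING_CHECKPOINT appear in the diff. The class
FakeCoordinator, the restore(...) helper, and the use of a null byte array to
stand for "empty state" are hypothetical and assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class CoordinatorRestoreSketch {

    /** Same constant names as in the diff; the semantics below are assumed. */
    enum CoordinatorRestore { ALWAYS, ONLY_FOR_EXISTING_CHECKPOINT }

    /** Hypothetical stand-in for a coordinator whose reset is expensive. */
    static final class FakeCoordinator {
        final List<String> events = new ArrayList<>();

        FakeCoordinator() {
            // Creation already does the "start from scratch" work once.
            events.add("created");
        }

        void resetToCheckpoint(byte[] state) {
            // A null state models the "empty restore" described in the comment.
            events.add(state == null ? "reset-empty" : "reset-with-state");
        }
    }

    /**
     * Hypothetical helper: returns true if a checkpoint was found and restored.
     * With ONLY_FOR_EXISTING_CHECKPOINT the coordinator is left untouched when
     * there is no checkpoint, which avoids the "double creation" at job startup.
     */
    static boolean restore(
            FakeCoordinator coordinator,
            Optional<byte[]> latestCheckpoint,
            CoordinatorRestore mode) {

        if (latestCheckpoint.isPresent()) {
            coordinator.resetToCheckpoint(latestCheckpoint.get());
            return true;
        }
        if (mode == CoordinatorRestore.ALWAYS) {
            // Global failover: reset to empty even without a checkpoint.
            coordinator.resetToCheckpoint(null);
        }
        return false;
    }

    public static void main(String[] args) {
        FakeCoordinator atStartup = new FakeCoordinator();
        restore(atStartup, Optional.empty(), CoordinatorRestore.ONLY_FOR_EXISTING_CHECKPOINT);
        System.out.println("job startup:     " + atStartup.events);

        FakeCoordinator onGlobalFailover = new FakeCoordinator();
        restore(onGlobalFailover, Optional.empty(), CoordinatorRestore.ALWAYS);
        System.out.println("global failover: " + onGlobalFailover.events);
    }
}
```

   In this sketch the startup path never touches the freshly created
coordinator when no checkpoint exists, while the global-failover path always
resets it, which matches the trade-off described above under the stated
assumptions.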




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
