Re: [PR] [FLINK-37701][flink-runtime] Fix AdaptiveScheduler ignoring checkpoint states sizes for local recovery adjustment. [flink]

via GitHub Thu, 19 Jun 2025 03:11:52 -0700


Izeren commented on code in PR #26663:
URL: https://github.com/apache/flink/pull/26663#discussion_r2156496787



##########
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java:
##########
@@ -2072,6 +2097,125 @@ void testTryToAssignSlotsReturnsNotPossibleIfExpectedResourcesAreNotAvailable()
         assertThat(assignmentResult.isSuccess()).isFalse();
     }
 
+    @Test
+    void testStateSizeIsConsideredForLocalRecoveryOnRestart() throws Exception {
+        final JobGraph jobGraph = getCheckpointingSingleVertexJobGraph(JOB_VERTEX);
+        final DeclarativeSlotPool slotPool = getSlotPoolWithFreeSlots(PARALLELISM);
+        final List<JobAllocationsInformation> capturedAllocations = new ArrayList<>();
+        final boolean localRecoveryEnabled = true;
+        final String executionTarget = "local";
+        final boolean minimalTaskManagerPreferred = false;
+        final SlotAllocator slotAllocator =
+                getArgumentCapturingDelegatingSlotAllocator(
+                        AdaptiveSchedulerFactory.createSlotSharingSlotAllocator(
+                                slotPool,
+                                localRecoveryEnabled,
+                                executionTarget,
+                                minimalTaskManagerPreferred),
+                        capturedAllocations);
+
+        scheduler =
+                new AdaptiveSchedulerBuilder(
+                                jobGraph,
+                                singleThreadMainThreadExecutor,
+                                EXECUTOR_RESOURCE.getExecutor())
+                        .setDeclarativeSlotPool(slotPool)
+                        .setSlotAllocator(slotAllocator)
+                        .setStateTransitionManagerFactory(
+                                getAutoAdvanceStateTransitionManagerFactory())
+                        .setRestartBackoffTimeStrategy(new TestRestartBackoffTimeStrategy(true, 0L))
+                        .build();
+
+        // Start scheduler
+        startTestInstanceInMainThread();
+
+        // Transition job and all subtasks to RUNNING state.
+        waitForJobStatusRunning(scheduler);
+        runInMainThread(() -> setAllExecutionsToRunning(scheduler));
+
+        // Trigger a checkpoint
+        CompletableFuture<CompletedCheckpoint> completedCheckpointFuture =
+                supplyInMainThread(() -> scheduler.triggerCheckpoint(CheckpointType.FULL));
+
+        // Verify that checkpoint was registered by scheduler.
+        waitForCheckpointInProgress(scheduler);
+
+        // Acknowledge the checkpoint for all tasks with the fake state.
+        final Map<OperatorID, OperatorSubtaskState> operatorStates =
+                getFakeKeyedManagedStateForAllOperators(jobGraph);
+        runInMainThread(() -> acknowledgePendingCheckpoint(scheduler, 1, operatorStates));
+
+        // Wait for the checkpoint to complete.
+        final CompletedCheckpoint completedCheckpoint =
+                completedCheckpointFuture.get(CHECKPOINT_TIMEOUT_SECONDS, TimeUnit.SECONDS);
+
+        // Checkpoint stats must show completed checkpoint before the job is restarted.
+        waitForCompletedCheckpoint(scheduler);

Review Comment:
   I vaguely recall a race condition that left `getLatestCheckpoint` null at
this call. Let me run the test 1000 times to see if I can reproduce it.
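
A repeated run like this can be sketched in plain Java, with no Flink or JUnit dependency. This is only an illustration under assumptions: `RepeatRunner` and `countFailures` are hypothetical names that do not exist in Flink, and the body passed in stands in for one execution of the test logic. With JUnit 5 on the classpath, annotating the test with `@RepeatedTest(1000)` instead of `@Test` is another option.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Flink): reruns a check to surface intermittent failures. */
public final class RepeatRunner {

    private RepeatRunner() {}

    /**
     * Runs {@code body} {@code times} times and returns how many iterations failed.
     * An iteration counts as failed if the body returns false or throws.
     */
    public static int countFailures(int times, Callable<Boolean> body) {
        int failures = 0;
        for (int i = 0; i < times; i++) {
            try {
                if (!body.call()) {
                    failures++;
                }
            } catch (Exception e) {
                // A thrown exception (for example an NPE caused by a race) is a failure too.
                failures++;
            }
        }
        return failures;
    }
}
```

A non-zero result from `RepeatRunner.countFailures(1000, ...)` would suggest the race is reproducible; a zero result does not prove it is absent.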



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
