Izeren commented on code in PR #26663: URL: https://github.com/apache/flink/pull/26663#discussion_r2156496787
##########
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java:
##########
@@ -2072,6 +2097,125 @@ void testTryToAssignSlotsReturnsNotPossibleIfExpectedResourcesAreNotAvailable()
         assertThat(assignmentResult.isSuccess()).isFalse();
     }
+    @Test
+    void testStateSizeIsConsideredForLocalRecoveryOnRestart() throws Exception {
+        final JobGraph jobGraph = getCheckpointingSingleVertexJobGraph(JOB_VERTEX);
+        final DeclarativeSlotPool slotPool = getSlotPoolWithFreeSlots(PARALLELISM);
+        final List<JobAllocationsInformation> capturedAllocations = new ArrayList<>();
+        final boolean localRecoveryEnabled = true;
+        final String executionTarget = "local";
+        final boolean minimalTaskManagerPreferred = false;
+        final SlotAllocator slotAllocator =
+                getArgumentCapturingDelegatingSlotAllocator(
+                        AdaptiveSchedulerFactory.createSlotSharingSlotAllocator(
+                                slotPool,
+                                localRecoveryEnabled,
+                                executionTarget,
+                                minimalTaskManagerPreferred),
+                        capturedAllocations);
+
+        scheduler =
+                new AdaptiveSchedulerBuilder(
+                                jobGraph,
+                                singleThreadMainThreadExecutor,
+                                EXECUTOR_RESOURCE.getExecutor())
+                        .setDeclarativeSlotPool(slotPool)
+                        .setSlotAllocator(slotAllocator)
+                        .setStateTransitionManagerFactory(
+                                getAutoAdvanceStateTransitionManagerFactory())
+                        .setRestartBackoffTimeStrategy(new TestRestartBackoffTimeStrategy(true, 0L))
+                        .build();
+
+        // Start scheduler
+        startTestInstanceInMainThread();
+
+        // Transition job and all subtasks to RUNNING state.
+        waitForJobStatusRunning(scheduler);
+        runInMainThread(() -> setAllExecutionsToRunning(scheduler));
+
+        // Trigger a checkpoint
+        CompletableFuture<CompletedCheckpoint> completedCheckpointFuture =
+                supplyInMainThread(() -> scheduler.triggerCheckpoint(CheckpointType.FULL));
+
+        // Verify that checkpoint was registered by scheduler.
+        waitForCheckpointInProgress(scheduler);
+
+        // Acknowledge the checkpoint for all tasks with the fake state.
+        final Map<OperatorID, OperatorSubtaskState> operatorStates =
+                getFakeKeyedManagedStateForAllOperators(jobGraph);
+        runInMainThread(() -> acknowledgePendingCheckpoint(scheduler, 1, operatorStates));
+
+        // Wait for the checkpoint to complete.
+        final CompletedCheckpoint completedCheckpoint =
+                completedCheckpointFuture.get(CHECKPOINT_TIMEOUT_SECONDS, TimeUnit.SECONDS);
+
+        // Checkpoint stats must show completed checkpoint before the job is restarted.
+        waitForCompletedCheckpoint(scheduler);

Review Comment:
I vaguely recall some race condition leading to `getLatestCheckpoint` being null on the call. Let me run the test 1000 times to see if I can reproduce it.