Martijn Visser created FLINK-39902:
--------------------------------------

             Summary: RescaleTimelineITCase.testRescaleTerminatedByJobFinished fails due to race between task unblock and recorded rescale
                 Key: FLINK-39902
                 URL: https://issues.apache.org/jira/browse/FLINK-39902
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
            Reporter: Martijn Visser
            Assignee: Martijn Visser


testRescaleTerminatedByJobFinished is flaky on slow/loaded CI and has failed on
both the default and adaptive scheduler legs of the master mirror:

- 20260609.1 (buildId 75795), test_cron_adaptive_scheduler core 
https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75795
- 20260604.4 (buildId 75621), test_ci core 
https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75621

Failure:

{noformat}
  RescaleTimelineITCase.testRescaleTerminatedByJobFinished:284
      ->waitUntilConditionWithTimeout:660 » Timeout
{noformat}

Root cause:

The test submits a blocking job at parallelism 4 (full cluster capacity) and
requests an upscale to parallelism 8. Because 8 exceeds the available slots, the
rescale never changes the running parallelism and is only observable as a second
entry in the recorded rescale history (added by DefaultRescaleTimeline when the
rescale starts). The test calls OnceBlockingNoOpInvokable.unblock() immediately
after the requirement update, racing the scheduler's reaction to that update. On
a slow machine the no-op task finishes before the second rescale is started and
recorded, so the history stays at size 1 and the size-2 / JOB_FINISHED condition
times out after 10s. Sibling tests avoid this by waiting for the new parallelism
via waitForVertexParallelismReachedAndJobRunning before unblocking, but that
helper cannot be used here since parallelism 8 is unreachable.
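
The two orderings described above can be illustrated with a toy model. This is a sketch only: ToyRescaleHistory, its state strings other than JOB_FINISHED, and the rule that a rescale is not recorded once the job has finished are assumptions made for illustration, not Flink's DefaultRescaleTimeline.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; every name here is hypothetical and not part of Flink. */
class ToyRescaleHistory {
    private final List<String> entries = new ArrayList<>();
    private boolean jobFinished = false;

    ToyRescaleHistory() {
        // Entry 1: assumed to be present already before the requirement update.
        entries.add("INITIAL");
    }

    /** Scheduler reacts to the requirement update; records nothing once the job is done. */
    void startRescale() {
        if (!jobFinished) {
            entries.add("IN_PROGRESS");
        }
    }

    /** The blocking task was unblocked and the job ran to completion. */
    void finishJob() {
        jobFinished = true;
        int last = entries.size() - 1;
        if (entries.get(last).equals("IN_PROGRESS")) {
            entries.set(last, "JOB_FINISHED");
        }
    }

    List<String> entries() {
        return List.copyOf(entries);
    }

    /** Ordering seen on a slow machine: the job finishes before the scheduler reacts. */
    static List<String> unblockFirst() {
        ToyRescaleHistory history = new ToyRescaleHistory();
        history.finishJob();
        history.startRescale();
        return history.entries();
    }

    /** Ordering the test expects: the rescale is recorded, then the job finishes. */
    static List<String> rescaleFirst() {
        ToyRescaleHistory history = new ToyRescaleHistory();
        history.startRescale();
        history.finishJob();
        return history.entries();
    }
}
```

In this model, unblockFirst() leaves a single entry, which corresponds to the timeout in the failing runs, while rescaleFirst() yields two entries with the second one marked JOB_FINISHED.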

Proposed fix (test-only, assertion-preserving):

Wait until the second rescale has been recorded (history size == 2) before
unblocking the task, so the in-progress rescale resolves to JOB_FINISHED once
the job finishes. Move the assumeThat(enabledRescaleHistory(...)) ahead of the
requirement update so the disabled-history variant skips cleanly.
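
A minimal sketch of the proposed ordering, written against plain Java rather than the Flink test utilities. The class WaitBeforeUnblockSketch, the waitUntil helper, the two stand-in threads, and the state strings other than JOB_FINISHED are hypothetical; the actual test would use its existing helpers and OnceBlockingNoOpInvokable.unblock().

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration; none of these names exist in Flink. */
class WaitBeforeUnblockSketch {

    /** Polls until the condition holds; returns false if the timeout elapses first. */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(10);
        }
        return true;
    }

    static List<String> run(long schedulerDelayMillis) throws InterruptedException {
        List<String> history = new CopyOnWriteArrayList<>(List.of("INITIAL"));
        CountDownLatch unblock = new CountDownLatch(1);

        // Stand-in for the scheduler: records the second rescale after a delay.
        Thread scheduler = new Thread(() -> {
            try {
                Thread.sleep(schedulerDelayMillis);
                history.add("IN_PROGRESS");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stand-in for the blocking job: finishes as soon as it is unblocked.
        Thread job = new Thread(() -> {
            try {
                unblock.await();
                int last = history.size() - 1;
                if (history.get(last).equals("IN_PROGRESS")) {
                    history.set(last, "JOB_FINISHED");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        scheduler.start();
        job.start();

        // Proposed ordering: wait for the second history entry, then unblock.
        if (!waitUntil(() -> history.size() == 2, Duration.ofSeconds(10))) {
            throw new IllegalStateException("second rescale was never recorded");
        }
        unblock.countDown();

        job.join();
        scheduler.join();
        return List.copyOf(history);
    }
}
```

Because the unblock happens only after the size-2 condition holds, the result no longer depends on how slowly the stand-in scheduler reacts, as long as it reacts within the wait timeout.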




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
