[
https://issues.apache.org/jira/browse/FLINK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuepeng Pan updated FLINK-39902:
--------------------------------
Fix Version/s: 2.3.0, 2.4.0
> RescaleTimelineITCase.testRescaleTerminatedByJobFinished fails due to race
> between task unblock and recorded rescale
> --------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39902
> URL: https://issues.apache.org/jira/browse/FLINK-39902
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Martijn Visser
> Assignee: Martijn Visser
> Priority: Major
> Labels: pull-request-available, test-stability
> Fix For: 2.3.0, 2.4.0
>
>
> testRescaleTerminatedByJobFinished is flaky on slow/loaded CI and has failed on
> both the default and adaptive scheduler legs of the master mirror:
> - 20260609.1 (buildId 75795), test_cron_adaptive_scheduler core
> https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75795
> - 20260604.4 (buildId 75621), test_ci core
> https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75621
> Failure:
> {noformat}
> RescaleTimelineITCase.testRescaleTerminatedByJobFinished:284
> ->waitUntilConditionWithTimeout:660 » Timeout
> {noformat}
> Root cause:
> The test submits a blocking job at parallelism 4 (full cluster capacity) and
> requests an upscale to parallelism 8. Because 8 exceeds the available slots,
> the rescale never changes the running parallelism and is only observable as a
> second entry in the recorded rescale history (added by DefaultRescaleTimeline
> when the rescale starts). The test calls OnceBlockingNoOpInvokable.unblock()
> immediately after the requirement update, racing the scheduler's reaction to
> that update. On a slow machine the no-op task finishes before the second
> rescale is started and recorded, so the history stays at size 1 and the
> size-2 / JOB_FINISHED condition times out after 10s. Sibling tests avoid this
> by waiting for the new parallelism via
> waitForVertexParallelismReachedAndJobRunning before unblocking, but that
> helper cannot be used here since parallelism 8 is unreachable.
> Proposed fix (test-only, assertion-preserving):
> Wait until the second rescale has been recorded (history size == 2) before
> unblocking the task, so the in-progress rescale resolves to JOB_FINISHED once
> the job finishes. Move the assumeThat(enabledRescaleHistory(...)) ahead of the
> requirement update so the disabled-history variant skips cleanly.
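
The ordering described in the proposed fix can be sketched in plain Java. This is a standalone illustration, not the actual test code or any Flink API: the class RescaleOrderingSketch, its waitUntil helper, the string-valued history list, and the two threads standing in for the scheduler and the blocking task are all hypothetical. It assumes only what the description states, namely that the second history entry is recorded when the rescale starts and that the entry resolves to JOB_FINISHED when the job finishes.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the test scenario; none of these names exist in Flink.
final class RescaleOrderingSketch {

    // Polls the condition every 10 ms until it holds or the timeout elapses.
    static void waitUntil(BooleanSupplier condition, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline > 0) {
                throw new IllegalStateException("Condition not met within " + timeout);
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    // Runs the scenario with a simulated scheduler reaction delay and
    // returns the recorded history.
    static List<String> run(long schedulerDelayMillis) {
        List<String> history = new CopyOnWriteArrayList<>();
        history.add("INITIAL");                    // first recorded rescale
        CountDownLatch unblock = new CountDownLatch(1);
        AtomicBoolean jobFinished = new AtomicBoolean(false);

        // Simulated scheduler: reacts to the requirement update after a delay
        // and records the second rescale only if the job is still running.
        Thread scheduler = new Thread(() -> {
            try {
                Thread.sleep(schedulerDelayMillis);
            } catch (InterruptedException e) {
                return;
            }
            if (!jobFinished.get()) {
                history.add("IN_PROGRESS");
            }
        });

        // Simulated blocking task: finishes the job once unblocked and
        // resolves an in-progress rescale, if one was recorded.
        Thread task = new Thread(() -> {
            try {
                unblock.await();
            } catch (InterruptedException e) {
                return;
            }
            jobFinished.set(true);
            if (history.size() == 2) {
                history.set(1, "JOB_FINISHED");
            }
        });

        scheduler.start();
        task.start();

        // The fix: wait for the second history entry BEFORE unblocking.
        waitUntil(() -> history.size() == 2, Duration.ofSeconds(10));
        unblock.countDown();

        try {
            task.join();
            scheduler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while joining", e);
        }
        return history;
    }

    public static void main(String[] args) {
        System.out.println(run(200));
    }
}
```

If the waitUntil call is removed from this sketch, the simulated task can finish before the simulated scheduler records the second entry, and the history stays at size 1, which mirrors the timeout seen on slow CI.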
--
This message was sent by Atlassian Jira
(v8.20.10#820010)