[ 
https://issues.apache.org/jira/browse/FLINK-39902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuepeng Pan updated FLINK-39902:
--------------------------------
    Fix Version/s: 2.3.0
                   2.4.0

> RescaleTimelineITCase.testRescaleTerminatedByJobFinished fails due to race 
> between task unblock and recorded rescale
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39902
>                 URL: https://issues.apache.org/jira/browse/FLINK-39902
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Martijn Visser
>            Assignee: Martijn Visser
>            Priority: Major
>              Labels: pull-request-available, test-stability
>             Fix For: 2.3.0, 2.4.0
>
>
> testRescaleTerminatedByJobFinished is flaky on slow/loaded CI and has failed
> on both the default and adaptive scheduler legs of the master mirror:
> - 20260609.1 (buildId 75795), test_cron_adaptive_scheduler core 
> https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75795
> - 20260604.4 (buildId 75621), test_ci core 
> https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75621
> Failure:
> {noformat}
>   RescaleTimelineITCase.testRescaleTerminatedByJobFinished:284
>       ->waitUntilConditionWithTimeout:660 » Timeout
> {noformat}
> Root cause:
> The test submits a blocking job at parallelism 4 (full cluster capacity) and
> requests an upscale to parallelism 8. Because 8 exceeds the available slots,
> the rescale never changes the running parallelism and is only observable as a
> second entry in the recorded rescale history (added by DefaultRescaleTimeline
> when the rescale starts). The test calls OnceBlockingNoOpInvokable.unblock()
> immediately after the requirement update, racing the scheduler's reaction to
> that update. On a slow machine the no-op task finishes before the second
> rescale is started and recorded, so the history stays at size 1 and the
> size-2 / JOB_FINISHED condition times out after 10s. Sibling tests avoid this
> by waiting for the new parallelism via
> waitForVertexParallelismReachedAndJobRunning before unblocking, but that
> helper cannot be used here since parallelism 8 is unreachable.
> Proposed fix (test-only, assertion-preserving):
> Wait until the second rescale has been recorded (history size == 2) before
> unblocking the task, so the in-progress rescale resolves to JOB_FINISHED once
> the job finishes. Move the assumeThat(enabledRescaleHistory(...)) ahead of the
> requirement update so the disabled-history variant skips cleanly.
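
The ordering described in the proposed fix can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the Flink test or scheduler code: the class RescaleRaceSketch, the Scheduler stand-in, the waitUntil helper, and the status strings COMPLETED and IN_PROGRESS are all hypothetical; only the JOB_FINISHED outcome and the "wait for history size == 2 before unblocking" step come from the issue description.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

class RescaleRaceSketch {

    /** Hypothetical stand-in for the scheduler plus its recorded rescale history. */
    static final class Scheduler {
        private final List<String> history = new ArrayList<>();
        private boolean jobFinished;

        Scheduler() {
            history.add("COMPLETED"); // assumed entry for the initial rescale
        }

        /** Records a second rescale, but only while the job is still running. */
        synchronized void onRequirementUpdate() {
            if (!jobFinished) {
                history.add("IN_PROGRESS");
            }
        }

        /** Resolves an in-progress rescale to JOB_FINISHED when the job ends. */
        synchronized void onJobFinished() {
            jobFinished = true;
            int last = history.size() - 1;
            if ("IN_PROGRESS".equals(history.get(last))) {
                history.set(last, "JOB_FINISHED");
            }
        }

        synchronized List<String> snapshot() {
            return new ArrayList<>(history);
        }
    }

    /** Polls the condition until it holds or the timeout elapses. */
    static void waitUntil(BooleanSupplier condition, Duration timeout) throws Exception {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline > 0) {
                throw new TimeoutException("condition not met within " + timeout);
            }
            Thread.sleep(10);
        }
    }

    /**
     * Runs the scenario once. schedulerDelayMillis models a slow machine on which
     * the scheduler reacts late to the requirement update.
     */
    static List<String> run(boolean waitForRecordedRescale, long schedulerDelayMillis)
            throws Exception {
        Scheduler scheduler = new Scheduler();
        CountDownLatch unblock = new CountDownLatch(1);

        // The blocking no-op task: finishes the job as soon as it is unblocked.
        Thread task = new Thread(() -> {
            try {
                unblock.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            scheduler.onJobFinished();
        });

        // The scheduler's delayed reaction to the requirement update.
        Thread reaction = new Thread(() -> {
            try {
                Thread.sleep(schedulerDelayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            scheduler.onRequirementUpdate();
        });

        task.start();
        reaction.start();

        if (waitForRecordedRescale) {
            // The fix: do not unblock until the second rescale has been recorded.
            waitUntil(() -> scheduler.snapshot().size() == 2, Duration.ofSeconds(10));
        }
        unblock.countDown();

        task.join();
        reaction.join();
        return scheduler.snapshot();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("unblock immediately:     " + run(false, 200));
        System.out.println("wait for size == 2 first: " + run(true, 200));
    }
}
```

In this model, waiting for the second history entry makes the outcome independent of how slowly the scheduler reacts. Unblocking immediately depends on thread timing, and with a late reaction the history will usually stay at a single entry, which mirrors the timeout described in the issue.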



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
