[ 
https://issues.apache.org/jira/browse/FLINK-39903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088695#comment-18088695
 ] 

Yuepeng Pan commented on FLINK-39903:
-------------------------------------

Merged into master (2.4.0) via aa52739877badaca3f9afd6efbec975c66476197

> RescaleTimelineITCase.testRescaleTerminatedByResourceRequirementsUpdated is 
> flaky: second resource-requirements update can miss the in-progress rescale
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39903
>                 URL: https://issues.apache.org/jira/browse/FLINK-39903
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Martijn Visser
>            Assignee: Martijn Visser
>            Priority: Major
>              Labels: pull-request-available
>
> See
> https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75621
>
> testRescaleTerminatedByResourceRequirementsUpdated asserts that the second
> updateJobResourceRequirements RPC terminates the in-progress rescale started
> by the first update with terminal reason RESOURCE_REQUIREMENTS_UPDATED. That
> setter (AdaptiveScheduler#recordRescaleForNewResourceRequirements via
> RescaleTimeline#updateRescale) is a no-op once the current rescale is already
> terminated (DefaultRescaleTimeline#isIdling).
>
> The requested upper bound exceeds available slots, so the first rescale cannot
> change parallelism. With the short cooldown (100 ms) and resource-stabilization
> (50 ms) timeouts shared by the parameterized configuration, the
> DefaultStateTransitionManager re-enters Idling and terminates the in-progress
> rescale with NO_RESOURCES_OR_PARALLELISMS_CHANGE. Those are wall-clock timers
> that start when the first rescale is recorded, so on a slow machine the rescale
> is terminated before the second update RPC is processed, and the second update
> finds it already terminated, producing the flaky assertion failure.
>
> This is a test-side timing assumption, not a product bug; re-entering Idling
> and recording NO_RESOURCES_OR_PARALLELISMS_CHANGE is correct behaviour.
>
> Proposed fix (test-only): for this case only, rebuild the fixture cluster in
> place with widened cooldown/stabilization (60 s) so the in-progress rescale
> stays alive across the single synchronous RPC round trip between the two
> updates. The shared parameterized configuration used by the other cases is
> left untouched; the disabled-history parameter is skipped up front.
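
The race in the quoted description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Flink implementation or the merged test change: ToyTimeline, run, and the 400 ms RPC gap are hypothetical names and values, simulated time replaces the wall-clock timers, and the model assumes the in-progress rescale is terminated once cooldown plus stabilization has elapsed (the real sequencing inside DefaultStateTransitionManager may differ).

```java
import java.time.Duration;

/** Toy model of the race described above; none of these types are Flink classes. */
class RescaleRaceSketch {

    enum TerminalReason {
        NO_RESOURCES_OR_PARALLELISMS_CHANGE,
        RESOURCE_REQUIREMENTS_UPDATED
    }

    /** Hypothetical stand-in for a rescale timeline that tracks one rescale. */
    static final class ToyTimeline {
        // Assumption: the rescale is terminated after cooldown + stabilization.
        private final Duration idleDeadline;
        // null while the rescale started by the first update is still in progress.
        private TerminalReason terminalReason;

        ToyTimeline(Duration cooldown, Duration stabilization) {
            this.idleDeadline = cooldown.plus(stabilization);
        }

        /** Simulated timers: terminate the rescale once the deadline has elapsed. */
        void advanceTo(Duration sinceFirstUpdate) {
            if (terminalReason == null && sinceFirstUpdate.compareTo(idleDeadline) >= 0) {
                terminalReason = TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE;
            }
        }

        /** Second update: a no-op if the rescale has already been terminated. */
        void onSecondUpdate() {
            if (terminalReason == null) {
                terminalReason = TerminalReason.RESOURCE_REQUIREMENTS_UPDATED;
            }
        }

        TerminalReason terminalReason() {
            return terminalReason;
        }
    }

    /** Terminal reason of the first rescale when the second update arrives after rpcGap. */
    static TerminalReason run(Duration cooldown, Duration stabilization, Duration rpcGap) {
        ToyTimeline timeline = new ToyTimeline(cooldown, stabilization);
        timeline.advanceTo(rpcGap);  // timers fire (or not) before the second RPC is processed
        timeline.onSecondUpdate();
        return timeline.terminalReason();
    }

    public static void main(String[] args) {
        Duration slowGap = Duration.ofMillis(400); // hypothetical RPC gap on a slow machine
        // Short shared timeouts (100 ms / 50 ms): the timers win the race.
        System.out.println(run(Duration.ofMillis(100), Duration.ofMillis(50), slowGap));
        // Widened timeouts (60 s each): the second update still finds the rescale alive.
        System.out.println(run(Duration.ofSeconds(60), Duration.ofSeconds(60), slowGap));
    }
}
```

With the short timeouts the model records NO_RESOURCES_OR_PARALLELISMS_CHANGE for the first rescale, which corresponds to the flaky assertion failure; with the widened timeouts the same gap yields RESOURCE_REQUIREMENTS_UPDATED, which is what the test expects.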



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
