[
https://issues.apache.org/jira/browse/FLINK-39903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088695#comment-18088695
]
Yuepeng Pan commented on FLINK-39903:
-------------------------------------
Merged into master (2.4.0) via: aa52739877badaca3f9afd6efbec975c66476197
> RescaleTimelineITCase.testRescaleTerminatedByResourceRequirementsUpdated is
> flaky: second resource-requirements update can miss the in-progress rescale
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39903
> URL: https://issues.apache.org/jira/browse/FLINK-39903
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Martijn Visser
> Assignee: Martijn Visser
> Priority: Major
> Labels: pull-request-available
>
> See
> https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75621
> testRescaleTerminatedByResourceRequirementsUpdated asserts that the second
> updateJobResourceRequirements RPC terminates the in-progress rescale started
> by the first update, with terminal reason RESOURCE_REQUIREMENTS_UPDATED. The
> setter that records this reason
> (AdaptiveScheduler#recordRescaleForNewResourceRequirements via
> RescaleTimeline#updateRescale) is a no-op once the current rescale is already
> terminated (DefaultRescaleTimeline#isIdling).
> The requested upper bound exceeds available slots, so the first rescale cannot
> change parallelism. With the short cooldown (100 ms) and
> resource-stabilization (50 ms) timeouts shared by the parameterized
> configuration, the DefaultStateTransitionManager re-enters Idling and
> terminates the in-progress rescale with NO_RESOURCES_OR_PARALLELISMS_CHANGE.
> Those are wall-clock timers that start when the first rescale is recorded, so
> on a slow machine the rescale is terminated before the second update RPC is
> processed. The second update then finds the rescale already terminated, which
> produces the flaky assertion failure.
> This is a test-side timing assumption, not a product bug; re-entering Idling
> and recording NO_RESOURCES_OR_PARALLELISMS_CHANGE is correct behaviour.
> Proposed fix (test-only): for this case only, rebuild the fixture cluster in
> place with widened cooldown/stabilization (60 s) so the in-progress rescale
> stays alive across the single synchronous RPC round trip between the two
> updates. The shared parameterized configuration used by the other cases is
> left untouched; the disabled-history parameter is skipped up front.
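
The ordering problem described above can be illustrated with a small,
self-contained toy model. This is only a sketch: ToyTimeline, terminate, and
run are hypothetical names invented for illustration, not Flink classes or
methods, and the timers are replaced by a boolean flag so the outcome is
deterministic. It assumes only what the description states, namely that
recording a terminal reason is a no-op once the current rescale is already
terminated.

```java
import java.util.Optional;

public class RescaleRaceSketch {
    public enum TerminalReason {
        NO_RESOURCES_OR_PARALLELISMS_CHANGE,
        RESOURCE_REQUIREMENTS_UPDATED
    }

    /** Hypothetical stand-in for a rescale timeline; not the Flink implementation. */
    static final class ToyTimeline {
        private boolean started;
        private TerminalReason reason; // null while the rescale is in progress

        void startRescale() {
            started = true;
            reason = null;
        }

        boolean isIdling() {
            return !started || reason != null;
        }

        /** Records a terminal reason only while a rescale is in progress. */
        boolean terminate(TerminalReason newReason) {
            if (isIdling()) {
                return false; // already terminated: the update is silently dropped
            }
            reason = newReason;
            return true;
        }

        Optional<TerminalReason> terminalReason() {
            return Optional.ofNullable(reason);
        }
    }

    /** Replays the two updates; the flag models the short timers firing first. */
    public static TerminalReason run(boolean timersFireBeforeSecondUpdate) {
        ToyTimeline timeline = new ToyTimeline();
        timeline.startRescale(); // first resource-requirements update
        if (timersFireBeforeSecondUpdate) {
            // Slow machine: cooldown/stabilization elapse before the second RPC.
            timeline.terminate(TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE);
        }
        // Second resource-requirements update.
        timeline.terminate(TerminalReason.RESOURCE_REQUIREMENTS_UPDATED);
        return timeline.terminalReason().orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // RESOURCE_REQUIREMENTS_UPDATED
        System.out.println(run(true));  // NO_RESOURCES_OR_PARALLELISMS_CHANGE
    }
}
```

In this model, widening the timeouts corresponds to guaranteeing that the flag
is false for the duration of the test, which is the intent of the test-only
fix described above.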