Martijn Visser created FLINK-39903:
--------------------------------------

             Summary: 
RescaleTimelineITCase.testRescaleTerminatedByResourceRequirementsUpdated is 
flaky: second resource-requirements update can miss the in-progress rescale
                 Key: FLINK-39903
                 URL: https://issues.apache.org/jira/browse/FLINK-39903
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
            Reporter: Martijn Visser
            Assignee: Martijn Visser


See 
https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75621

testRescaleTerminatedByResourceRequirementsUpdated asserts that the second
updateJobResourceRequirements RPC terminates the in-progress rescale started by
the first update, with terminal reason RESOURCE_REQUIREMENTS_UPDATED. The code
path that records this reason
(AdaptiveScheduler#recordRescaleForNewResourceRequirements via
RescaleTimeline#updateRescale) is a no-op once the current rescale is already
terminated (DefaultRescaleTimeline#isIdling).
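
The no-op behaviour can be illustrated with a small standalone model. This is
only a sketch: RescaleTimelineSketch and its methods are hypothetical names
invented for illustration and do not mirror the actual Flink classes or their
signatures. It only shows why a late second update cannot overwrite a terminal
reason that was already recorded.

```java
import java.util.Optional;

/** Toy model only; hypothetical names, not Flink's real API. */
public class RescaleTimelineSketch {

    public enum TerminalReason {
        RESOURCE_REQUIREMENTS_UPDATED,
        NO_RESOURCES_OR_PARALLELISMS_CHANGE
    }

    private boolean started;
    private TerminalReason terminalReason; // null while the rescale is in progress

    /** Begins a new in-progress rescale. */
    public void startRescale() {
        started = true;
        terminalReason = null;
    }

    /** True when there is no in-progress rescale to update. */
    public boolean isIdling() {
        return !started || terminalReason != null;
    }

    /** Records a terminal reason; returns false (no-op) when already idling. */
    public boolean terminate(TerminalReason reason) {
        if (isIdling()) {
            return false;
        }
        terminalReason = reason;
        return true;
    }

    public Optional<TerminalReason> terminalReason() {
        return Optional.ofNullable(terminalReason);
    }

    public static void main(String[] args) {
        RescaleTimelineSketch timeline = new RescaleTimelineSketch();
        timeline.startRescale(); // first update starts a rescale
        // Timers expire before the second update arrives:
        timeline.terminate(TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE);
        // The second update is now a no-op, so the original reason is kept.
        boolean recorded = timeline.terminate(TerminalReason.RESOURCE_REQUIREMENTS_UPDATED);
        System.out.println(recorded + " " + timeline.terminalReason().get());
    }
}
```

In this model the second terminate call returns false, which corresponds to the
state the test observes when it fails.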

The requested upper bound exceeds available slots, so the first rescale cannot
change parallelism. With the short cooldown (100 ms) and resource-stabilization
(50 ms) timeouts shared by the parameterized configuration, the
DefaultStateTransitionManager re-enters Idling and terminates the in-progress
rescale with NO_RESOURCES_OR_PARALLELISMS_CHANGE. Both are wall-clock timers
that start when the first rescale is recorded, so on a slow machine they can
expire before the second update RPC is processed. The second update then finds
the rescale already terminated, and the assertion fails intermittently.

This is a test-side timing assumption, not a product bug; re-entering Idling and
recording NO_RESOURCES_OR_PARALLELISMS_CHANGE is correct behaviour.

Proposed fix (test-only): for this case only, rebuild the fixture cluster in
place with widened cooldown/stabilization (60 s) so the in-progress rescale
stays alive across the single synchronous RPC round trip between the two
updates. The shared parameterized configuration used by the other cases is left
untouched; the disabled-history parameter is skipped up front.
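
The effect of widening the timers can be sketched with a deterministic model
that compares two delays instead of sleeping. This is an illustration under
stated assumptions, not the test code: RescaleRaceSketch and observedReason are
hypothetical names, the two timers are collapsed into a single deadline, and
the 150 ms figure is simply the sum of the 100 ms and 50 ms values above. How
the two timers actually combine in the scheduler is not modelled here.

```java
import java.time.Duration;

/** Toy timing model; hypothetical names, not part of Flink. */
public class RescaleRaceSketch {

    public enum TerminalReason {
        RESOURCE_REQUIREMENTS_UPDATED,
        NO_RESOURCES_OR_PARALLELISMS_CHANGE
    }

    /**
     * Returns the terminal reason of the first rescale, given when the timers
     * expire and when the second update is processed (both measured from the
     * moment the first rescale is recorded).
     */
    public static TerminalReason observedReason(
            Duration timerDeadline, Duration secondUpdateDelay) {
        // A tie is treated as the timer winning, which is the worst case for the test.
        boolean timerFiresFirst = timerDeadline.compareTo(secondUpdateDelay) <= 0;
        return timerFiresFirst
                ? TerminalReason.NO_RESOURCES_OR_PARALLELISMS_CHANGE
                : TerminalReason.RESOURCE_REQUIREMENTS_UPDATED;
    }

    public static void main(String[] args) {
        Duration shortTimers = Duration.ofMillis(150); // assumed: 100 ms + 50 ms
        Duration wideTimers = Duration.ofSeconds(60);  // widened value from the proposed fix
        Duration fastRpc = Duration.ofMillis(20);      // assumed delay on a fast machine
        Duration slowRpc = Duration.ofMillis(400);     // assumed delay on a slow machine

        System.out.println(observedReason(shortTimers, fastRpc)); // RESOURCE_REQUIREMENTS_UPDATED
        System.out.println(observedReason(shortTimers, slowRpc)); // NO_RESOURCES_OR_PARALLELISMS_CHANGE
        System.out.println(observedReason(wideTimers, slowRpc));  // RESOURCE_REQUIREMENTS_UPDATED
    }
}
```

With the short deadline the outcome depends on how long the second RPC takes;
with a 60 s deadline any plausible RPC round trip lands before the timers
expire, which is the property the proposed fix relies on.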





--
This message was sent by Atlassian Jira
(v8.20.10#820010)
