Martijn Visser created FLINK-40010:
--------------------------------------
Summary: RescaleTimelineITCase.testRescaleTerminatedByNoResourcesOrNoParallelismsChange is flaky: requirements-update can miss the in-progress rescale
Key: FLINK-40010
URL: https://issues.apache.org/jira/browse/FLINK-40010
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination, Tests
Affects Versions: 2.4.0
Reporter: Martijn Visser
Assignee: Martijn Visser
testRescaleTerminatedByNoResourcesOrNoParallelismsChange fails on CI: the
awaited terminal reason NO_RESOURCES_OR_PARALLELISMS_CHANGE is never recorded,
so the wait times out (or, before
https://issues.apache.org/jira/browse/FLINK-40009, hangs).
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76350&view=results
(leg: test_cron_azure core)
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76242&view=results
(leg: test_cron_jdk11 core)
Root cause: NO_RESOURCES_OR_PARALLELISMS_CHANGE is stamped by
DefaultStateTransitionManager only on the rescale tracked when the manager
(re-)enters its Idling phase. With the short shared cooldown, the cooldown can
elapse and the manager can reach Idling before the requirements-update RPC is
processed, so the UPDATE_REQUIREMENT rescale is created after Idling was
entered and never receives the terminal reason; it stays in-progress until
teardown cancels it (JOB_CANCELED).
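The ordering dependence described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyManager, Rescale, and the event methods are hypothetical stand-ins invented for illustration, not Flink's DefaultStateTransitionManager or its API. The model only reproduces the rule that the reason is stamped on whatever rescale is tracked at the moment Idling is entered.

```java
final class RescaleRaceSketch {

    /** Hypothetical phases; the real manager has more. */
    enum Phase { COOLDOWN, IDLING }

    /** Hypothetical stand-in for a tracked rescale (not a Flink class). */
    static final class Rescale {
        final String trigger;
        String terminalReason; // null while the rescale is still in progress

        Rescale(String trigger) {
            this.trigger = trigger;
        }
    }

    /** Toy manager: stamps the reason only on the rescale tracked when Idling is entered. */
    static final class ToyManager {
        Phase phase = Phase.COOLDOWN;
        Rescale tracked;

        /** Models the requirements-update RPC being processed. */
        void onRequirementsUpdate() {
            tracked = new Rescale("UPDATE_REQUIREMENT");
        }

        /** Models the cooldown elapsing and the manager entering Idling. */
        void onCooldownElapsed() {
            phase = Phase.IDLING;
            if (tracked != null && tracked.terminalReason == null) {
                tracked.terminalReason = "NO_RESOURCES_OR_PARALLELISMS_CHANGE";
            }
        }

        /** Models test teardown cancelling whatever is still in progress. */
        void onTeardown() {
            if (tracked != null && tracked.terminalReason == null) {
                tracked.terminalReason = "JOB_CANCELED";
            }
        }
    }

    /** Replays the two possible orderings and returns the final terminal reason. */
    static String run(boolean updateProcessedDuringCooldown) {
        ToyManager manager = new ToyManager();
        if (updateProcessedDuringCooldown) {
            manager.onRequirementsUpdate();
            manager.onCooldownElapsed();
        } else {
            manager.onCooldownElapsed();
            manager.onRequirementsUpdate();
        }
        manager.onTeardown();
        return manager.tracked.terminalReason;
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // NO_RESOURCES_OR_PARALLELISMS_CHANGE
        System.out.println(run(false)); // JOB_CANCELED
    }
}
```

In this toy model the outcome depends only on whether the update is processed before or after the cooldown elapses, which matches the flakiness described in the report.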
Fix: rebuild the fixture cluster with a cooldown (10s) that comfortably
outlasts the synchronous update RPC, so the update is processed in Cooldown and
routed back through Idling where the reason is stamped. Unlike
testRescaleTerminatedByResourceRequirementsUpdated (FLINK-39903), this case
must wait out the whole cooldown before the condition can be met, so the
cooldown is kept modest (10s) and the wait budget is widened to 60s.
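The relationship between the cooldown and the wait budget can be sketched as a bounded polling wait. This is a minimal illustration, not the test's actual helper: BoundedWaitSketch, waitUntil, and budgetCoversCooldown are hypothetical names, and the polling interval is an assumption. The point is that the wait gives up after a fixed budget instead of hanging, and that the budget must exceed the cooldown because the condition cannot be met before the cooldown has elapsed.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class BoundedWaitSketch {

    /**
     * Polls the condition until it holds or the budget is used up.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration budget, Duration pollInterval) {
        final long deadline = System.nanoTime() + budget.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // budget exhausted: fail instead of hanging
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }

    /** The wait can only succeed if the budget is strictly longer than the cooldown. */
    static boolean budgetCoversCooldown(Duration cooldown, Duration budget) {
        return budget.compareTo(cooldown) > 0;
    }

    public static void main(String[] args) {
        // Values from the report: 10s cooldown, 60s wait budget.
        System.out.println(budgetCoversCooldown(Duration.ofSeconds(10), Duration.ofSeconds(60)));
    }
}
```

With the values from the report, the 60s budget leaves 50s of headroom after the 10s cooldown for the terminal reason to be recorded and observed.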
Related: FLINK-39902, FLINK-39903 (sibling races in the same class).