[
https://issues.apache.org/jira/browse/FLINK-38534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088520#comment-18088520
]
Martijn Visser commented on FLINK-38534:
----------------------------------------
This recurred on master in
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865&view=results
(leg: test_cron_azure core), at a *different* wait than the one fixed here:
{code}
06:11:38.195 [ERROR] org.apache.flink.runtime.scheduler.adaptive.LocalRecoveryTest.testStateSizeIsConsideredForLocalRecoveryOnRestart -- Time elapsed: 65.14 s <<< ERROR!
org.apache.flink.util.FlinkException: Exhausted retry attempts.
	at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:175)
	at org.apache.flink.runtime.scheduler.SchedulerTestingUtils.waitForCheckpointInProgress(SchedulerTestingUtils.java:320)
	at org.apache.flink.runtime.scheduler.adaptive.LocalRecoveryTest.testStateSizeIsConsideredForLocalRecoveryOnRestart(LocalRecoveryTest.java:126)
{code}
Root cause: the test forces all executions to RUNNING via
{{setAllExecutionsToRunning}} while the AdaptiveScheduler is still deploying
them. Deployment builds the TaskDeploymentDescriptor on an I/O thread and
applies it back on the main thread, where
{{Execution.tryGetTaskDeploymentDescriptorForSlot}} rejects deployment once the
execution has left DEPLOYING. The run log shows the collision: "Cannot deploy
v1 (1/4) ... because execution state has switched to RUNNING during task
restore offload". The vertex fails, the job restarts, the manually triggered
checkpoint never registers, and {{waitForCheckpointInProgress}} exhausts its
retries.
The earlier fix here added {{waitForAllTasksRunning}} *after* the forced
transition, which only observes the forced state and does not prevent this
deployment race. Fix incoming: wait for TDD creation to complete before forcing
RUNNING (new
{{SchedulerTestingUtils#waitForAllTasksDeploymentDescriptorsCreated}}). This is
a test-only race; the production {{state != DEPLOYING}} check is correct.
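For anyone following along, the ordering problem can be reproduced outside Flink with a toy model. This is only a sketch under stated assumptions: {{DeploymentRaceSketch}}, its {{State}} enum, and {{run}} are hypothetical names, not Flink classes, and the two single-thread executors merely stand in for the I/O thread and the main thread. The toy waits for the apply-back step as its stand-in for "TDD creation complete"; how the real helper detects completion is not shown here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the race; none of these names are Flink classes. */
public class DeploymentRaceSketch {

    public enum State { DEPLOYING, RUNNING, FAILED }

    /**
     * Runs one simulated deployment and returns the final state.
     *
     * @param waitBeforeForcing true: wait for the descriptor to be built and applied
     *     before forcing RUNNING (the fixed ordering); false: force RUNNING first.
     */
    public static State run(boolean waitBeforeForcing) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        AtomicReference<State> state = new AtomicReference<>(State.DEPLOYING);
        // The latch gates the "I/O thread" so the interleaving is deterministic.
        CountDownLatch releaseIo = new CountDownLatch(1);
        try {
            // Step 1: build the descriptor off the main thread.
            CompletableFuture<String> built = CompletableFuture.supplyAsync(() -> {
                try {
                    releaseIo.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
                return "descriptor";
            }, ioThread);

            // Step 2: apply it back on the main thread; reject the deployment
            // if the execution has already left DEPLOYING.
            CompletableFuture<Void> applied = built.thenAcceptAsync(descriptor -> {
                if (state.get() != State.DEPLOYING) {
                    state.set(State.FAILED);
                }
            }, mainThread);

            if (waitBeforeForcing) {
                // Fixed ordering: let deployment finish, then force RUNNING.
                releaseIo.countDown();
                applied.get(5, TimeUnit.SECONDS);
                mainThread.submit(() -> state.set(State.RUNNING)).get(5, TimeUnit.SECONDS);
            } else {
                // Racy ordering: force RUNNING while the descriptor is still being built.
                mainThread.submit(() -> state.set(State.RUNNING)).get(5, TimeUnit.SECONDS);
                releaseIo.countDown();
                applied.get(5, TimeUnit.SECONDS);
            }
            return state.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            mainThread.shutdownNow();
            ioThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("force first: " + run(false)); // FAILED
        System.out.println("wait first:  " + run(true));  // RUNNING
    }
}
```

In the toy, waiting on the forced state after the fact would change nothing in the {{run(false)}} case: the state check on the apply-back path has already seen RUNNING by then, which mirrors why the earlier {{waitForAllTasksRunning}} call could not help.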
> Fix flaky LocalRecoveryTest by waiting for tasks to reach RUNNING state
> -----------------------------------------------------------------------
>
> Key: FLINK-38534
> URL: https://issues.apache.org/jira/browse/FLINK-38534
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Ruan Hang
> Assignee: mukul mustikar
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.3.0
>
>
> {code:java}
> Feb 27 04:21:50 04:21:50.067 [INFO] Results:
> Feb 27 04:21:50 04:21:50.068 [INFO]
> Feb 27 04:21:50 04:21:50.069 [ERROR] Errors:
> Feb 27 04:21:50 04:21:50.070 [ERROR] LocalRecoveryTest.testStateSizeIsConsideredForLocalRecoveryOnRestart:113 » Flink Exhausted retry attempts.
> Feb 27 04:21:50 04:21:50.071 [INFO]
> Feb 27 04:21:50 04:21:50.071 [ERROR] Tests run: 109715, Failures: 0, Errors: 1, Skipped: 354
> Feb 27 04:21:50 04:21:50.071 [INFO]
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70334&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=25baecb7-cea0-597a-6b01-188b1478210d