[ https://issues.apache.org/jira/browse/FLINK-38534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088520#comment-18088520 ]

Martijn Visser commented on FLINK-38534:
----------------------------------------

This recurred on master in 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865&view=results
 (leg: test_cron_azure core), at a *different* wait than the one fixed here:

{code}
06:11:38.195 [ERROR] org.apache.flink.runtime.scheduler.adaptive.LocalRecoveryTest.testStateSizeIsConsideredForLocalRecoveryOnRestart -- Time elapsed: 65.14 s <<< ERROR!
org.apache.flink.util.FlinkException: Exhausted retry attempts.
        at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:175)
        at org.apache.flink.runtime.scheduler.SchedulerTestingUtils.waitForCheckpointInProgress(SchedulerTestingUtils.java:320)
        at org.apache.flink.runtime.scheduler.adaptive.LocalRecoveryTest.testStateSizeIsConsideredForLocalRecoveryOnRestart(LocalRecoveryTest.java:126)
{code}

Root cause: the test forces all executions to RUNNING via 
{{setAllExecutionsToRunning}} while the AdaptiveScheduler is still deploying 
them. Deployment builds the TaskDeploymentDescriptor on an I/O thread and 
applies it back on the main thread, where 
{{Execution.tryGetTaskDeploymentDescriptorForSlot}} rejects deployment once the 
execution has left DEPLOYING. The run log shows the collision: "Cannot deploy 
v1 (1/4) ... because execution state has switched to RUNNING during task 
restore offload". The vertex fails, the job restarts, the manually triggered 
checkpoint never registers, and {{waitForCheckpointInProgress}} exhausts its 
retries.

The earlier fix here added {{waitForAllTasksRunning}} *after* the forced 
transition, which only observes the forced state and does not prevent this 
deployment race. Fix incoming: wait for TDD creation to complete before forcing 
RUNNING (new 
{{SchedulerTestingUtils#waitForAllTasksDeploymentDescriptorsCreated}}). This is 
a test-only race; the production {{state != DEPLOYING}} check is correct.
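
To illustrate the ordering problem described above, here is a toy model of the race. It is a sketch under stated assumptions, not Flink code: {{DeploymentRaceSketch}}, {{State}}, {{simulate}} and {{awaitQuietly}} are hypothetical names, a latch stands in for the time the I/O thread needs to build the descriptor, and "waiting" here means waiting until the descriptor has been applied on the simulated main thread, which may differ from what the actual test utility waits for.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the deployment race; none of these types are Flink classes. */
public class DeploymentRaceSketch {

    public enum State { DEPLOYING, RUNNING, FAILED }

    /** Returns the final state of one simulated execution. */
    public static State simulate(boolean waitForDescriptor) {
        ExecutorService ioPool = Executors.newSingleThreadExecutor();
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        AtomicReference<State> state = new AtomicReference<>(State.DEPLOYING);
        // Holds the simulated I/O thread back until the test lets it finish.
        CountDownLatch ioMayFinish = new CountDownLatch(1);
        try {
            CompletableFuture<Void> descriptorApplied = CompletableFuture
                    .supplyAsync(() -> {
                        awaitQuietly(ioMayFinish);
                        return "descriptor"; // stand-in for the deployment descriptor
                    }, ioPool)
                    .thenAcceptAsync(descriptor -> {
                        // Guard modeled on the description above: deployment is
                        // rejected once the execution has left DEPLOYING.
                        if (state.get() != State.DEPLOYING) {
                            state.set(State.FAILED);
                        }
                    }, mainThread);

            if (waitForDescriptor) {
                // Fixed ordering: let deployment finish, then force RUNNING.
                ioMayFinish.countDown();
                descriptorApplied.join();
                state.compareAndSet(State.DEPLOYING, State.RUNNING);
            } else {
                // Racy ordering: force RUNNING while deployment is in flight.
                state.compareAndSet(State.DEPLOYING, State.RUNNING);
                ioMayFinish.countDown();
                descriptorApplied.join();
            }
            return state.get();
        } finally {
            ioPool.shutdown();
            mainThread.shutdown();
        }
    }

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("force RUNNING first: " + simulate(false)); // FAILED
        System.out.println("wait, then force:    " + simulate(true));  // RUNNING
    }
}
```

In this model, a wait placed after the forced transition cannot help, because the guard has not yet run when the state is overwritten. Only ordering the forced transition after the deployment step avoids the failure.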

> Fix flaky LocalRecoveryTest by waiting for tasks to reach RUNNING state
> -----------------------------------------------------------------------
>
>                 Key: FLINK-38534
>                 URL: https://issues.apache.org/jira/browse/FLINK-38534
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.2.0
>            Reporter: Ruan Hang
>            Assignee: mukul mustikar
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.3.0
>
>
> {code:java}
> Feb 27 04:21:50 04:21:50.067 [INFO] Results:
> Feb 27 04:21:50 04:21:50.068 [INFO] 
> Feb 27 04:21:50 04:21:50.069 [ERROR] Errors: 
> Feb 27 04:21:50 04:21:50.070 [ERROR]   LocalRecoveryTest.testStateSizeIsConsideredForLocalRecoveryOnRestart:113 » Flink Exhausted retry attempts.
> Feb 27 04:21:50 04:21:50.071 [INFO] 
> Feb 27 04:21:50 04:21:50.071 [ERROR] Tests run: 109715, Failures: 0, Errors: 1, Skipped: 354
> Feb 27 04:21:50 04:21:50.071 [INFO] 
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70334&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=25baecb7-cea0-597a-6b01-188b1478210d



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
