[ 
https://issues.apache.org/jira/browse/FLINK-39918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuepeng Pan updated FLINK-39918:
--------------------------------
    Fix Version/s: 2.3.0
                   2.4.0

> KeyedComplexChainTest hangs until the CI watchdog kills the fork: 
> AbstractOperatorRestoreTestBase waits ~2.7h for a job status that can never 
> arrive
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39918
>                 URL: https://issues.apache.org/jira/browse/FLINK-39918
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.4.0
>            Reporter: Martijn Visser
>            Assignee: Martijn Visser
>            Priority: Major
>              Labels: pull-request-available, test-stability
>             Fix For: 2.3.0, 2.4.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865&view=results
>  (leg: test_ci tests)
> {code}
>   04:23:42 Process produced no output for 900 seconds.
> {code}
> {{org.apache.flink.test.state.operator.restore.keyed.KeyedComplexChainTest}} 
> started but never completed; the watchdog killed the surefire fork (exit code 
> 143) after 900 s of silence. The surefire dump shows the test thread blocked 
> at {{AbstractOperatorRestoreTestBase.restoreJob:257}} on the 1.10 savepoint, 
> with no task threads alive.
> Root cause: {{migrateJob}} and {{restoreJob}} each wait for one specific 
> {{JobStatus}} ({{migrateJob}}: RUNNING, then CANCELED; {{restoreJob}}: 
> FINISHED) via {{retrySuccessfulWithDelay}} against {{TEST_TIMEOUT = 
> Duration.ofSeconds(10000L)}} (~2.7 hours). If the job reaches a *different* 
> globally terminal state (e.g. FAILED), the predicate never matches and the 
> wait keeps polling far beyond the 900 s CI watchdog limit. The watchdog then 
> kills the entire fork, which hides both the offending test and the actual 
> job failure.
> Historic hang tickets for this test (FLINK-18138, FLINK-12916) are long 
> closed and unrelated.
> Proposed fix (pattern of FLINK-39879): a {{waitForJobStatus}} helper that 
> fails fast when the job reaches a globally terminal state other than the 
> target (surfacing the unexpected state), {{TEST_TIMEOUT}} reduced to 5 
> minutes, and {{@Timeout(10, MINUTES)}} on the test template as a hard 
> anti-hang guard. This converts the fork-killing hang into a localized, 
> diagnosable failure. Whether the job legitimately reaches FAILED in these 
> restore scenarios may warrant a separate runtime investigation once such a 
> failure has been captured.
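
The fail-fast wait described in the proposed fix can be sketched roughly as
follows. This is an illustration only, not the code from the pull request: the
class name, the method signature, the Supplier-based status lookup, and the
trimmed-down JobStatus enum are all assumptions made so the example compiles
without Flink on the classpath. Only the idea (stop waiting as soon as the job
reaches a globally terminal state other than the target) comes from the issue
text.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical sketch; names and signatures are not taken from the Flink patch.
class WaitForJobStatusSketch {

    /** Minimal stand-in for Flink's JobStatus, reduced to the states discussed here. */
    enum JobStatus {
        RUNNING(false),
        FINISHED(true),
        CANCELED(true),
        FAILED(true);

        private final boolean globallyTerminal;

        JobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminalState() {
            return globallyTerminal;
        }
    }

    /**
     * Polls statusSupplier until it returns target. Throws immediately if the job
     * reaches any other globally terminal state, and throws once timeout elapses.
     */
    static void waitForJobStatus(
            Supplier<JobStatus> statusSupplier,
            JobStatus target,
            Duration timeout,
            Duration pollInterval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            JobStatus current = statusSupplier.get();
            if (current == target) {
                return;
            }
            if (current.isGloballyTerminalState()) {
                // The target can no longer be reached, so surface the actual state now.
                throw new IllegalStateException(
                        "Job reached globally terminal state " + current
                                + " while waiting for " + target);
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                        "Timed out after " + timeout + " waiting for " + target
                                + "; last status " + current);
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + target, e);
            }
        }
    }
}
```

With a helper of this shape, a job that ends in FAILED while the test waits
for FINISHED fails on the next poll and reports the unexpected state, instead
of polling until the timeout or the CI watchdog fires.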



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
