[
https://issues.apache.org/jira/browse/FLINK-39918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088700#comment-18088700
]
Yuepeng Pan commented on FLINK-39918:
-------------------------------------
Merged into master (2.4.0) via 0156b330949f22d118c00ef7dc8026b1e22ad217
> KeyedComplexChainTest hangs until the CI watchdog kills the fork:
> AbstractOperatorRestoreTestBase waits ~2.7h for a job status that can never
> arrive
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39918
> URL: https://issues.apache.org/jira/browse/FLINK-39918
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.4.0
> Reporter: Martijn Visser
> Assignee: Martijn Visser
> Priority: Major
> Labels: pull-request-available, test-stability
> Fix For: 2.3.0, 2.4.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865&view=results
> (leg: test_ci tests)
> {code}
> 04:23:42 Process produced no output for 900 seconds.
> {code}
> {{org.apache.flink.test.state.operator.restore.keyed.KeyedComplexChainTest}}
> started but never completed; the watchdog killed the surefire fork (exit code
> 143) after 900 s of silence. The surefire dump shows the test thread blocked
> at {{AbstractOperatorRestoreTestBase.restoreJob:257}} on the 1.10 savepoint,
> with no task threads alive.
> Root cause: {{migrateJob}}/{{restoreJob}} wait for specific {{JobStatus}}
> values (RUNNING then CANCELED for the former, FINISHED for the latter) via
> {{retrySuccessfulWithDelay}}, bounded only by {{TEST_TIMEOUT =
> Duration.ofSeconds(10000L)}} (~2.7 hours). If the job reaches a *different*
> globally terminal state (e.g. FAILED), the predicate never matches and the
> wait keeps polling far beyond the 900 s CI watchdog. The watchdog then kills
> the entire fork, which hides both the offending test and the actual job
> failure.
> Historic hang tickets for this test (FLINK-18138, FLINK-12916) are long
> closed and unrelated.
> Proposed fix (following the pattern of FLINK-39879): a {{waitForJobStatus}}
> helper that fails fast when the job reaches a globally terminal state other
> than the target (surfacing the unexpected state), {{TEST_TIMEOUT}} reduced to
> 5 minutes, and {{@Timeout(value = 10, unit = MINUTES)}} on the test template
> as a hard anti-hang guard. This converts the fork-killing hang into a
> localized, diagnosable failure; whether the job legitimately reaches FAILED
> in these restore scenarios may warrant a separate runtime investigation once
> such a failure is captured.
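
For illustration only, the fail-fast wait described in the proposed fix could look roughly like the sketch below. It is not the code from the merged commit. The class name {{WaitForJobStatusSketch}}, the reduced {{Status}} enum (a stand-in for Flink's {{JobStatus}}) and the {{poll}} supplier are hypothetical names chosen to keep the example self-contained.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical, Flink-free sketch of a fail-fast job status wait. */
public class WaitForJobStatusSketch {

    /** Stand-in for Flink's JobStatus, reduced to the states this sketch needs. */
    public enum Status {
        CREATED(false),
        RUNNING(false),
        FINISHED(true),
        CANCELED(true),
        FAILED(true);

        private final boolean globallyTerminal;

        Status(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        public boolean isGloballyTerminal() {
            return globallyTerminal;
        }
    }

    /**
     * Polls until the target status is observed. Throws as soon as the job
     * reaches a different globally terminal state, or when the timeout expires.
     */
    public static void waitForJobStatus(
            Supplier<Status> poll, Status target, Duration timeout, Duration delay) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Status current = poll.get();
            if (current == target) {
                return;
            }
            // A terminal state other than the target can never turn into the
            // target, so report it immediately instead of waiting for the timeout.
            if (current.isGloballyTerminal()) {
                throw new IllegalStateException(
                        "Job reached " + current + " while waiting for " + target);
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                        "Timed out after " + timeout + " waiting for " + target
                                + "; last status: " + current);
            }
            try {
                Thread.sleep(delay.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + target, e);
            }
        }
    }
}
```

In a real test the {{poll}} supplier would query the cluster for the job's current status. With this shape, a job that ends in FAILED while the test waits for FINISHED produces an exception naming the unexpected state on the next poll, rather than a silent wait until the timeout.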