MartijnVisser opened a new pull request, #28408:
URL: https://github.com/apache/flink/pull/28408

## What is the purpose of the change

`KeyedComplexChainTest` (flink-tests) started but never completed in [build 75865](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865) (leg `test_ci tests`); the CI watchdog killed the surefire fork after 900s of no output. The surefire dump shows the test thread blocked at `AbstractOperatorRestoreTestBase.restoreJob` on the 1.10 savepoint.

Root cause: `migrateJob`/`restoreJob` wait for one specific job status (RUNNING, then CANCELED while migrating; FINISHED while restoring) with `FutureUtils.retrySuccessfulWithDelay` against the long `TEST_TIMEOUT` deadline (10000s, ~2.8 hours). If the job instead reaches a *different* globally terminal state, for example FAILED, the predicate never matches and the wait spins until that deadline. That is far past the CI per-fork watchdog, so the fork is killed with no usable error and the actual job failure is lost.

Tracked in FLINK-39918; the historic hang tickets for this test (FLINK-18138, FLINK-12916) are long closed and unrelated.

## Brief change log

- Replace the three duplicated wait/assert blocks with a shared `waitForJobStatus` helper whose status supplier fails fast when the job reaches a globally terminal state other than the target, surfacing the unexpected state (with the job id and target) instead of retrying until the deadline. `FutureUtils.retrySuccessfulWithDelay` aborts immediately on a thrown exception, so the failure propagates without further retries.
- No timeout values are changed and no local `@Timeout` is added; the fail-fast behavior alone prevents the multi-hour spin.

## Verifying this change

This change is already covered by existing tests: `KeyedComplexChainTest` (16 parameterized cases, 0 failures). The shared base also covers the other operator-restore tests.
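
For readers who want the fail-fast idea in isolation, here is a minimal sketch of the pattern described in the change log. It is not the code in this PR: the class `FailFastStatusWait` and the local `JobStatus` enum are hypothetical stand-ins, and a plain polling loop is used in place of `FutureUtils.retrySuccessfulWithDelay` so the example runs without Flink on the classpath. The actual helper in `AbstractOperatorRestoreTestBase` may differ in signature and details.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Sketch only: stand-in types, not the helper added by this PR. */
final class FailFastStatusWait {

    /** Hypothetical stand-in for a job status enum that marks globally terminal states. */
    enum JobStatus {
        CREATED(false), RUNNING(false), FINISHED(true), CANCELED(true), FAILED(true);

        private final boolean globallyTerminal;

        JobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminalState() {
            return globallyTerminal;
        }
    }

    /**
     * Polls the supplier until the target status is seen. Throws as soon as the job
     * reaches any other globally terminal state, instead of polling until the deadline.
     */
    static JobStatus waitForJobStatus(
            String jobId,
            Supplier<JobStatus> statusSupplier,
            JobStatus target,
            Duration retryDelay,
            Duration timeout) {
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            JobStatus current = statusSupplier.get();
            if (current == target) {
                return current;
            }
            // Fail fast: a terminal state other than the target can never become the target.
            if (current.isGloballyTerminalState()) {
                throw new IllegalStateException(
                        "Job " + jobId + " reached globally terminal state " + current
                                + " while waiting for " + target);
            }
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new IllegalStateException(
                        "Timed out waiting for job " + jobId + " to reach " + target
                                + "; last status was " + current);
            }
            try {
                Thread.sleep(retryDelay.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while waiting for job " + jobId, e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated job that fails while the caller waits for FINISHED.
        Iterator<JobStatus> statuses =
                List.of(JobStatus.CREATED, JobStatus.RUNNING, JobStatus.FAILED).iterator();
        try {
            waitForJobStatus(
                    "job-1", statuses::next, JobStatus.FINISHED,
                    Duration.ofMillis(10), Duration.ofSeconds(5));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the wait ends on the third poll with an error naming the job id, the unexpected state, and the target, rather than running until the timeout. The timeout is kept only as a backstop for jobs that never reach any terminal state.
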
## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change)
- The S3 file system connector: no

## Documentation

- Does this pull request introduce a new feature? no

---

##### Was generative AI tooling used to co-author this PR?

- [X] Yes (Claude Opus 4.8 via Claude Code)

Generated-by: Claude Opus 4.8 (1M context)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
