MartijnVisser opened a new pull request, #28408:
URL: https://github.com/apache/flink/pull/28408

   ## What is the purpose of the change
   
   `KeyedComplexChainTest` (flink-tests) started but never completed in [build 
75865](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865)
 (leg `test_ci tests`); the CI watchdog killed the surefire fork after 900s of 
no output. The surefire dump shows the test thread blocked at 
`AbstractOperatorRestoreTestBase.restoreJob` on the 1.10 savepoint.
   
   Root cause: `migrateJob`/`restoreJob` wait for one specific job status 
(RUNNING, then CANCELED while migrating; FINISHED while restoring) using 
`FutureUtils.retrySuccessfulWithDelay` with the long `TEST_TIMEOUT` deadline 
(10000s, ~2.8 hours). If the job instead reaches a *different* globally 
terminal state, for example FAILED, the predicate never matches and the wait 
keeps polling until that deadline. That is far past the CI per-fork watchdog 
(900s without output), so the fork is killed with no usable error and the 
actual job failure is lost.
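
The hang described above can be reproduced in miniature. The following is only a sketch under stated assumptions: `PredicateOnlyWaitSketch`, its `Status` enum, and `waitForExactStatus` are hypothetical names, the polling is a plain loop rather than `FutureUtils.retrySuccessfulWithDelay`, and the deadline is shortened to milliseconds so it can be run locally.

```java
import java.time.Duration;
import java.util.function.Supplier;

class PredicateOnlyWaitSketch {

    /** Hypothetical stand-in for a job status; not Flink's JobStatus. */
    enum Status { RUNNING, FINISHED, CANCELED, FAILED }

    /**
     * Polls until the supplier returns exactly the target status and returns
     * the number of polls. Any other status, terminal or not, is retried
     * until the deadline, at which point an IllegalStateException is thrown.
     */
    static int waitForExactStatus(
            Supplier<Status> statusSupplier, Status target, Duration timeout, Duration delay) {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        int polls = 0;
        while (true) {
            polls++;
            if (statusSupplier.get() == target) {
                return polls;
            }
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new IllegalStateException(
                        "Timed out after " + polls + " polls waiting for " + target);
            }
            try {
                Thread.sleep(delay.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + target, e);
            }
        }
    }

    public static void main(String[] args) {
        // The job has already FAILED, but the wait only looks for FINISHED,
        // so it uses up the whole (here: shortened) deadline before giving up.
        long start = System.nanoTime();
        try {
            waitForExactStatus(
                    () -> Status.FAILED, Status.FINISHED,
                    Duration.ofMillis(200), Duration.ofMillis(20));
        } catch (IllegalStateException e) {
            long elapsedMs = Duration.ofNanos(System.nanoTime() - start).toMillis();
            System.out.println(e.getMessage() + " (after ~" + elapsedMs + " ms)");
        }
    }
}
```

With a 10000s deadline in place of the 200 ms used here, the same loop would outlast a 900s watchdog many times over, and the only message it could ever produce is a timeout that never mentions FAILED.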
   
   Tracked in FLINK-39918; the historic hang tickets for this test 
(FLINK-18138, FLINK-12916) are long closed and unrelated.
   
   ## Brief change log
   
     - Replace the three duplicated wait/assert blocks with a shared 
`waitForJobStatus` helper whose status supplier fails fast when the job reaches 
a globally terminal state other than the target, surfacing the unexpected state 
(with the job id and target) instead of retrying until the deadline. 
`FutureUtils.retrySuccessfulWithDelay` aborts immediately on a thrown 
exception, so the failure propagates without further retries.
     - No timeout values are changed and no local `@Timeout` is added; the 
fail-fast behavior alone prevents the multi-hour spin.
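
The fail-fast idea from the change log can be sketched without any Flink dependencies. This is not the code in this PR: `FailFastWaitSketch`, its `Status` enum, and the `globallyTerminal` flag are hypothetical stand-ins, and a plain polling loop replaces `FutureUtils.retrySuccessfulWithDelay`. Only the name `waitForJobStatus` is borrowed from the description above.

```java
import java.time.Duration;
import java.util.function.Supplier;

class FailFastWaitSketch {

    /** Hypothetical stand-in for a job status; not Flink's JobStatus. */
    enum Status {
        RUNNING(false), FINISHED(true), CANCELED(true), FAILED(true);

        final boolean globallyTerminal;

        Status(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }
    }

    /**
     * Polls until the target status is seen and returns the number of polls.
     * Throws immediately if the job reaches a globally terminal status other
     * than the target, and throws on timeout otherwise.
     */
    static int waitForJobStatus(
            String jobId, Supplier<Status> statusSupplier, Status target,
            Duration timeout, Duration delay) {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        int polls = 0;
        while (true) {
            polls++;
            Status current = statusSupplier.get();
            if (current == target) {
                return polls;
            }
            if (current.globallyTerminal) {
                // Fail fast: the job can never reach the target from here.
                throw new IllegalStateException(
                        "Job " + jobId + " reached " + current + " while waiting for " + target);
            }
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new IllegalStateException(
                        "Job " + jobId + " timed out waiting for " + target);
            }
            try {
                Thread.sleep(delay.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + target, e);
            }
        }
    }

    public static void main(String[] args) {
        try {
            // Long deadline, but the FAILED status is reported on the first poll.
            waitForJobStatus("job-1", () -> Status.FAILED, Status.FINISHED,
                    Duration.ofSeconds(10_000), Duration.ofMillis(50));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the deadline stays at 10000s, mirroring the point that no timeout value needs to change: the check for an unexpected terminal state is what ends the wait early and names the state that was actually reached.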
   
   ## Verifying this change
   
   This change is already covered by existing tests: `KeyedComplexChainTest` 
passes (16 parameterized cases, 0 failures). The other operator-restore tests 
extend the same base class, so they exercise the new helper as well.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change)
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Opus 4.8 via Claude Code)
   
   Generated-by: Claude Opus 4.8 (1M context)
   

