MartijnVisser opened a new pull request, #28404: URL: https://github.com/apache/flink/pull/28404
## What is the purpose of the change

`RestoreTestBase#testRestore` (the `AfterRestoreSource.INFINITE` path) and the savepoint-generation path wait on `CompletableFuture.allOf(...).get()` with no timeout. The futures only complete when a sink observer sees an exact match of the expected results. `PROCTIME()` window boundaries come from the wall clock, so records can occasionally split across windows differently than when the expected data was captured; the match then never happens and the surefire fork hangs until the CI watchdog kills it after 900s of no output (exit code 143), hiding which test stalled.

A thread dump from [build 75906](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75906) shows the test parked in that `get()` at `RestoreTestBase.testRestore` with the MiniCluster job alive and fully idle.

This is the failure reported in FLINK-35562 (`WindowTableFunctionProcTimeRestoreTest`); `GroupWindowAggregateProcTimeRestoreTest` (FLINK-34404) shares the base class and failure mode.

Note on scope: this change does not remove the underlying wall-clock dependence of the proc-time programs. It converts a rare leg-killing 900s hang into a rare, localized, diagnosable test failure that names the program and prints actual vs expected sink results within 5 minutes. Making the proc-time restore programs deterministic is left as follow-up under the JIRA tickets.

## Brief change log

- Add `RESULT_AWAIT_TIMEOUT_MILLIS` (5 minutes) as the upper bound for awaiting the expected sink results.
- Extract `awaitExpectedResults(program, futures)`, which on timeout throws an `AssertionError` naming the program id and reporting each sink's actual results.
- Route both previously unbounded `.get()` call sites (the INFINITE restore path and the savepoint-generation path) through it.

No assertion logic or test data is changed. Only the two proc-time restore tests use the INFINITE path; both complete in ~20s when healthy. The second call site is inside the `@Disabled` savepoint generator and never runs in CI.

## Verifying this change

This change is already covered by existing tests:

- `WindowTableFunctionProcTimeRestoreTest` passes (4 run, 0 failures).
- `GroupWindowAggregateProcTimeRestoreTest` (sibling, shares the base) passes (4 run, 0 failures).
- `WindowAggregateEventTimeRestoreTest` (FINITE path regression check) passes (25 run, 0 failures).

The hang itself is an intermittent CI condition (a processing-time window split) and is not reliably reproducible on a fast developer machine.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no

## Documentation

- Does this pull request introduce a new feature? no

---

##### Was generative AI tooling used to co-author this PR?

- [X] Yes (Claude Opus 4.8 via Claude Code)

Generated-by: Claude Opus 4.8 (1M context)
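
For readers who have not opened the diff, the bounded wait described in the change log can be sketched roughly as below. This is an illustration only, not the code in the PR: the class name `BoundedAwaitSketch`, the `String programId` parameter, the `Supplier` of per-sink results, and the explicit `timeoutMillis` argument are all assumptions made so the example is self-contained. The actual helper takes `(program, futures)` and uses `RESULT_AWAIT_TIMEOUT_MILLIS`.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class BoundedAwaitSketch {

    /**
     * Waits until every future completes, or fails with a descriptive
     * AssertionError once the timeout elapses.
     *
     * programId and actualResults are hypothetical stand-ins for the test
     * program and the sink contents observed so far.
     */
    static void awaitExpectedResults(
            String programId,
            List<CompletableFuture<?>> futures,
            Supplier<Map<String, List<String>>> actualResults,
            long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            // Bounded replacement for CompletableFuture.allOf(...).get()
            CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]))
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Name the program and show what the sinks actually received,
            // so the failure is diagnosable instead of a silent hang.
            throw new AssertionError(
                    "Program '" + programId
                            + "' did not produce the expected sink results within "
                            + timeoutMillis + " ms. Actual results per sink: "
                            + actualResults.get(),
                    e);
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that never completes stands in for a sink that never matches.
        CompletableFuture<Void> neverMatches = new CompletableFuture<>();
        try {
            awaitExpectedResults(
                    "demo-program",
                    List.of(neverMatches),
                    () -> Map.of("sink_t", List.of("+I[1, a]")),
                    100L);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Throwing an `AssertionError` rather than letting the `TimeoutException` propagate keeps the failure in the test's own report, with the program id and sink contents attached, which is the diagnosability goal stated in the scope note.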
