MartijnVisser opened a new pull request, #28404:
URL: https://github.com/apache/flink/pull/28404

   ## What is the purpose of the change
   
   `RestoreTestBase#testRestore` (the `AfterRestoreSource.INFINITE` path) and 
the savepoint-generation path wait on `CompletableFuture.allOf(...).get()` with 
no timeout. The futures only complete when a sink observer sees an exact match 
of the expected results. `PROCTIME()` window boundaries come from the wall 
clock, so records can occasionally split across windows differently than when 
the expected data was captured. When that happens the match never occurs, and 
the surefire fork hangs until the CI watchdog kills it after 900s of no output 
(exit code 143), obscuring which test stalled. A thread dump from 
[build 75906](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75906) 
shows the test parked in that `get()` at `RestoreTestBase.testRestore` with 
the MiniCluster job alive and fully idle.
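
   To illustrate the window split described above, here is a minimal standalone sketch. It is not Flink code: `ProcTimeWindowSplit`, `windowStart`, and `assign` are hypothetical names, and the window-start formula is a simplified tumbling-window assignment that assumes non-negative timestamps and a zero offset. It only shows how the same records with the same spacing can be grouped differently depending on where the wall clock happens to be.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ProcTimeWindowSplit {

    /** Start of the tumbling window containing the timestamp (assumes ts >= 0, zero offset). */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    /** Groups records by the window that their (wall-clock) arrival time falls into. */
    static Map<Long, List<String>> assign(
            String[] records, long[] arrivalMillis, long sizeMillis) {
        Map<Long, List<String>> windows = new TreeMap<>();
        for (int i = 0; i < records.length; i++) {
            long start = windowStart(arrivalMillis[i], sizeMillis);
            windows.computeIfAbsent(start, k -> new ArrayList<>()).add(records[i]);
        }
        return windows;
    }

    public static void main(String[] args) {
        String[] records = {"a", "b", "c"};
        long size = 5_000L; // 5 second tumbling window

        // Run 1: all three records arrive inside a single window.
        System.out.println(assign(records, new long[] {1_000, 1_010, 1_020}, size));

        // Run 2: same records, same 10 ms spacing, but the clock crosses a boundary.
        System.out.println(assign(records, new long[] {4_985, 4_995, 5_005}, size));
    }
}
```

   The first run prints `{0=[a, b, c]}` and the second prints `{0=[a, b], 5000=[c]}`. An observer waiting for an exact match against the first grouping would never be satisfied by the second.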
   
   This is the failure reported in FLINK-35562 
(`WindowTableFunctionProcTimeRestoreTest`); 
`GroupWindowAggregateProcTimeRestoreTest` (FLINK-34404) shares the base class 
and failure mode.
   
   Note on scope: this change does not remove the underlying wall-clock 
dependence of the proc-time programs. It converts a rare 900s hang that kills 
the entire CI leg into a rare, localized, diagnosable test failure that names 
the program and prints actual vs expected sink results within 5 minutes. Making 
the proc-time restore programs deterministic is left as follow-up under the 
JIRA tickets.
   
   ## Brief change log
   
     - Add `RESULT_AWAIT_TIMEOUT_MILLIS` (5 minutes) as the upper bound for 
awaiting the expected sink results.
     - Extract `awaitExpectedResults(program, futures)`, which on timeout 
throws an `AssertionError` naming the program id and reporting each sink's 
actual results.
     - Route both previously unbounded `.get()` call sites (the INFINITE 
restore path and the savepoint-generation path) through it.
   
   No assertion logic or test data is changed. Only the two proc-time restore 
tests use the INFINITE path; both complete in ~20s when healthy. The second 
call site is inside the `@Disabled` savepoint generator and never runs in CI.
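
   The bounded wait in the change log can be sketched roughly as follows. This is a standalone approximation, not the code in this PR: the class name `BoundedAwait`, the `actualResults` supplier, the explicit `timeoutMillis` parameter, and the exact message wording are assumptions made for illustration (the real helper is described as `awaitExpectedResults(program, futures)` with a 5 minute constant).

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BoundedAwait {

    /**
     * Waits for all futures, but at most timeoutMillis. On timeout, fails with an
     * AssertionError that names the program and includes the results seen so far.
     * Signature is hypothetical; it only mirrors the idea described in the PR.
     */
    static void awaitExpectedResults(
            String programId,
            List<? extends CompletableFuture<?>> futures,
            Supplier<String> actualResults,
            long timeoutMillis) {
        try {
            CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]))
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new AssertionError(
                    "Program '" + programId + "' did not produce the expected sink results within "
                            + timeoutMillis + " ms. Actual results: " + actualResults.get(),
                    e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new AssertionError("Interrupted while awaiting program '" + programId + "'", e);
        } catch (ExecutionException e) {
            throw new AssertionError(
                    "A sink future of program '" + programId + "' failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // A future that never completes stands in for a sink whose match never happens.
        CompletableFuture<Void> neverMatches = new CompletableFuture<>();
        try {
            awaitExpectedResults("demo-program", List.of(neverMatches), () -> "[+I[a, 1]]", 100L);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

   The point of the design is that a timeout turns into an ordinary test failure with context, rather than a thread parked in an unbounded `get()`.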
   
   ## Verifying this change
   
   This change is already covered by existing tests:
   
     - `WindowTableFunctionProcTimeRestoreTest` passes (4 run, 0 failures).
     - `GroupWindowAggregateProcTimeRestoreTest` (sibling, shares the base) 
passes (4 run, 0 failures).
     - `WindowAggregateEventTimeRestoreTest` (FINITE path regression check) 
passes (25 run, 0 failures).
   
   The hang itself is an intermittent CI condition (a processing-time window 
split) and is not reliably reproducible on a fast developer machine.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Opus 4.8 via Claude Code)
   
   Generated-by: Claude Opus 4.8 (1M context)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
