MartijnVisser opened a new pull request, #28405:
URL: https://github.com/apache/flink/pull/28405

   ## What is the purpose of the change
   
   `SystemProcessingTimeServiceTest.testScheduleAtFixedRate` timed out after 
10s in [build 
75865](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865)
 (leg `test_cron_azure core`), reporting only a bare `TimeoutException`.
   
   Root cause (two compounding test bugs, both involving the timing assertions 
added by FLINK-38932):
   
   1. The callback compared each scheduled timestamp against an 
`initialTimestamp` sampled by a *separate* `getCurrentProcessingTime()` call in 
the test, while the service captures its own scheduling base inside 
`scheduleAtFixedRate`. On a loaded CI machine the gap between those two clock 
reads can exceed the 10ms period, failing the first `isCloseTo` assertion.
   2. The resulting `AssertionError` (an `Error`, not an `Exception`) was 
wrapped and rethrown from the callback. `ScheduledThreadPoolExecutor` silently 
suppresses all further executions of a periodic task whose `run()` throws, so 
the `CountDownLatch` never reached zero and the test hung to its timeout, 
masking the real assertion failure.
   
   FLINK-34921 is the open test-stability ticket for this test class; the 
originally reported quiesce-path hang (2024) is a different method, so this 
change complements rather than closes that report. See the JIRA comment for 
details.
   
   ## Brief change log
   
     - Anchor the fixed-rate spacing check on the first observed timestamp 
instead of a racy second clock read (fixed-rate timestamps advance by exactly 
one period per execution regardless of execution jitter).
     - On any callback throwable, record it into `errorRef` (shared with the 
service's async exception handler, first-error-wins) and still count down, so a 
genuine drift fails fast with the real assertion instead of hanging.
     - In `testScheduleAtFixedDelay`, remove the first-execution check: the 
only non-racy reference point would be the timestamp itself, making the check 
vacuous. The meaningful fixed-delay verification against the previous 
execution's observed completion time is kept.
   
   The pre-existing local `@Timeout(10000ms)` on both tests predates this 
change and is intentionally left untouched.
   
   ## Verifying this change
   
   This change is already covered by existing tests: 
`SystemProcessingTimeServiceTest` (8 run, 0 failures).
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Opus 4.8 via Claude Code)
   
   Generated-by: Claude Opus 4.8 (1M context)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to