MartijnVisser opened a new pull request, #28405: URL: https://github.com/apache/flink/pull/28405
## What is the purpose of the change `SystemProcessingTimeServiceTest.testScheduleAtFixedRate` timed out after 10s in [build 75865](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865) (leg `test_cron_azure core`), reporting only a bare `TimeoutException`. Root cause (two compounding test bugs, both involving the timing assertions added by FLINK-38932): 1. The callback compared each scheduled timestamp against an `initialTimestamp` sampled by a *separate* `getCurrentProcessingTime()` call in the test, while the service captures its own scheduling base inside `scheduleAtFixedRate`. On a loaded CI machine the gap between those two clock reads can exceed the 10ms period, failing the first `isCloseTo` assertion. 2. The resulting `AssertionError` (an `Error`, not an `Exception`) was wrapped and rethrown from the callback. `ScheduledThreadPoolExecutor` silently suppresses all further executions of a periodic task whose `run()` throws, so the `CountDownLatch` never reached zero and the test hung to its timeout, masking the real assertion failure. FLINK-34921 is the open test-stability ticket for this test class; the originally reported quiesce-path hang (2024) is a different method, so this change complements rather than closes that report. See the JIRA comment for details. ## Brief change log - Anchor the fixed-rate spacing check on the first observed timestamp instead of a racy second clock read (fixed-rate timestamps advance by exactly one period per execution regardless of execution jitter). - On any callback throwable, record it into `errorRef` (shared with the service's async exception handler, first-error-wins) and still count down, so a genuine drift fails fast with the real assertion instead of hanging. - In `testScheduleAtFixedDelay`, remove the first-execution check: the only non-racy reference point would be the timestamp itself, making the check vacuous. The meaningful fixed-delay verification against the previous execution's observed completion time is kept. The pre-existing local `@Timeout(10000ms)` on both tests predates this change and is intentionally left untouched. ## Verifying this change This change is already covered by existing tests: `SystemProcessingTimeServiceTest` (8 run, 0 failures). ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers: no - The runtime per-record code paths (performance sensitive): no - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no --- ##### Was generative AI tooling used to co-author this PR? - [X] Yes (Claude Opus 4.8 via Claude Code) Generated-by: Claude Opus 4.8 (1M context) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
