MartijnVisser opened a new pull request, #28587: URL: https://github.com/apache/flink/pull/28587
## What is the purpose of the change Fixes a test flakiness regression introduced by FLINK-38536 (#28350) and observed in nightly build [76520](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76520&view=results) (legs `test_cron_azure core` and `test_cron_hadoop313 core`): ``` ExecutionGraphFinishTest.testJobFinishes:114 expected: 2 but was: 1 ``` **Root cause.** Both `ExecutionGraphFinishTest` and `FinalizeOnMasterTest` build the scheduler with a real future executor. `Execution.deploy()` sets `DEPLOYING` synchronously, but offloads `TaskDeploymentDescriptor` creation to that future executor and syncs the deploy-completion callback back to the main thread asynchronously. `Execution.tryGetTaskDeploymentDescriptorForSlot` fails the deployment (via `markFailed`) when it runs and finds the execution is no longer `DEPLOYING`. The tests drive executions to `RUNNING` (`switchAllVerticesToRunning`) before that callback runs, so it fails the deployment; with the default `NoRestartBackoffTimeStrategy` the job fails and a vertex never reaches `FINISHED`, leaving `getNumExecutionVertexFinished()` short by one. The #28350 rewrite (a dedicated single-thread main executor with blocking `runInMainThread`/`supplyInMainThread` helpers) addressed the original main-thread-assertion race but left this async deployment-callback race, and the tight blocking calls made it more likely to trip. ## Brief change log - Run scheduling and deployment synchronously on the test thread by building the scheduler with `ComponentMainThreadExecutorServiceAdapter.forMainThread()` and a `DirectScheduledExecutorService` future executor, so the deployment callbacks complete inline while executions are still `DEPLOYING`. This is the established idiom in the package (e.g. `DefaultExecutionGraphDeploymentTest`). - Supersede the `forSingleThreadExecutor` approach from #28350 and remove the `runInMainThread`/`supplyInMainThread` helpers it introduced. Note: removing `eg.waitUntilTerminal()` in `ExecutionGraphFinishTest.testJobFinishes` is safe and equivalent here, because deployment and the final state transition are now reached synchronously by the last `markFinished()`. ## Verifying this change This change is already covered by the existing tests `ExecutionGraphFinishTest` and `FinalizeOnMasterTest`, and was verified by reproducing the flake before the fix and confirming it is gone after: - Before the fix, `ExecutionGraphFinishTest#testJobFinishes` (run as `@RepeatedTest(2000)`) failed ~2% of repetitions; temporary instrumentation in `tryGetTaskDeploymentDescriptorForSlot` showed every failure was preceded by the callback running while the execution was already `RUNNING`. - With the fix, `ExecutionGraphFinishTest` (2000 repetitions) and `FinalizeOnMasterTest` (2000 repetitions per method) pass with zero failures and zero instrumentation hits. - `mvn -pl flink-runtime spotless:check` passes. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers: no - The runtime per-record code paths (performance sensitive): no - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change) - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no --- ##### Was generative AI tooling used to co-author this PR? - [X] Yes (please specify the tool below) Generated-by: Claude Opus 4.8 (1M context) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
