[PR] [FLINK-38536][tests] Deploy synchronously to avoid deployment-callback race [flink]

via GitHub Tue, 30 Jun 2026 06:17:53 -0700


MartijnVisser opened a new pull request, #28587:
URL: https://github.com/apache/flink/pull/28587


   ## What is the purpose of the change
   
   Fixes a test flakiness regression introduced by FLINK-38536 (#28350) and 
observed in nightly build 
[76520](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76520&view=results)
 (legs `test_cron_azure core` and `test_cron_hadoop313 core`):
   
   ```
   ExecutionGraphFinishTest.testJobFinishes:114
   expected: 2
    but was: 1
   ```
   
   **Root cause.** Both `ExecutionGraphFinishTest` and `FinalizeOnMasterTest` 
build the scheduler with a real future executor. `Execution.deploy()` sets 
`DEPLOYING` synchronously, but offloads `TaskDeploymentDescriptor` creation to 
that future executor and syncs the deploy-completion callback back to the main 
thread asynchronously. `Execution.tryGetTaskDeploymentDescriptorForSlot` fails 
the deployment (via `markFailed`) when it runs and finds the execution is no 
longer `DEPLOYING`. The tests drive executions to `RUNNING` 
(`switchAllVerticesToRunning`) before that callback runs, so it fails the 
deployment; with the default `NoRestartBackoffTimeStrategy` the job fails and a 
vertex never reaches `FINISHED`, leaving `getNumExecutionVertexFinished()` 
short by one.
   
   The #28350 rewrite (a dedicated single-thread main executor with blocking 
`runInMainThread`/`supplyInMainThread` helpers) addressed the original 
main-thread-assertion race but left this async deployment-callback race, and 
the tight blocking calls made it more likely to trip.
   
   ## Brief change log
   
     - Run scheduling and deployment synchronously on the test thread by 
building the scheduler with 
`ComponentMainThreadExecutorServiceAdapter.forMainThread()` and a 
`DirectScheduledExecutorService` future executor, so the deployment callbacks 
complete inline while executions are still `DEPLOYING`. This is the established 
idiom in the package (e.g. `DefaultExecutionGraphDeploymentTest`).
     - Supersede the `forSingleThreadExecutor` approach from #28350 and remove 
the `runInMainThread`/`supplyInMainThread` helpers it introduced.
   
   Note: removing `eg.waitUntilTerminal()` in 
`ExecutionGraphFinishTest.testJobFinishes` is safe and equivalent here, because 
deployment and the final state transition are now reached synchronously by the 
last `markFinished()`.
   
   ## Verifying this change
   
   This change is already covered by the existing tests 
`ExecutionGraphFinishTest` and `FinalizeOnMasterTest`, and was verified by 
reproducing the flake before the fix and confirming it is gone after:
   
     - Before the fix, `ExecutionGraphFinishTest#testJobFinishes` (run as 
`@RepeatedTest(2000)`) failed ~2% of repetitions; temporary instrumentation in 
`tryGetTaskDeploymentDescriptorForSlot` showed every failure was preceded by 
the callback running while the execution was already `RUNNING`.
     - With the fix, `ExecutionGraphFinishTest` (2000 repetitions) and 
`FinalizeOnMasterTest` (2000 repetitions per method) pass with zero failures 
and zero instrumentation hits.
     - `mvn -pl flink-runtime spotless:check` passes.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change)
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (please specify the tool below)
   
   Generated-by: Claude Opus 4.8 (1M context)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [FLINK-38536][tests] Deploy synchronously to avoid deployment-callback race [flink]

Reply via email to