MartijnVisser opened a new pull request, #28350: URL: https://github.com/apache/flink/pull/28350
## What is the purpose of the change This pull request fixes an intermittent CI failure in `FinalizeOnMasterTest` (reported in [FLINK-38536](https://issues.apache.org/jira/browse/FLINK-38536), e.g. [build 75697](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75697)) and the same latent data race in its sibling `ExecutionGraphFinishTest`. The flakiness is a test-harness defect, not a production bug. Both tests wired the scheduler with `ComponentMainThreadExecutorServiceAdapter.forMainThread()` as the JobManager main-thread executor while passing a *separate* single-threaded I/O executor. `forMainThread()` is backed by a `DirectScheduledExecutorService`, whose `execute()` runs the submitted command **inline on the calling thread** rather than confining work to one dedicated main thread. `Execution#deploy()` builds the `TaskDeploymentDescriptor` on the I/O executor via `CompletableFuture.supplyAsync(..., ioExecutor)` and then composes the continuation back to the main thread with `thenComposeAsync(..., mainThreadExecutor)`. Because `forMainThread()` executes inline, those continuations — TDD creation, task submission, and the deployment-completion handling that can call `markFailed` — ran on the I/O thread, concurrently with the test thread that was still inside `startScheduling()` (and, in `ExecutionGraphFinishTest`, subsequently mutating state via `markFinished()`). This unsynchronized concurrent mutation of the execution graph produced the two observed signatures: - `IllegalStateException: BUG: trying to schedule a region which is not in CREATED state` (a region's vertices were mutated mid-scheduling), and - `expected: RUNNING but was: FAILING` (a background deployment callback failed an execution and triggered failover). The async I/O-executor deploy path was introduced recently by [FLINK-38114](https://issues.apache.org/jira/browse/FLINK-38114) (asynchronous offloading of `TaskRestore`), which is why these tests started flaking now. The production scheduling code is unchanged; the fix is confined to the test harness. ## Brief change log - `FinalizeOnMasterTest`: replace `forMainThread()` with a dedicated single-threaded JobManager main-thread executor (`forSingleThreadExecutor` over the existing `TestingUtils.jmMainThreadExecutorExtension()`), keeping the separate I/O executor. All execution-graph interactions are routed through `runInMainThread` / `supplyInMainThread` helpers so the asynchronous deployment callbacks are serialized with the test logic instead of racing on the I/O thread. The temporary debug logging added under FLINK-38536 by PR #27168 is removed, as the root cause is now fixed. - `ExecutionGraphFinishTest`: apply the identical remedy to `testJobFinishes()`, which shares the same wiring and code path and is therefore exposed to the same race. No assertion or functional test logic was changed in either test; this is purely a threading-model fix in the test harness. The change mirrors the pattern already used by `ExecutionGraphSuspendTest` and `SchedulerTestingUtils`. ## Verifying this change This change is already covered by the existing tests, which it stabilizes: - `mvn test -pl flink-runtime -Dtest='FinalizeOnMasterTest,ExecutionGraphFinishTest'` passes (3 tests, 0 failures), and `spotless:check` passes. - The original flake requires the concurrent I/O-thread timing of CI and is not reproducible locally in isolation, which is consistent with the diagnosed race; the fix removes the cross-thread access entirely rather than widening a timing window. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers: no - The runtime per-record code paths (performance sensitive): no - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change) - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable ## AI-assisted contributions - [X] Yes — generative AI tooling was used (Claude Code, Opus 4.8). Generated-by: Claude Opus 4.8 (1M context) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
