MartijnVisser opened a new pull request, #28350:
URL: https://github.com/apache/flink/pull/28350

   ## What is the purpose of the change
   
   This pull request fixes an intermittent CI failure in `FinalizeOnMasterTest`
   (reported in 
[FLINK-38536](https://issues.apache.org/jira/browse/FLINK-38536),
   e.g. [build 
75697](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75697))
   and the same latent data race in its sibling `ExecutionGraphFinishTest`.
   
   The flakiness is a test-harness defect, not a production bug. Both tests 
wired the
   scheduler with `ComponentMainThreadExecutorServiceAdapter.forMainThread()` 
as the
   JobManager main-thread executor while passing a *separate* single-threaded 
I/O
   executor. `forMainThread()` is backed by a `DirectScheduledExecutorService`, 
whose
   `execute()` runs the submitted command **inline on the calling thread** 
rather than
   confining work to one dedicated main thread.
   
   `Execution#deploy()` builds the `TaskDeploymentDescriptor` on the I/O 
executor via
   `CompletableFuture.supplyAsync(..., ioExecutor)` and then composes the 
continuation
   back to the main thread with `thenComposeAsync(..., mainThreadExecutor)`. 
Because
   `forMainThread()` executes inline, those continuations — TDD creation, task
   submission, and the deployment-completion handling that can call 
`markFailed` — ran
   on the I/O thread, concurrently with the test thread that was still inside
   `startScheduling()` (and, in `ExecutionGraphFinishTest`, subsequently 
mutating state
   via `markFinished()`). This unsynchronized concurrent mutation of the 
execution graph
   produced the two observed signatures:
   
   - `IllegalStateException: BUG: trying to schedule a region which is not in 
CREATED state`
     (a region's vertices were mutated mid-scheduling), and
   - `expected: RUNNING but was: FAILING` (a background deployment callback 
failed an
     execution and triggered failover).
   
   The async I/O-executor deploy path was introduced recently by
   [FLINK-38114](https://issues.apache.org/jira/browse/FLINK-38114) 
(asynchronous
   offloading of `TaskRestore`), which is why these tests started flaking now. 
The
   production scheduling code is unchanged; the fix is confined to the test 
harness.
   
   ## Brief change log
   
   - `FinalizeOnMasterTest`: replace `forMainThread()` with a dedicated 
single-threaded
     JobManager main-thread executor (`forSingleThreadExecutor` over the 
existing
     `TestingUtils.jmMainThreadExecutorExtension()`), keeping the separate I/O 
executor.
     All execution-graph interactions are routed through `runInMainThread` /
     `supplyInMainThread` helpers so the asynchronous deployment callbacks are 
serialized
     with the test logic instead of racing on the I/O thread. The temporary 
debug logging
     added under FLINK-38536 by PR #27168 is removed, as the root cause is now 
fixed.
   - `ExecutionGraphFinishTest`: apply the identical remedy to 
`testJobFinishes()`, which
     shares the same wiring and code path and is therefore exposed to the same 
race.
   
   No assertion or functional test logic was changed in either test; this is 
purely a
   threading-model fix in the test harness. The change mirrors the pattern 
already used
   by `ExecutionGraphSuspendTest` and `SchedulerTestingUtils`.
   
   ## Verifying this change
   
   This change is already covered by the existing tests, which it stabilizes:
   
   - `mvn test -pl flink-runtime 
-Dtest='FinalizeOnMasterTest,ExecutionGraphFinishTest'`
     passes (3 tests, 0 failures), and `spotless:check` passes.
   - The original flake requires the concurrent I/O-thread timing of CI and is 
not
     reproducible locally in isolation, which is consistent with the diagnosed 
race; the
     fix removes the cross-thread access entirely rather than widening a timing 
window.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components),
       Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change)
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   
   ## AI-assisted contributions
   
   - [X] Yes — generative AI tooling was used (Claude Code, Opus 4.8).
   
   Generated-by: Claude Opus 4.8 (1M context)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to