MartijnVisser opened a new pull request, #28406:
URL: https://github.com/apache/flink/pull/28406

   ## What is the purpose of the change
   
   `JobMasterTriggerSavepointITCase.testDoNotCancelJobIfSavepointFails` failed 
in [build 75865](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865) 
(leg `test_cron_azure tests`): the assertion that `StandaloneResourceManager` 
logged `Disconnect job manager .*` found only the "Registering/Registered job 
manager" events.
   
   Root cause: the test cancels the job, and `waitForDisconnect` waits only for 
the client-visible status to reach `CANCELED` before `verifyJobIdIsLogged` 
asserts that the disconnect was logged. The JobMaster disconnects from the 
ResourceManager asynchronously while shutting down, which happens *after* the 
job reaches `CANCELED`. The run logs show the window: the job reached 
`CANCELED` at `06:17:51,115`, the JobMaster began stopping at `06:17:51,136`, 
and the verification ran in between.
   
   FLINK-37821 fixed an earlier failure of this test on a different signal; 
this is a distinct race against the RM disconnect logging, tracked in 
FLINK-39917.
   
   ## Brief change log
   
     - In `waitForDisconnect`, after the CANCELED wait, additionally wait until 
the ResourceManager has actually logged the "Disconnect job manager" event 
before returning. What is asserted does not change.
     - Factor the disconnect log prefix into a shared constant used by both the 
new wait predicate and `verifyJobIdIsLogged`'s regex, so the two cannot drift.
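
The two changes above can be illustrated with a minimal, self-contained sketch. 
It is not the code in this PR: the class, the in-memory message list, and the 
`waitUntil` helper are hypothetical stand-ins for the test's log capture and 
wait utilities, and the log lines are placeholders. It only shows a polling wait 
and a regex built from one shared prefix constant.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BooleanSupplier;
import java.util.regex.Pattern;

// Hypothetical sketch; names below are illustrative, not taken from the Flink test.
class DisconnectWaitSketch {
    // Single source of truth for the log prefix (assumed constant name).
    static final String DISCONNECT_LOG_PREFIX = "Disconnect job manager";

    // Stand-in for log messages captured from the ResourceManager logger.
    static final List<String> capturedMessages = new CopyOnWriteArrayList<>();

    // Wait predicate: true once any captured message starts with the prefix.
    static boolean disconnectLogged() {
        return capturedMessages.stream().anyMatch(m -> m.startsWith(DISCONNECT_LOG_PREFIX));
    }

    // Verification regex derived from the same constant, so the two cannot drift.
    static Pattern disconnectPattern() {
        return Pattern.compile(Pattern.quote(DISCONNECT_LOG_PREFIX) + " .*");
    }

    // Polls the condition every 10 ms until it holds or the timeout expires.
    static void waitUntil(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline > 0) {
                throw new IllegalStateException("Condition not met within " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        capturedMessages.add("Registered job manager <placeholder>");
        // Simulates the asynchronous disconnect that is logged after CANCELED.
        Thread lateLogger = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            capturedMessages.add("Disconnect job manager <placeholder>");
        });
        lateLogger.start();

        // Without this wait, the check below could run before the message exists.
        waitUntil(DisconnectWaitSketch::disconnectLogged, 5_000);
        boolean matched = capturedMessages.stream()
                .anyMatch(m -> disconnectPattern().matcher(m).matches());
        System.out.println("disconnect logged: " + matched);
    }
}
```

In this sketch the assertion itself is unchanged; only the point at which it 
runs moves to after the event it checks for has been observed.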
   
   ## Verifying this change
   
   This change is already covered by existing tests: 
`JobMasterTriggerSavepointITCase` (4 tests run, 0 failures).
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Opus 4.8 via Claude Code)
   
   Generated-by: Claude Opus 4.8 (1M context)
   

