[ 
https://issues.apache.org/jira/browse/FLINK-39917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088699#comment-18088699
 ] 

Yuepeng Pan commented on FLINK-39917:
-------------------------------------

Merged into master (2.4.0) via 4f6db6ea8a58f1109b4aa8c8333512d3eb0e655f

> JobMasterTriggerSavepointITCase.testDoNotCancelJobIfSavepointFails: 
> "Disconnect job manager" log assertion races the async JM->RM disconnect
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39917
>                 URL: https://issues.apache.org/jira/browse/FLINK-39917
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>            Reporter: Martijn Visser
>            Assignee: Martijn Visser
>            Priority: Major
>              Labels: pull-request-available, test-stability
>             Fix For: 2.3.0, 2.4.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865&view=results
>  (leg: test_cron_azure tests)
> {code}
> 06:17:51.991 [ERROR] org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.testDoNotCancelJobIfSavepointFails(ClusterClient) -- Time elapsed: 0.394 s <<< FAILURE!
> java.lang.AssertionError:
> [not all expected events logged by org.apache.flink.runtime.resourcemanager.StandaloneResourceManager, logged:
> [... Message=Registering job manager ..., ... Message=Registered job manager ...]]
> Expecting empty but was: [Disconnect job manager .*]
>         at org.apache.flink.util.JobIDLoggingUtil.assertKeyPresent(JobIDLoggingUtil.java:98)
>         at org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.verifyJobIdIsLogged(JobMasterTriggerSavepointITCase.java:280)
> {code}
> Root cause: {{waitForDisconnect}} cancels the job and waits for the 
> client-visible {{CANCELED}} status, then {{verifyJobIdIsLogged}} asserts that 
> {{StandaloneResourceManager}} logged "Disconnect job manager ...". The 
> JobMaster disconnects from the ResourceManager asynchronously during 
> shutdown, *after* the job reports CANCELED. The run logs confirm the window: 
> job CANCELED at 06:17:51,115, JobMaster began stopping at 06:17:51,136, and 
> the assertion ran in between, capturing only the "Registering/Registered job 
> manager" events.
> This is not the same failure as FLINK-37821 (closed), which addressed a 
> different signal in this test.
> Proposed fix: in {{waitForDisconnect}}, after the CANCELED wait, additionally 
> wait until the RM has actually logged the disconnect event before returning. 
> No assertion change.
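
For illustration only, the extra wait described in the proposed fix could look roughly like the sketch below. This is not the code from the merged commit: {{DisconnectWaitSketch}} and {{waitUntilLogged}} are hypothetical names, and the captured ResourceManager log is modeled as a plain list of strings rather than Flink's logging test utilities. The idea is simply to poll the captured messages until one matches "Disconnect job manager .*" or a timeout expires, before the existing assertion runs.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not Flink's actual test utility.
final class DisconnectWaitSketch {

    private DisconnectWaitSketch() {}

    /**
     * Polls the captured log messages until one of them matches the expected
     * pattern or the timeout expires.
     *
     * @return true if a matching message was seen in time, false on timeout
     *     or if the waiting thread was interrupted
     */
    static boolean waitUntilLogged(
            Supplier<List<String>> capturedMessages, Pattern expected, Duration timeout) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            // Re-read the messages on every iteration: the disconnect is logged
            // asynchronously, after the job has already reported CANCELED.
            for (String message : capturedMessages.get()) {
                if (expected.matcher(message).find()) {
                    return true;
                }
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10L);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a test, such a helper would be called at the end of {{waitForDisconnect}}, after the CANCELED wait, with a supplier that returns the messages captured so far. The assertion in {{verifyJobIdIsLogged}} stays unchanged; it just no longer runs inside the race window.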



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
