[ https://issues.apache.org/jira/browse/FLINK-39917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088699#comment-18088699 ]
Yuepeng Pan commented on FLINK-39917:
-------------------------------------
Merged into master (2.4.0) via 4f6db6ea8a58f1109b4aa8c8333512d3eb0e655f
> JobMasterTriggerSavepointITCase.testDoNotCancelJobIfSavepointFails:
> "Disconnect job manager" log assertion races the async JM->RM disconnect
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39917
> URL: https://issues.apache.org/jira/browse/FLINK-39917
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Reporter: Martijn Visser
> Assignee: Martijn Visser
> Priority: Major
> Labels: pull-request-available, test-stability
> Fix For: 2.3.0, 2.4.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75865&view=results
> (leg: test_cron_azure tests)
> {code}
> 06:17:51.991 [ERROR] org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.testDoNotCancelJobIfSavepointFails(ClusterClient) -- Time elapsed: 0.394 s <<< FAILURE!
> java.lang.AssertionError: [not all expected events logged by org.apache.flink.runtime.resourcemanager.StandaloneResourceManager, logged: [... Message=Registering job manager ..., ... Message=Registered job manager ...]]
> Expecting empty but was: [Disconnect job manager .*]
>     at org.apache.flink.util.JobIDLoggingUtil.assertKeyPresent(JobIDLoggingUtil.java:98)
>     at org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.verifyJobIdIsLogged(JobMasterTriggerSavepointITCase.java:280)
> {code}
> Root cause: {{waitForDisconnect}} cancels the job and waits for the
> client-visible {{CANCELED}} status, then {{verifyJobIdIsLogged}} asserts that
> {{StandaloneResourceManager}} logged "Disconnect job manager ...". The
> JobMaster disconnects from the ResourceManager asynchronously during
> shutdown, *after* the job reports CANCELED. The run logs confirm the window:
> job CANCELED at 06:17:51,115, JobMaster began stopping at 06:17:51,136, and
> the assertion ran in between, capturing only the "Registering/Registered job
> manager" events.
> Not the same failure as FLINK-37821 (closed), which addressed a different
> signal in this test.
> Proposed fix: in {{waitForDisconnect}}, after the CANCELED wait, additionally
> wait until the RM has actually logged the disconnect event before returning.
> No assertion change.
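
For readers following along, the extra wait described in the proposed fix could look roughly like the sketch below. This is not the merged patch: DisconnectWaitSketch, waitUntil, hasLogged, and the in-memory rmLog list are hypothetical stand-ins, and the background thread only simulates the asynchronous JM->RM disconnect. The idea is to poll until the "Disconnect job manager" event is visible and only then run the unchanged assertion.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BooleanSupplier;
import java.util.regex.Pattern;

public class DisconnectWaitSketch {

    /** Polls the condition; returns true once it holds, false on timeout or interrupt. */
    public static boolean waitUntil(BooleanSupplier condition, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** True once any captured message fully matches the given regular expression. */
    public static boolean hasLogged(List<String> messages, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return messages.stream().anyMatch(m -> pattern.matcher(m).matches());
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for the messages captured from the ResourceManager logger.
        List<String> rmLog = new CopyOnWriteArrayList<>();
        rmLog.add("Registering job manager ...");
        rmLog.add("Registered job manager ...");

        // Simulates the asynchronous disconnect that lands after the job reports CANCELED.
        Thread disconnect = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            rmLog.add("Disconnect job manager ...");
        });
        disconnect.start();

        // The extra wait: block until the disconnect event is visible, then assert as before.
        boolean seen = waitUntil(
                () -> hasLogged(rmLog, "Disconnect job manager .*"), Duration.ofSeconds(5));
        System.out.println("disconnect logged: " + seen);
        disconnect.join();
    }
}
```

Bounding the wait with a timeout keeps a genuinely missing log event a test failure rather than a hang, which fits the "No assertion change" constraint above.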