[ 
https://issues.apache.org/jira/browse/FLINK-26773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539730#comment-17539730
 ] 

Jonathan Lazarus edited comment on FLINK-26773 at 5/19/22 6:33 PM:
-------------------------------------------------------------------

[~mapohl] thanks for assigning. As for [~freeke]'s solution in FLINK-27354: 
{quote}"set the  resourceManagerAddress as null when JobMaster shuts down, then 
JobMaster::isConnectingToResourceManager will return false and JobMaster will 
not try to reconnect to resourceManager"
{quote}
That should work theoretically, but it probably won't because:
 # This is already done in 
[JobMaster::closeResourceManagerConnection|https://github.com/apache/flink/blob/HEAD/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1145-L1149]
 before 
[JobMaster::isConnectingToResourceManager|https://github.com/apache/flink/blob/HEAD/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L784]
 is checked.
 # Even if we set the resourceManagerAddress to null at the start of shutdown, 
the ResourceManagerLeaderRetriever [asynchronously updates the 
resourceManagerAddress|https://github.com/apache/flink/blob/HEAD/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1223],
 so the resourceManagerAddress can become non-null again afterwards. This 
service is only 
[stopped|https://github.com/apache/flink/blob/HEAD/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L957]
 after trying to [disconnect the 
ResourceManager|https://github.com/apache/flink/blob/HEAD/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L980].

Therefore, I suggested using a boolean flag. Alternatively, we could use [~freeke]'s 
solution if we stop the ResourceManagerLeaderRetriever before setting 
resourceManagerAddress to null. What do you think, [~mapohl]?
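
To make the boolean-flag idea concrete, here is a minimal single-threaded sketch. It is not Flink code: the class {{JobMasterShutdownSketch}}, the {{shuttingDown}} flag, and all method names are hypothetical, and it assumes that leader notifications and disconnect callbacks are serialized onto one thread. The point is only that the reconnect check consults the flag, so a late update from the leader retriever cannot trigger a reconnect even though it makes the address non-null again.

```java
/** Hypothetical, simplified model of the reconnect decision; not the actual Flink JobMaster. */
final class JobMasterShutdownSketch {
    // Last leader address reported by the (modelled) leader retriever; null if unknown.
    private String resourceManagerAddress;
    // Proposed flag (hypothetical name): set once at the start of shutdown, never cleared.
    private boolean shuttingDown;
    // Counts connection attempts so the behaviour can be observed.
    private int connectAttempts;

    /** Models the leader retriever callback, which may still fire during shutdown. */
    void notifyOfNewResourceManagerLeader(String newAddress) {
        resourceManagerAddress = newAddress;
        reconnectIfAllowed();
    }

    /** Models the ResourceManager closing the connection from its side. */
    void onDisconnectedByResourceManager() {
        reconnectIfAllowed();
    }

    /** Models the start of the JobMaster shutdown. */
    void startShutdown() {
        shuttingDown = true;
        resourceManagerAddress = null;
    }

    private void reconnectIfAllowed() {
        // Checking the address alone is not enough, because it can be set again
        // after startShutdown(); the flag makes the decision final.
        if (!shuttingDown && resourceManagerAddress != null) {
            connectAttempts++;
        }
    }

    int connectAttempts() {
        return connectAttempts;
    }

    String resourceManagerAddress() {
        return resourceManagerAddress;
    }
}
```

In this model, a leader notification that arrives after {{startShutdown()}} still stores the new address, but the number of connection attempts does not change.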



> ResourceManager leader election can cause a reconnect while shutting down the 
> JobMaster
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-26773
>                 URL: https://issues.apache.org/jira/browse/FLINK-26773
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.4, 1.16.0
>            Reporter: Matthias Pohl
>            Assignee: Jonathan Lazarus
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: FLINK-26773.failure-during-shutdown.log
>
>
> There's a race condition with the {{ResourceManager}} leader 
> election in the {{JobMaster}} during shutdown. While shutting down, the 
> {{JobMaster}} calls {{dissolveResourceManagerConnection}} to 
> disconnect itself from the {{ResourceManager}} (see 
> [JobMaster:1180|https://github.com/apache/flink/blob/fdb80108a3c0e4fb12dbc3f89ecb2327d97deebf/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1180]).
> This closes the connection to the {{JobMaster}} from the 
> {{ResourceManager}}'s side (see 
> [ResourceManager:979|https://github.com/apache/flink/blob/9055279d0286f4374694325250a45dc1c60301a7/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L979]).
> The {{JobMaster}} tries to reconnect to the {{ResourceManager}} leader if 
> there's still an address stored for that leader (which is the case during 
> shutdown; see 
> [JobMaster:790|https://github.com/apache/flink/blob/fdb80108a3c0e4fb12dbc3f89ecb2327d97deebf/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L790]).
> The {{JobMaster}} shouldn't try to reconnect after it has already freed its 
> requirements as part of the shutdown.
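
The sequence described above can be modelled with a small sketch. This is a simplification under stated assumptions, not the actual Flink classes: {{ReconnectDuringShutdownSketch}}, its fields, and its method names (other than {{dissolveResourceManagerConnection}}, which is borrowed from the description) are hypothetical, and the address value is a placeholder. It only shows why a reconnect check based solely on the stored leader address fires during shutdown.

```java
/** Hypothetical, simplified model of the shutdown sequence; not the actual Flink classes. */
final class ReconnectDuringShutdownSketch {
    // A leader address is still stored while the JobMaster shuts down (placeholder value).
    private String resourceManagerAddress = "rm-leader-address";
    // Set once the JobMaster has released its requirements as part of shutdown.
    private boolean requirementsFreed;
    // Counts the unwanted reconnect attempts made after the requirements were freed.
    private int reconnectsAfterRequirementsFreed;

    /** Models the JobMaster shutdown: free requirements, then disconnect. */
    void shutDown() {
        requirementsFreed = true;
        dissolveResourceManagerConnection();
    }

    private void dissolveResourceManagerConnection() {
        // In this model the ResourceManager closes its side immediately,
        // which the JobMaster observes as a disconnect.
        onDisconnectedByResourceManager();
    }

    private void onDisconnectedByResourceManager() {
        // The reconnect decision looks only at the stored address.
        if (resourceManagerAddress != null && requirementsFreed) {
            reconnectsAfterRequirementsFreed++;
        }
    }

    int reconnectsAfterRequirementsFreed() {
        return reconnectsAfterRequirementsFreed;
    }
}
```

Calling {{shutDown()}} on this model records one reconnect attempt after the requirements were freed, which is the behaviour the issue describes as unwanted.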



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
