Matthias Pohl created FLINK-26773:
-------------------------------------
Summary: ResourceManager leader election can cause a reconnect while
shutting down the JobMaster
Key: FLINK-26773
URL: https://issues.apache.org/jira/browse/FLINK-26773
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.14.4, 1.15.0, 1.16.0
Reporter: Matthias Pohl
There's a race condition involving the {{ResourceManager}} leader election
in the {{JobMaster}} while it is shutting down. During shutdown, the
{{JobMaster}} calls {{dissolveResourceManagerConnection}} to disconnect
itself from the {{ResourceManager}} (see
[JobMaster:1180|https://github.com/apache/flink/blob/fdb80108a3c0e4fb12dbc3f89ecb2327d97deebf/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1180]).
This causes the {{ResourceManager}} to close its side of the connection to
the {{JobMaster}} (see
[ResourceManager:979|https://github.com/apache/flink/blob/9055279d0286f4374694325250a45dc1c60301a7/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L979]).
The {{JobMaster}} tries to reconnect to the {{ResourceManager}} leader if
there's still an address stored for that leader (which is the case during
shutdown; see
[JobMaster:790|https://github.com/apache/flink/blob/fdb80108a3c0e4fb12dbc3f89ecb2327d97deebf/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L790]).
The {{JobMaster}} shouldn't try to reconnect after it has already freed its
requirements as part of the shutdown.
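
The sequence described above can be illustrated with a small, self-contained sketch. This is a simplified model and not Flink's actual code: the class {{ReconnectRaceSketch}}, the {{shuttingDown}} flag, the {{guardEnabled}} switch and the {{onResourceManagerDisconnected}} callback are assumptions made for illustration, and the asynchronous RPC round trip through the {{ResourceManager}} is collapsed into a direct method call. Only the name {{dissolveResourceManagerConnection}} is taken from the description. The guard shows one possible way to suppress the reconnect; it is not necessarily how the issue should be fixed.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of the shutdown sequence; not Flink's actual JobMaster. */
class ReconnectRaceSketch {
    // Last known ResourceManager leader address; still set during shutdown.
    private final String resourceManagerAddress;
    // Assumed guard flag for illustration; the real JobMaster may track this differently.
    private boolean shuttingDown = false;
    // Toggles the hypothetical guard so both behaviors can be compared.
    private final boolean guardEnabled;
    // Records what happened, in order, so the sequence can be inspected.
    final List<String> events = new ArrayList<>();

    ReconnectRaceSketch(String resourceManagerAddress, boolean guardEnabled) {
        this.resourceManagerAddress = resourceManagerAddress;
        this.guardEnabled = guardEnabled;
    }

    /** Shutdown path: free requirements, then dissolve the ResourceManager connection. */
    void shutDown() {
        shuttingDown = true;
        events.add("requirements freed");
        dissolveResourceManagerConnection();
    }

    /** The ResourceManager closing its side is modeled as a direct callback. */
    private void dissolveResourceManagerConnection() {
        events.add("connection dissolved");
        onResourceManagerDisconnected();
    }

    /** Called when the ResourceManager closes the connection from its side. */
    void onResourceManagerDisconnected() {
        if (guardEnabled && shuttingDown) {
            events.add("reconnect skipped");
            return;
        }
        // Without the guard, a stored leader address is enough to trigger a reconnect.
        if (resourceManagerAddress != null) {
            events.add("reconnect to " + resourceManagerAddress);
        }
    }

    public static void main(String[] args) {
        ReconnectRaceSketch unguarded = new ReconnectRaceSketch("rm-address", false);
        unguarded.shutDown();
        System.out.println("without guard: " + unguarded.events);

        ReconnectRaceSketch guarded = new ReconnectRaceSketch("rm-address", true);
        guarded.shutDown();
        System.out.println("with guard:    " + guarded.events);
    }
}
```

Without the guard, the model records a reconnect attempt after the requirements were freed, which is the behavior reported here. With the guard enabled, the disconnect callback returns early during shutdown.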