[
https://issues.apache.org/jira/browse/FLINK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407343#comment-17407343
]
Till Rohrmann commented on FLINK-24038:
---------------------------------------
It is a good question what the easiest solution is. I think the proper solution
is option 2) because it solves more than just a symptom. Option 1) is desirable
for other reasons as well (leaders are not spread across different processes, and
there is less request load on the HA system because there is only a single leader
election), and it also solves the problem described here, albeit more indirectly.
I do see that option 2) will complicate things a bit because we have to create
new {{YarnClient}} and {{NamespacedKubernetesClient}} instances that are now
nicely encapsulated in the {{ResourceManagerDriver}}. I do think that we can
manage some of this complexity by choosing proper abstractions. But still, it
will make the system slightly more complicated.
Maybe we can start by looking into option 1) first in order to better
understand the scope of this change.
> DispatcherResourceManagerComponent fails to deregister application if no
> leading ResourceManager
> ------------------------------------------------------------------------------------------------
>
> Key: FLINK-24038
> URL: https://issues.apache.org/jira/browse/FLINK-24038
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.0
> Reporter: Till Rohrmann
> Priority: Critical
> Fix For: 1.14.0
>
>
> With FLINK-21667 we introduced a change that can cause the
> {{DispatcherResourceManagerComponent}} to fail when trying to stop the
> application. The problem is that the {{DispatcherResourceManagerComponent}}
> needs a leading {{ResourceManager}} to successfully execute the
> stop/deregister application call. If this is not the case, then it will fail
> fatally. In the case of multiple standby JobManager processes it can happen
> that the leading {{ResourceManager}} runs somewhere else.
> I do see two possible solutions:
> 1. Run the leader election process for the whole JobManager process
> 2. Move the registration/deregistration of the application out of the
> {{ResourceManager}} so that it can be executed without a leader
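
The failure mode and the leader-independent deregistration described above can
be illustrated with a minimal Java sketch. It does not use Flink's actual
classes: ClusterFramework, SketchResourceManager and StandaloneDeregistrar are
hypothetical stand-ins, and the real deregistration call and its error handling
may look quite different.

```java
import java.util.concurrent.CompletableFuture;

class DeregistrationSketch {

    /** Hypothetical stand-in for the external cluster framework (e.g. YARN or Kubernetes). */
    static final class ClusterFramework {
        boolean applicationRegistered = true;

        void unregister() {
            applicationRegistered = false;
        }
    }

    /** Hypothetical ResourceManager: it owns the framework client and only acts while leading. */
    static final class SketchResourceManager {
        private final ClusterFramework framework;
        private final boolean leader;

        SketchResourceManager(ClusterFramework framework, boolean leader) {
            this.framework = framework;
            this.leader = leader;
        }

        CompletableFuture<Void> deregisterApplication() {
            CompletableFuture<Void> result = new CompletableFuture<>();
            if (!leader) {
                // Models the reported failure: the leading ResourceManager runs in another process.
                result.completeExceptionally(
                        new IllegalStateException("ResourceManager in this process is not the leader"));
                return result;
            }
            framework.unregister();
            result.complete(null);
            return result;
        }
    }

    /** Hypothetical deregistration that lives outside the ResourceManager and needs no leader. */
    static final class StandaloneDeregistrar {
        private final ClusterFramework framework;

        StandaloneDeregistrar(ClusterFramework framework) {
            this.framework = framework;
        }

        CompletableFuture<Void> deregisterApplication() {
            // Requires its own framework client, but does not depend on leadership.
            framework.unregister();
            return CompletableFuture.completedFuture(null);
        }
    }

    /** Returns true if stopping the application through a non-leading ResourceManager fails. */
    static boolean stopViaStandbyResourceManagerFails(ClusterFramework framework) {
        SketchResourceManager standby = new SketchResourceManager(framework, false);
        return standby.deregisterApplication().isCompletedExceptionally();
    }

    public static void main(String[] args) {
        ClusterFramework framework = new ClusterFramework();
        System.out.println("via standby RM fails: " + stopViaStandbyResourceManagerFails(framework));
        new StandaloneDeregistrar(framework).deregisterApplication().join();
        System.out.println("still registered: " + framework.applicationRegistered);
    }
}
```

In this sketch the standalone deregistrar needs its own handle to the cluster
framework, which mirrors the concern that new client instances would have to be
created outside the {{ResourceManagerDriver}}.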