[jira] [Commented] (FLINK-24038) DispatcherResourceManagerComponent fails to deregister application if no leading ResourceManager

Till Rohrmann (Jira) Tue, 31 Aug 2021 06:32:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407343#comment-17407343
 ]


Till Rohrmann commented on FLINK-24038:
---------------------------------------

Which solution is the easiest is a good question. I think the proper solution 
is option 1) because it fixes the underlying cause rather than only a symptom. 
Option 2) is desirable for other reasons as well (leaders are not spread across 
different processes, and there is less request load on the HA system because 
there is only a single leader election), and it also solves the problem 
described here, albeit more indirectly.

I do see that option 1) will complicate things a bit because we have to create 
new {{YarnClient}} and {{NamespacedKubernetesClient}} instances outside of the 
{{ResourceManagerDriver}}, where these clients are currently nicely 
encapsulated. I do think that we can manage some of this complexity by choosing 
proper abstractions. Still, it will make the system slightly more complicated.
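
To make the idea of a "proper abstraction" more concrete, here is a minimal 
sketch of how deregistration could be separated from the leading 
{{ResourceManager}}. It is only an illustration under stated assumptions: 
{{ApplicationDeregistrar}}, {{ClusterClientFactory}} and {{FakeClusterClient}} 
are hypothetical names that do not exist in Flink, and the fake client merely 
stands in for a real {{YarnClient}} or {{NamespacedKubernetesClient}}.

```java
import java.util.ArrayList;
import java.util.List;

public class DeregistrationSketch {

    /** Hypothetical abstraction: deregisters the application without requiring RM leadership. */
    interface ApplicationDeregistrar {
        void deregister(String finalStatus, String diagnostics) throws Exception;
    }

    /** Hypothetical factory that creates a short-lived cluster client for the shutdown path. */
    interface ClusterClientFactory {
        FakeClusterClient create() throws Exception;
    }

    /** Stand-in for a real cluster client; it only records the calls it receives. */
    static final class FakeClusterClient implements AutoCloseable {
        private final List<String> calls;

        FakeClusterClient(List<String> calls) {
            this.calls = calls;
        }

        void unregisterApplication(String finalStatus, String diagnostics) {
            calls.add("unregister:" + finalStatus);
        }

        @Override
        public void close() {
            calls.add("close");
        }
    }

    /** Builds a deregistrar that owns the client lifecycle: create, unregister, close. */
    static ApplicationDeregistrar deregistrarFrom(ClusterClientFactory factory) {
        return (finalStatus, diagnostics) -> {
            try (FakeClusterClient client = factory.create()) {
                client.unregisterApplication(finalStatus, diagnostics);
            }
        };
    }

    /** Simulates stopping the application in a process that holds no ResourceManager leadership. */
    static List<String> stopWithoutLeader(String finalStatus) throws Exception {
        List<String> calls = new ArrayList<>();
        ApplicationDeregistrar deregistrar = deregistrarFrom(() -> new FakeClusterClient(calls));
        deregistrar.deregister(finalStatus, "stopped by a non-leading process");
        return calls;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stopWithoutLeader("SUCCEEDED"));
    }
}
```

The point of the sketch is that the component stopping the application would 
depend only on the small deregistration interface, while creating and closing 
the cluster client stays hidden behind the factory. Whether this fits the 
existing {{ResourceManagerDriver}} design would still need to be checked.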

Maybe we can start by looking into option 2) first in order to better 
understand the scope of this change.

> DispatcherResourceManagerComponent fails to deregister application if no 
> leading ResourceManager
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-24038
>                 URL: https://issues.apache.org/jira/browse/FLINK-24038
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.14.0
>
>
> With FLINK-21667 we introduced a change that can cause the 
> {{DispatcherResourceManagerComponent}} to fail when trying to stop the 
> application. The problem is that the {{DispatcherResourceManagerComponent}} 
> needs a leading {{ResourceManager}} to successfully execute the 
> stop/deregister application call. If this is not the case, then it will fail 
> fatally. In the case of multiple standby JobManager processes it can happen 
> that the leading {{ResourceManager}} runs somewhere else.
> I do see two possible solutions:
> 1. Run the leader election process for the whole JobManager process
> 2. Move the registration/deregistration of the application out of the 
> {{ResourceManager}} so that it can be executed w/o a leader



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
