[ 
https://issues.apache.org/jira/browse/FLINK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406756#comment-17406756
 ] 

Till Rohrmann commented on FLINK-24038:
---------------------------------------

I am not so sure about your verdict on option 2). It is correct that, at the 
moment, the communication logic for interacting with an external resource manager 
is encapsulated in the {{ResourceManagerDriver}}. However, this does not have 
to mean that the logic to register and deregister an application should also be 
the responsibility of the {{ResourceManager}}. I think this actually shows in 
the old code where we call into the {{ResourceManager}} to deregister the 
application independent of its leadership. Moreover, the decision whether to 
shut down the cluster or not is currently made by the {{Dispatcher}}. Hence, 
this component should be able to do this independently of whether an {{RM}} is 
running (also think about the hypothetical case where we split the 
{{Dispatcher}} and {{ResourceManager}} components into several processes).
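
To make the separation in option 2) a bit more concrete, here is a minimal sketch of what it could look like. All names in it ({{ApplicationDeregistrar}}, {{ShutdownCoordinator}}, {{RecordingDeregistrar}}) are hypothetical and are not existing Flink classes; the sketch only assumes that deregistration can be called without any leadership check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical interface: deregisters the application from the external
 *  resource manager without requiring ResourceManager leadership. */
interface ApplicationDeregistrar {
    CompletableFuture<Void> deregisterApplication(String finalStatus);
}

/** Hypothetical stand-in for the component that decides on cluster shutdown. */
final class ShutdownCoordinator {
    private final ApplicationDeregistrar deregistrar;

    ShutdownCoordinator(ApplicationDeregistrar deregistrar) {
        this.deregistrar = deregistrar;
    }

    /** Deregisters directly; no leading ResourceManager is looked up. */
    CompletableFuture<Void> shutDownCluster(String finalStatus) {
        return deregistrar.deregisterApplication(finalStatus);
    }
}

/** Test double that records calls instead of contacting Yarn or K8s. */
final class RecordingDeregistrar implements ApplicationDeregistrar {
    final List<String> calls = new ArrayList<>();

    @Override
    public CompletableFuture<Void> deregisterApplication(String finalStatus) {
        calls.add(finalStatus);
        return CompletableFuture.completedFuture(null);
    }
}

final class DeregistrationSketch {
    public static void main(String[] args) {
        RecordingDeregistrar deregistrar = new RecordingDeregistrar();
        // The coordinator never asks whether a ResourceManager is leading.
        new ShutdownCoordinator(deregistrar).shutDownCluster("SUCCEEDED").join();
        System.out.println("Deregistration calls: " + deregistrar.calls);
    }
}
```

The point of the sketch is only that the shutdown path depends on the deregistration interface and not on a leading {{ResourceManager}}, so it would also work if the components lived in separate processes.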

Concerning your proposal of not doing the deregistration if there is no leading 
{{ResourceManager}}: How will this work if we use a K8s job? If I am not 
mistaken, then the exit code of the process decides whether K8s restarts the 
job or not. So if we shut down normally but cannot deregister the 
application, then we will continue and stop with a zero exit code. So in this 
scenario, K8s will terminate the job but we won't clean up other K8s resources.
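
The scenario can be sketched as a tiny model. This is a deliberately simplified assumption of how a K8s job reacts to the exit code (zero means done, non-zero means restart), not real K8s or Flink code, and all names ({{K8sJobAction}}, {{ShutdownResult}}, {{K8sJobScenario}}) are hypothetical.

```java
/** Simplified model of what K8s does with the job after the process exits. */
enum K8sJobAction { COMPLETE, RESTART }

/** Outcome of a shutdown attempt in this model. */
final class ShutdownResult {
    final int exitCode;
    final boolean deregistered;

    ShutdownResult(int exitCode, boolean deregistered) {
        this.exitCode = exitCode;
        this.deregistered = deregistered;
    }
}

final class K8sJobScenario {
    /** Assumption: exit code 0 completes the job, anything else restarts it. */
    static K8sJobAction onProcessExit(int exitCode) {
        return exitCode == 0 ? K8sJobAction.COMPLETE : K8sJobAction.RESTART;
    }

    /** Models the proposal: skip deregistration if there is no leading
     *  ResourceManager, but still shut down normally with exit code 0. */
    static ShutdownResult shutDown(boolean hasLeadingResourceManager) {
        boolean deregistered = hasLeadingResourceManager;
        return new ShutdownResult(0, deregistered);
    }
}

final class K8sExitCodeSketch {
    public static void main(String[] args) {
        ShutdownResult result = K8sJobScenario.shutDown(false);
        System.out.println("K8s action: " + K8sJobScenario.onProcessExit(result.exitCode));
        System.out.println("Other resources cleaned up: " + result.deregistered);
    }
}
```

In this model the job ends up completed while the deregistration never ran, which is exactly the combination that leaves the other K8s resources behind.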

For Yarn, I believe this proposal can work, even though it is not super nice.

Are you taking care of this problem [~xtsong]?

> DispatcherResourceManagerComponent fails to deregister application if no 
> leading ResourceManager
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-24038
>                 URL: https://issues.apache.org/jira/browse/FLINK-24038
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.14.0
>
>
> With FLINK-21667 we introduced a change that can cause the 
> {{DispatcherResourceManagerComponent}} to fail when trying to stop the 
> application. The problem is that the {{DispatcherResourceManagerComponent}} 
> needs a leading {{ResourceManager}} to successfully execute the 
> stop/deregister application call. If this is not the case, then it will fail 
> fatally. In the case of multiple standby JobManager processes it can happen 
> that the leading {{ResourceManager}} runs somewhere else.
> I do see two possible solutions:
> 1. Run the leader election process for the whole JobManager process
> 2. Move the registration/deregistration of the application out of the 
> {{ResourceManager}} so that it can be executed w/o a leader


