[
https://issues.apache.org/jira/browse/FLINK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406350#comment-17406350
]
Xintong Song commented on FLINK-24038:
--------------------------------------
I think option 2) should not work. To deregister an application, it can involve
interactions with the underlying external resource manager. This is usually
specific to the underlying system, and is better performed by the
ResourceManagerDriver. Most importantly, deregistration of an application
usually means all processes will be terminated, so a non-leader JobManager
process could kill a leader process if it were allowed to deregister, which is
undesired.
Option 1) might work. I would need to look into it a bit more to be sure about
that. Even if it works, my gut feeling is that the effort needed and the
potential impact on stability may not be trivial.
Alternatively, we may consider simply not throwing the error when there is no
leading resource manager. To be specific, if there is a leading resource
manager, errors that occur during deregistration should still be considered
fatal. But if there is no leading resource manager, we simply don't do the
deregistration. For standalone clusters, there should be no difference anyway,
since the StandaloneResourceManager does not do anything for deregistration.
For active resource managers, I think it's a good contract that only the
leading resource manager interacts with the external resource manager (except
for pure read operations). The side effect would be that, if Flink tries to
deregister when there is no leading RM, the deregistration cannot succeed and
K8s/Yarn will bring up another JobManager process anyway, which is the same as
how it behaves currently and IMHO not a big problem.
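
A minimal sketch of the contract proposed above, in Java. All names here
(DeregistrationSketch, LeaderGateway, deregisterIfLeaderPresent) are
hypothetical and do not reflect Flink's actual classes or signatures. The
sketch only illustrates the intended behavior: skip deregistration when there
is no leading resource manager, and propagate failures when there is one.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: none of these types are Flink's actual classes.
public class DeregistrationSketch {

    /** Hypothetical stand-in for a gateway to the leading resource manager. */
    public interface LeaderGateway {
        CompletableFuture<Void> deregisterApplication(String finalStatus);
    }

    /**
     * Deregisters the application only if a leading resource manager is known.
     *
     * <p>Without a leader, the call is skipped and the future completes with
     * {@code false}. With a leader, the future completes with {@code true} on
     * success and completes exceptionally on failure, so the caller can still
     * treat that failure as fatal.
     */
    public static CompletableFuture<Boolean> deregisterIfLeaderPresent(
            Optional<LeaderGateway> leader, String finalStatus) {
        if (!leader.isPresent()) {
            // No leading resource manager: do nothing instead of failing.
            return CompletableFuture.completedFuture(false);
        }
        // A leader exists: any error from deregistration is propagated.
        return leader.get()
                .deregisterApplication(finalStatus)
                .thenApply(ignored -> true);
    }
}
```

In this sketch the caller decides what "fatal" means: it would only escalate
when the returned future completes exceptionally, which can happen only if a
leader was present.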
> DispatcherResourceManagerComponent fails to deregister application if no
> leading ResourceManager
> ------------------------------------------------------------------------------------------------
>
> Key: FLINK-24038
> URL: https://issues.apache.org/jira/browse/FLINK-24038
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.0
> Reporter: Till Rohrmann
> Priority: Critical
> Fix For: 1.14.0
>
>
> With FLINK-21667 we introduced a change that can cause the
> {{DispatcherResourceManagerComponent}} to fail when trying to stop the
> application. The problem is that the {{DispatcherResourceManagerComponent}}
> needs a leading {{ResourceManager}} to successfully execute the
> stop/deregister application call. If this is not the case, then it will fail
> fatally. In the case of multiple standby JobManager processes it can happen
> that the leading {{ResourceManager}} runs somewhere else.
> I do see two possible solutions:
> 1. Run the leader election process for the whole JobManager process
> 2. Move the registration/deregistration of the application out of the
> {{ResourceManager}} so that it can be executed w/o a leader