[
https://issues.apache.org/jira/browse/FLINK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407048#comment-17407048
]
Xintong Song commented on FLINK-24038:
--------------------------------------
I think you're correct, about option 2) and about the K8s job.
Let me re-list the options I see.
1) Allow {{Dispatcher}} to deregister the application independently. This is
probably the least invasive solution. The only downside I see is that it
introduces the complexity of dealing with different external resource managers
to the dispatcher, which is probably not very complex with a proper unified
abstraction.
2) Run the leader election process for the whole
{{DispatcherResourceManagerComponent}}. This would guarantee we either have
both a leading {{Dispatcher}} and a leading {{ResourceManager}}, or neither of
them. This is IMHO too invasive to be done in the release stabilization phase.
Speaking of splitting a JobManager process into multiple processes, I'm not
entirely sure whether it is indeed needed. And even if it is needed in the
future, I wonder whether it would be good enough to have the {{Dispatcher}} and
the {{ResourceManager}} in one process, with each {{JobMaster}} in a separate
process. I guess it depends on what essential demands we see in separating the
processes.
3) Make {{Dispatcher}} and {{ResourceManager}} talk to each other via RPC.
This should work in scenarios where the leading RM is somewhere else, but not
in scenarios where there's no leading RM.
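
To make option 1) a bit more concrete, here is a minimal sketch of what a
unified deregistration abstraction could look like. All type and method names
({{ApplicationDeregistrar}}, {{DispatcherSketch}}, {{shutDownCluster}}, ...) are
hypothetical and only for illustration; they are not existing Flink classes or
a proposed final design.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical; none of them are existing Flink classes.
enum FinalStatus { SUCCEEDED, FAILED, CANCELED }

/** Unified abstraction over external resource managers (YARN, K8s, ...). */
interface ApplicationDeregistrar {
    void deregister(FinalStatus status, String diagnostics);
}

/** Test double that records calls instead of talking to a real cluster manager. */
final class RecordingDeregistrar implements ApplicationDeregistrar {
    final List<String> calls = new ArrayList<>();

    @Override
    public void deregister(FinalStatus status, String diagnostics) {
        calls.add(status + ":" + diagnostics);
    }
}

/** Dispatcher stand-in that deregisters without consulting any ResourceManager. */
final class DispatcherSketch {
    private final ApplicationDeregistrar deregistrar;

    DispatcherSketch(ApplicationDeregistrar deregistrar) {
        this.deregistrar = deregistrar;
    }

    void shutDownCluster(FinalStatus status, String diagnostics) {
        // No leader lookup here: the dispatcher owns the deregistration call.
        deregistrar.deregister(status, diagnostics);
    }
}
```

The point of the sketch is only that the dispatcher depends on one small
interface, so the per-deployment differences stay behind it.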
TBH, I'm not entirely sure which one of 1) & 2) is better. Maybe slightly
leaning towards 2), which in general simplifies things rather than complicates
them, if we leave aside the topic of splitting the process. WDYT?
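
For comparison, option 2) could be sketched roughly as below: a single
leadership grant or revocation drives both services, so they are always either
both leading or both not leading. Again, the names ({{LeaderService}},
{{ToggleService}}, {{ComponentLeaderContender}}) are hypothetical, and the sketch
ignores the asynchronous start/stop and error handling a real implementation
would need.

```java
/** Hypothetical: a service that is only active while its process is the leader. */
interface LeaderService {
    void start();
    void stop();
    boolean isRunning();
}

/** Trivial in-memory implementation used for illustration. */
final class ToggleService implements LeaderService {
    private boolean running;

    @Override public void start() { running = true; }
    @Override public void stop() { running = false; }
    @Override public boolean isRunning() { return running; }
}

/** One contender for the whole component instead of one per service. */
final class ComponentLeaderContender {
    private final LeaderService dispatcher;
    private final LeaderService resourceManager;

    ComponentLeaderContender(LeaderService dispatcher, LeaderService resourceManager) {
        this.dispatcher = dispatcher;
        this.resourceManager = resourceManager;
    }

    synchronized void grantLeadership() {
        // Both services become leader under the same grant.
        resourceManager.start();
        dispatcher.start();
    }

    synchronized void revokeLeadership() {
        // Both services lose leadership together.
        dispatcher.stop();
        resourceManager.stop();
    }
}
```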
I'd be happy to look into this, but cannot promise to take care of it for
1.14.0. It really depends on which direction we decide to go. For option 1),
I can give it a try. For option 2), I would suggest living with the problem for
now and fixing it in the next major release.
> DispatcherResourceManagerComponent fails to deregister application if no
> leading ResourceManager
> ------------------------------------------------------------------------------------------------
>
> Key: FLINK-24038
> URL: https://issues.apache.org/jira/browse/FLINK-24038
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.0
> Reporter: Till Rohrmann
> Priority: Critical
> Fix For: 1.14.0
>
>
> With FLINK-21667 we introduced a change that can cause the
> {{DispatcherResourceManagerComponent}} to fail when trying to stop the
> application. The problem is that the {{DispatcherResourceManagerComponent}}
> needs a leading {{ResourceManager}} to successfully execute the
> stop/deregister application call. If this is not the case, then it will fail
> fatally. In the case of multiple standby JobManager processes it can happen
> that the leading {{ResourceManager}} runs somewhere else.
> I do see two possible solutions:
> 1. Run the leader election process for the whole JobManager process
> 2. Move the registration/deregistration of the application out of the
> {{ResourceManager}} so that it can be executed w/o a leader