[
https://issues.apache.org/jira/browse/FLINK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483853#comment-17483853
]
Till Rohrmann edited comment on FLINK-24038 at 1/28/22, 5:07 PM:
-----------------------------------------------------------------
You are correct, [~wangyang0918], that we no longer have the leader check when
writing to the {{<clusterId>-<jobId>-config-map}} because there is no leader
election happening for this map.
I think this now gives effectively the same guarantees that we have with the
ZooKeeper HA implementation. If there is an old leader, it can still write to
the config map. If I am not mistaken, the danger is that checkpoints from the
old leader can complete. If this is a problem, then one should not use the
multiple component leader election HA services.
Alternatively, one could think about using only a single config map. However,
this would increase the number of concurrent writes to it. Given that we have
the same semantics with the ZooKeeper HA services, I am not sure whether this is
a real problem.
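To illustrate the difference between a write with and without the leader check, here is a minimal sketch. It uses a hypothetical in-memory stand-in ({{ConfigMapSketch}}, {{checkedWrite}}, {{uncheckedWrite}} are made-up names, not Flink or Kubernetes APIs) and only models the semantics discussed above, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory stand-in for a config map; not a Flink or Kubernetes class. */
final class ConfigMapSketch {
    // Identity of the current leader as recorded by leader election, if any.
    private String leaderId;
    private final Map<String, String> data = new HashMap<>();

    void setLeader(String newLeaderId) {
        this.leaderId = newLeaderId;
    }

    /** Write guarded by a leader check: rejected unless the writer is the recorded leader. */
    boolean checkedWrite(String writerId, String key, String value) {
        if (!writerId.equals(leaderId)) {
            return false; // a stale leader is fenced off
        }
        data.put(key, value);
        return true;
    }

    /** Write without a leader check: any writer, including an old leader, succeeds. */
    boolean uncheckedWrite(String writerId, String key, String value) {
        data.put(key, value);
        return true;
    }

    Optional<String> get(String key) {
        return Optional.ofNullable(data.get(key));
    }
}
```

In this sketch, once leadership has moved to another process, the old leader's {{checkedWrite}} is rejected while its {{uncheckedWrite}} still lands. That corresponds to the case where a checkpoint from the old leader can still complete.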
> DispatcherResourceManagerComponent fails to deregister application if no
> leading ResourceManager
> ------------------------------------------------------------------------------------------------
>
> Key: FLINK-24038
> URL: https://issues.apache.org/jira/browse/FLINK-24038
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.15.0
>
>
> With FLINK-21667 we introduced a change that can cause the
> {{DispatcherResourceManagerComponent}} to fail when trying to stop the
> application. The problem is that the {{DispatcherResourceManagerComponent}}
> needs a leading {{ResourceManager}} to successfully execute the
> stop/deregister application call. If this is not the case, then it will fail
> fatally. In the case of multiple standby JobManager processes it can happen
> that the leading {{ResourceManager}} runs somewhere else.
> I see two possible solutions:
> 1. Run the leader election process for the whole JobManager process
> 2. Move the registration/deregistration of the application out of the
> {{ResourceManager}} so that it can be executed without a leader
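
The failure described in the issue and the second proposed solution can be sketched as follows. All class and method names here ({{ResourceManagerSketch}}, {{ExternalClusterSketch}}, {{ShutdownSketch}}) are hypothetical and only model the described behavior; they are not Flink's actual classes or the fix that was merged.

```java
/** Hypothetical model of a ResourceManager that may or may not hold leadership. */
final class ResourceManagerSketch {
    private final boolean leader;

    ResourceManagerSketch(boolean leader) {
        this.leader = leader;
    }

    boolean isLeader() {
        return leader;
    }
}

/** Hypothetical model of the external system the application is registered with. */
final class ExternalClusterSketch {
    private boolean registered = true;

    void deregister() {
        registered = false;
    }

    boolean isRegistered() {
        return registered;
    }
}

final class ShutdownSketch {
    /** Behavior as described in the issue: fails if this process has no leading ResourceManager. */
    static void deregisterViaResourceManager(ResourceManagerSketch rm, ExternalClusterSketch cluster) {
        if (!rm.isLeader()) {
            throw new IllegalStateException("no leading ResourceManager in this process");
        }
        cluster.deregister();
    }

    /** Option 2 from the issue: deregistration does not depend on ResourceManager leadership. */
    static void deregisterDirectly(ExternalClusterSketch cluster) {
        cluster.deregister();
    }
}
```

With a standby (non-leading) {{ResourceManagerSketch}}, the first method throws and the application stays registered, whereas the second method succeeds regardless of which process holds leadership.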