[jira] [Comment Edited] (FLINK-24038) DispatcherResourceManagerComponent fails to deregister application if no leading ResourceManager

Till Rohrmann (Jira) Fri, 28 Jan 2022 09:08:12 -0800


    [ https://issues.apache.org/jira/browse/FLINK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483853#comment-17483853 ]


Till Rohrmann edited comment on FLINK-24038 at 1/28/22, 5:07 PM:
-----------------------------------------------------------------

You are correct, [~wangyang0918], that we no longer perform the leader check 
when writing to the {{<clusterId>-<jobId>-config-map}}, because no leader 
election happens for this map. 

I think this now gives effectively the same guarantees that we have with the 
ZooKeeper HA implementation. If there is an old leader, it can still write 
to the config map. If I am not mistaken, the danger is that checkpoints from 
the old leader can still complete. If this is a problem, then one should not 
use the multiple component leader election HA services. Alternatively, one 
could think about using only a single config map. However, this would increase 
the number of concurrent writes to it. Given that we have the same semantics 
with the ZooKeeper HA services, I am not sure whether this is a real problem.
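
To make the difference concrete, here is a minimal sketch of a write that is 
guarded by a leader check versus one that is not. It is a toy model only: 
{{ConfigMap}}, {{write_with_leader_check}}, {{write_without_leader_check}}, the 
process names {{jm-1}}/{{jm-2}} and the key {{checkpoint}} are all hypothetical 
and do not correspond to Flink or Kubernetes APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ConfigMap:
    """Toy stand-in for a config map (hypothetical, not a Flink class)."""
    leader: Optional[str] = None                 # recorded leader, if any
    data: Dict[str, str] = field(default_factory=dict)


def write_with_leader_check(cm: ConfigMap, writer: str, key: str, value: str) -> bool:
    """Write only if `writer` is still recorded as the leader on this map."""
    if cm.leader != writer:
        return False                             # stale leader is rejected
    cm.data[key] = value
    return True


def write_without_leader_check(cm: ConfigMap, writer: str, key: str, value: str) -> bool:
    """Write unconditionally: the map carries no leader information to check."""
    cm.data[key] = value
    return True


# Assumed scenario: jm-1 was the leader, jm-2 has since taken over.
leader_map = ConfigMap(leader="jm-2")            # map with leader election
job_map = ConfigMap()                            # map without leader election

fenced = write_with_leader_check(leader_map, "jm-1", "checkpoint", "chk-42")
unfenced = write_without_leader_check(job_map, "jm-1", "checkpoint", "chk-42")
```

In this sketch the old leader's write to the map with a leader record is 
rejected, while its write to the map without leader election goes through, 
which mirrors the situation described above.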


was (Author: till.rohrmann):
You are correct [~wangyang0918], that we no longer have the leader check when 
writing to the {{<clusterId>-<jobId>-config-map}} because there is no leader 
election happening for this map. 

I think it gives now effectively the same guarantees that we also have with the 
ZooKeeper HA implementation. If there is an old leader the leader can still 
write things into the config map. If I am not mistaken, then the danger is that 
checkpoints from the old leader can complete. If this is a problem then one 
should not use the multiple component leader election ha services. 
Alternatively, one could think about only using a single config map. However, 
this would increase the concurrent writes to it.

> DispatcherResourceManagerComponent fails to deregister application if no 
> leading ResourceManager
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-24038
>                 URL: https://issues.apache.org/jira/browse/FLINK-24038
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> With FLINK-21667 we introduced a change that can cause the 
> {{DispatcherResourceManagerComponent}} to fail when trying to stop the 
> application. The problem is that the {{DispatcherResourceManagerComponent}} 
> needs a leading {{ResourceManager}} to successfully execute the 
> stop/deregister application call. If this is not the case, then it will fail 
> fatally. In the case of multiple standby JobManager processes it can happen 
> that the leading {{ResourceManager}} runs somewhere else.
> I do see two possible solutions:
> 1. Run the leader election process for the whole JobManager process
> 2. Move the registration/deregistration of the application out of the 
> {{ResourceManager}} so that it can be executed w/o a leader



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
