[ 
https://issues.apache.org/jira/browse/FLINK-25893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486189#comment-17486189
 ] 

Xintong Song commented on FLINK-25893:
--------------------------------------

Hi [~trohrmann],

For problem 2, I think it is indeed a problem. {{ResourceManagerServiceImpl}} 
should not only check whether there is a leading {{ResourceManager}}, but also 
make sure the leading {{ResourceManager}} is fully started before calling 
{{ResourceManagerGateway#deregisterApplication}}.
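
A minimal sketch of such a check, not the actual Flink implementation. The 
class {{StartAwareResourceManagerService}} and all of its method and field 
names are hypothetical. It assumes the service keeps a future that completes 
only once the leading {{ResourceManager}} has been started, and chains the 
deregistration on that future so the call is deferred rather than discarded.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the service; not Flink's actual class. */
final class StartAwareResourceManagerService {
    private final Object lock = new Object();

    // Null while this process holds no leadership; otherwise completed
    // once the leading ResourceManager has been started.
    private CompletableFuture<Void> leaderStarted;

    private volatile boolean deregistered;

    /** Leadership granted: the ResourceManager is created but not yet started. */
    void grantLeadership() {
        synchronized (lock) {
            leaderStarted = new CompletableFuture<>();
        }
    }

    /** Called once the ResourceManager has finished starting. */
    void onResourceManagerStarted() {
        final CompletableFuture<Void> started;
        synchronized (lock) {
            started = leaderStarted;
        }
        if (started != null) {
            started.complete(null);
        }
    }

    /** Defers the call until the leader is started instead of dropping it. */
    CompletableFuture<Void> deregisterApplication() {
        final CompletableFuture<Void> started;
        synchronized (lock) {
            started = leaderStarted;
        }
        if (started == null) {
            CompletableFuture<Void> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                    new IllegalStateException("No leading ResourceManager in this process."));
            return failed;
        }
        // Placeholder for forwarding the call to the gateway (assumption).
        return started.thenRun(() -> deregistered = true);
    }

    boolean isDeregistered() {
        return deregistered;
    }
}
```

With this shape, a {{deregisterApplication}} call that arrives between 
creation and start of the {{ResourceManager}} completes only after 
{{onResourceManagerStarted}} has run.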

For problem 1, I think it's kind of expected. When there's no leading 
{{ResourceManager}}, the {{ResourceManagerServiceImpl}} can respond to a 
{{deregisterApplication}} call by either ignoring the call or reporting an 
exception. When there's a leading {{ResourceManager}} in another process, it is 
desirable that the non-leading process ignores the {{deregisterApplication}} 
call. On the other hand, if there is no leading {{ResourceManager}} in any 
process of the cluster, it is desirable that the failure of 
{{deregisterApplication}} is reported. Since {{ResourceManagerServiceImpl}} 
cannot know whether there's a leading {{ResourceManager}} in another process, 
I think a false alarm in the non-leading process when there is another leading 
process is probably better than failing the {{deregisterApplication}} silently 
when there's no leading process, as in the latter case Kubernetes / Yarn may 
unexpectedly bring the master process up again.
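
The two possible responses can be contrasted in a small sketch. This is an 
illustration only: {{NoLeaderDeregistrationSketch}}, {{NoLeaderPolicy}} and 
{{deregisterWithoutLocalLeader}} are hypothetical names, not Flink APIs. The 
{{REPORT}} branch corresponds to the behaviour preferred above, where a 
process without a local leader returns an exceptionally completed future.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical illustration; not part of Flink. */
final class NoLeaderDeregistrationSketch {

    /** How a process without a local leading ResourceManager responds. */
    enum NoLeaderPolicy {
        IGNORE, // silent success: hides the case where no process leads at all
        REPORT  // failure: may be a false alarm if another process is leading
    }

    static CompletableFuture<Void> deregisterWithoutLocalLeader(NoLeaderPolicy policy) {
        if (policy == NoLeaderPolicy.IGNORE) {
            return CompletableFuture.completedFuture(null);
        }
        CompletableFuture<Void> failed = new CompletableFuture<>();
        failed.completeExceptionally(
                new IllegalStateException("No leading ResourceManager in this process."));
        return failed;
    }
}
```

Under {{REPORT}}, the caller can tell that the deregistration did not 
happen locally and decide how to react, whereas {{IGNORE}} gives it no 
signal in either situation.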

> ResourceManagerServiceImpl's lifecycle can lead to exceptions
> -------------------------------------------------------------
>
>                 Key: FLINK-25893
>                 URL: https://issues.apache.org/jira/browse/FLINK-25893
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available
>
> The {{ResourceManagerServiceImpl}} lifecycle can lead to exceptions when 
> calling {{ResourceManagerServiceImpl.deregisterApplication}}. The problem 
> arises when the {{DispatcherResourceManagerComponent}} is shut down before 
> the {{ResourceManagerServiceImpl}} gains leadership or while it is starting 
> the {{ResourceManager}}.
> One problem is that {{deregisterApplication}} returns an exceptionally 
> completed future if there is no leading {{ResourceManager}}.
> Another problem is that if there is a leading {{ResourceManager}}, then it 
> can still be the case that it has not been started yet. If this is the case, 
> then 
> [ResourceManagerGateway.deregisterApplication|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImpl.java#L143]
>  will be discarded. The reason for this behaviour is that we create a 
> {{ResourceManager}} in one {{Runnable}} and only start it in another. As a 
> result, a {{deregisterApplication}} call can acquire the {{lock}} in between 
> the two.
> I'd suggest correcting the lifecycle and contract of 
> {{ResourceManagerServiceImpl.deregisterApplication}}.
> Please note that due to this problem, the error reporting of this method has 
> been suppressed. See FLINK-25885 for more details.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
