[jira] [Commented] (FLINK-14010) Dispatcher & JobManagers don't give up leadership when AM is shut down

TisonKun (Jira) Tue, 17 Sep 2019 07:05:19 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931490#comment-16931490
 ]


TisonKun commented on FLINK-14010:
----------------------------------

Agreed, it is reasonable that we try to shut down gracefully. I have started 
working on it, but I'm not sure what the shutdown future should look like.

There are two options in my mind, both of which introduce a {{shutdownFuture}} 
in {{ResourceManager}}.

1. {{ResourceManager#shutdownFuture}} is completed when 
{{YarnResourceManager#onShutdownRequest}} is called. We register a callback 
in {{DispatcherResourceManagerComponent#registerShutDownFuture}}: when 
{{ResourceManager#shutdownFuture}} completes, we complete 
{{DispatcherResourceManagerComponent#shutDownFuture}} exceptionally. The 
concern here is that {{ResourceManager#shutdownFuture}} is never completed if 
{{YarnResourceManager#onShutdownRequest}} is never called. I'm not sure whether 
that is acceptable.

2. {{ResourceManager#shutdownFuture}} is completed normally when 
{{ResourceManager#stopResourceManagerServices}} is called, and completed 
exceptionally when {{YarnResourceManager#onShutdownRequest}} is called. Again 
we register a callback in 
{{DispatcherResourceManagerComponent#registerShutDownFuture}}: when 
{{ResourceManager#shutdownFuture}} completes exceptionally, we complete 
{{DispatcherResourceManagerComponent#shutDownFuture}} exceptionally; when 
{{ResourceManager#shutdownFuture}} completes normally, we do nothing. This 
might be a bit more complex than option 1, and we would have to ensure that 
all code paths through which {{ResourceManager}} exits are covered.
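
To make the difference between the two options concrete, here is a minimal sketch using plain {{java.util.concurrent.CompletableFuture}}. It is not Flink code: {{SketchResourceManager}}, {{SketchComponent}} and the {{Option}} enum are hypothetical stand-ins for {{ResourceManager}} / {{YarnResourceManager}} and {{DispatcherResourceManagerComponent}}, and the wiring is only an assumption about how the callback could be registered.

```java
import java.util.concurrent.CompletableFuture;

class ShutdownFutureSketch {

    /** Which of the two proposals is being simulated (hypothetical helper). */
    enum Option { ONE, TWO }

    /** Hypothetical stand-in for ResourceManager / YarnResourceManager. */
    static class SketchResourceManager {
        final CompletableFuture<Void> shutdownFuture = new CompletableFuture<>();
        final Option option;

        SketchResourceManager(Option option) {
            this.option = option;
        }

        /** Regular stop path: only option 2 completes the future here, and normally. */
        void stopResourceManagerServices() {
            if (option == Option.TWO) {
                shutdownFuture.complete(null);
            }
        }

        /** Simulates YARN asking the AM to shut down. */
        void onShutdownRequest() {
            if (option == Option.ONE) {
                shutdownFuture.complete(null);
            } else {
                shutdownFuture.completeExceptionally(
                        new IllegalStateException("YARN requested AM shutdown"));
            }
        }
    }

    /** Hypothetical stand-in for DispatcherResourceManagerComponent. */
    static class SketchComponent {
        final CompletableFuture<Void> shutDownFuture = new CompletableFuture<>();

        SketchComponent(SketchResourceManager rm) {
            rm.shutdownFuture.whenComplete((ignored, failure) -> {
                if (rm.option == Option.ONE) {
                    // Option 1: any completion of the RM future fails the component.
                    shutDownFuture.completeExceptionally(
                            new IllegalStateException("ResourceManager was shut down"));
                } else if (failure != null) {
                    // Option 2: only an exceptional completion fails the component.
                    shutDownFuture.completeExceptionally(failure);
                }
                // Option 2, normal completion: do nothing.
            });
        }
    }

    public static void main(String[] args) {
        for (Option option : Option.values()) {
            SketchResourceManager rm = new SketchResourceManager(option);
            SketchComponent component = new SketchComponent(rm);
            rm.onShutdownRequest();
            System.out.println(option + ": component failed = "
                    + component.shutDownFuture.isCompletedExceptionally());
        }
    }
}
```

In this sketch both options fail the component future once {{onShutdownRequest}} is called. They differ on the regular stop path: under option 1 the RM future simply stays incomplete, while under option 2 it completes normally and the callback ignores it.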

WDYT [~till.rohrmann]?

> Dispatcher & JobManagers don't give up leadership when AM is shut down
> ----------------------------------------------------------------------
>
>                 Key: FLINK-14010
>                 URL: https://issues.apache.org/jira/browse/FLINK-14010
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Runtime / Coordination
>    Affects Versions: 1.7.2, 1.8.1, 1.9.0, 1.10.0
>            Reporter: TisonKun
>            Priority: Critical
>
> In a YARN deployment, the YARN RM may launch a new AM for the job 
> even if the previous AM has not terminated, for example, when the AMRM 
> heartbeat times out. In this common case the RM sends a shutdown request to 
> the previous AM and expects the AM to shut down properly.
> However, currently in {{YarnResourceManager}}, we handle this request in 
> {{onShutdownRequest}}, which simply closes the {{YarnResourceManager}} *but 
> not the Dispatcher and JobManagers*. Thus, the Dispatcher and JobManagers 
> launched in the new AM cannot be granted leadership properly. Visually,
> on previous AM: Dispatcher leader, JM leaders
> on new AM: ResourceManager leader
> Since on the client side or in per-job mode the JobManager address and port 
> are configured to point to the new AM, the whole cluster goes into an 
> unrecoverable, inconsistent state: clients send all queries to the Dispatcher 
> on the new AM, which is not the leader. In short, the Dispatcher and 
> JobManagers on the previous AM do not give up their leadership properly.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
