[jira] [Commented] (FLINK-14010) Dispatcher & JobManagers don't give up leadership when AM is shut down

TisonKun (Jira) Wed, 18 Sep 2019 06:07:58 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932423#comment-16932423
 ]


TisonKun commented on FLINK-14010:
----------------------------------

I have thought about this. The problem is that when the situation described 
here happens, {{ResourceManager#getTerminationFuture}} actually completes 
normally, so we cannot tell that the termination originated from 
{{YarnResourceManager#onShutdownRequest}}.
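
To illustrate the point with plain {{CompletableFuture}}: the following is a 
minimal sketch, not Flink code. {{FakeResourceManager}} and its methods are 
hypothetical stand-ins, and I assume both close paths complete the termination 
future with a normal value. Under that assumption a {{whenComplete}} callback 
observes the same thing in both cases.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class TerminationFutureSketch {

    /** Hypothetical stand-in for the ResourceManager; not Flink's actual class. */
    static final class FakeResourceManager {
        private final CompletableFuture<Void> terminationFuture = new CompletableFuture<>();

        CompletableFuture<Void> getTerminationFuture() {
            return terminationFuture;
        }

        /** Regular close, e.g. as part of an orderly cluster shutdown. */
        void closeNormally() {
            terminationFuture.complete(null);
        }

        /** Close triggered by an external shutdown request; assumed to complete normally too. */
        void onShutdownRequest() {
            terminationFuture.complete(null);
        }
    }

    /** Registers a whenComplete callback, fires the trigger, and reports what the callback saw. */
    private static String observe(Consumer<FakeResourceManager> trigger) {
        FakeResourceManager rm = new FakeResourceManager();
        StringBuilder seen = new StringBuilder();
        rm.getTerminationFuture().whenComplete(
                (ignored, throwable) -> seen.append(throwable == null ? "normal" : "exceptional"));
        trigger.accept(rm); // the non-async callback runs on the completing thread
        return seen.toString();
    }

    static String outcomeOfNormalClose() {
        return observe(FakeResourceManager::closeNormally);
    }

    static String outcomeOfShutdownRequest() {
        return observe(FakeResourceManager::onShutdownRequest);
    }

    public static void main(String[] args) {
        System.out.println(outcomeOfNormalClose());     // prints "normal"
        System.out.println(outcomeOfShutdownRequest()); // prints "normal"
    }
}
```

Both calls report the same outcome, so the callback alone carries no 
information about which trigger closed the ResourceManager.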

If we implement this by using {{ResourceManager#getTerminationFuture}} to 
trigger the shutdown of the {{DispatcherResourceManagerComponent}}, the 
assumption is:

If the ResourceManager is closed first (the termination future completes 
normally in both cases, so we cannot distinguish them in {{whenComplete}}), we 
infer an exceptional status and therefore complete 
{{DispatcherResourceManagerComponent#getShutDownFuture}} exceptionally. 
Otherwise the ResourceManager is closed normally by some other trigger, and 
either {{DispatcherResourceManagerComponent#getShutDownFuture}} is already 
completed or {{ClusterEntrypoint#shutdownAsync}} is guarded to execute only once.

I think this assumption is counter-intuitive: the ResourceManager terminates 
"normally", yet we complete the shutDownFuture exceptionally.
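
For reference, the assumption could be sketched roughly as below. This is 
only an illustration with hypothetical stand-ins ({{FakeComponent}}, 
{{FakeEntrypoint}}, the {{"SUCCEEDED"}} status string and the exception 
message are all made up), not the actual Flink classes or a proposed patch. It 
relies on {{completeExceptionally}} being a no-op on an already completed 
future, and on a compare-and-set flag to run the shutdown only once.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutDownWiringSketch {

    /** Hypothetical stand-in for DispatcherResourceManagerComponent. */
    static final class FakeComponent {
        final CompletableFuture<Void> rmTerminationFuture = new CompletableFuture<>();
        final CompletableFuture<String> shutDownFuture = new CompletableFuture<>();

        FakeComponent() {
            // Assumption under discussion: if the RM terminates while the component
            // is still running, treat it as a failure, even though the termination
            // future itself completed normally. No-op if shutDownFuture is already done.
            rmTerminationFuture.whenComplete((ignored, throwable) ->
                    shutDownFuture.completeExceptionally(new IllegalStateException(
                            "ResourceManager terminated before the component shut down")));
        }
    }

    /** Hypothetical stand-in for an entrypoint whose shutdown logic runs at most once. */
    static final class FakeEntrypoint {
        private final AtomicBoolean isShutDown = new AtomicBoolean(false);
        final AtomicInteger shutdownRuns = new AtomicInteger();

        void shutdownAsync() {
            if (isShutDown.compareAndSet(false, true)) {
                shutdownRuns.incrementAndGet();
            }
        }
    }

    /** RM closes first: the shut-down future ends up completed exceptionally. */
    static boolean rmFirstFailsShutDownFuture() {
        FakeComponent component = new FakeComponent();
        component.rmTerminationFuture.complete(null);
        return component.shutDownFuture.isCompletedExceptionally();
    }

    /** Component shuts down first: the later RM termination changes nothing. */
    static boolean componentFirstStaysNormal() {
        FakeComponent component = new FakeComponent();
        component.shutDownFuture.complete("SUCCEEDED");
        component.rmTerminationFuture.complete(null);
        return !component.shutDownFuture.isCompletedExceptionally();
    }

    /** Two shutdown calls result in a single execution of the guarded logic. */
    static int shutdownRunsAfterTwoCalls() {
        FakeEntrypoint entrypoint = new FakeEntrypoint();
        entrypoint.shutdownAsync();
        entrypoint.shutdownAsync();
        return entrypoint.shutdownRuns.get();
    }

    public static void main(String[] args) {
        System.out.println(rmFirstFailsShutDownFuture());
        System.out.println(componentFirstStaysNormal());
        System.out.println(shutdownRunsAfterTwoCalls());
    }
}
```

The first case is exactly the part I find counter-intuitive: a normally 
completed termination future is turned into an exceptional shut-down result.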

> Dispatcher & JobManagers don't give up leadership when AM is shut down
> ----------------------------------------------------------------------
>
>                 Key: FLINK-14010
>                 URL: https://issues.apache.org/jira/browse/FLINK-14010
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Runtime / Coordination
>    Affects Versions: 1.7.2, 1.8.1, 1.9.0, 1.10.0
>            Reporter: TisonKun
>            Priority: Critical
>
> In a YARN deployment, the YARN RM may launch a new AM for the job 
> even if the previous AM has not terminated, for example, when the AMRM 
> heartbeat times out. In this common case the RM sends a shutdown request to 
> the previous AM and expects the AM to shut down properly.
> However, currently in {{YarnResourceManager}}, we handle this request in 
> {{onShutdownRequest}}, which simply closes the {{YarnResourceManager}} *but 
> not the Dispatcher and JobManagers*. Thus, the Dispatcher and JobManagers 
> launched in the new AM cannot be granted leadership properly. Visually,
> on previous AM: Dispatcher leader, JM leaders
> on new AM: ResourceManager leader
> Since on the client side or in per-job mode the JobManager address and port 
> are configured as those of the new AM, the whole cluster goes into an 
> unrecoverable inconsistent state: all clients query the dispatcher on the new 
> AM, which is not the leader. In short, the Dispatcher and JobManagers on the 
> previous AM do not give up their leadership properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
