[ 
https://issues.apache.org/jira/browse/FLINK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932676#comment-16932676
 ] 

Till Rohrmann commented on FLINK-14010:
---------------------------------------

Can't we say that we always complete 
{{DispatcherResourceManagerComponent#shutDownFuture}} exceptionally if 
{{ResourceManager.getTerminationFuture()}} completes while 
{{DispatcherResourceManagerComponent#isRunning}} is {{true}}? The contract 
could be that the {{ResourceManager}} always needs to run, and if it stops, 
this indicates that something went wrong and we should stop the 
{{ClusterEntrypoint}}. However, one could just as well call {{onFatalError}} 
from within the {{ResourceManager}}, as you initially proposed.
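
To make the proposed contract concrete, here is a minimal sketch using only 
{{java.util.concurrent}}. The class {{ComponentSketch}} and its constructor 
argument are hypothetical stand-ins, not Flink's actual 
{{DispatcherResourceManagerComponent}}; the sketch only shows the idea that an 
unexpected {{ResourceManager}} termination fails the shut-down future, while a 
regular shutdown completes it normally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical, simplified stand-in for DispatcherResourceManagerComponent. */
class ComponentSketch {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final CompletableFuture<Void> shutDownFuture = new CompletableFuture<>();

    /** The argument plays the role of ResourceManager.getTerminationFuture(). */
    ComponentSketch(CompletableFuture<Void> resourceManagerTerminationFuture) {
        resourceManagerTerminationFuture.whenComplete((ignored, failure) -> {
            // The ResourceManager stopped while the component was still running:
            // treat this as an error, whether or not the termination itself failed.
            if (running.get()) {
                shutDownFuture.completeExceptionally(
                    new IllegalStateException(
                        "ResourceManager terminated while the component was running",
                        failure));
            }
        });
    }

    CompletableFuture<Void> getShutDownFuture() {
        return shutDownFuture;
    }

    /** Regular shutdown: flips the running flag and completes the future normally. */
    void close() {
        if (running.compareAndSet(true, false)) {
            shutDownFuture.complete(null);
        }
    }
}
```

Under this assumption, whoever owns the component (the {{ClusterEntrypoint}} in 
the discussion above) would react to an exceptionally completed shut-down 
future by stopping the whole process, which in turn releases the leadership 
held by the Dispatcher and JobManagers.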

> Dispatcher & JobManagers don't give up leadership when AM is shut down
> ----------------------------------------------------------------------
>
>                 Key: FLINK-14010
>                 URL: https://issues.apache.org/jira/browse/FLINK-14010
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Runtime / Coordination
>    Affects Versions: 1.7.2, 1.8.1, 1.9.0, 1.10.0
>            Reporter: TisonKun
>            Priority: Critical
>
> In a YARN deployment, the YARN RM may launch a new AM for the job even if the 
> previous AM has not terminated, for example when the AMRM heartbeat times 
> out. In this common case, the RM sends a shutdown request to the previous AM 
> and expects that AM to shut down properly.
> However, {{YARNResourceManager}} currently handles this request in 
> {{onShutdownRequest}}, which simply closes the {{YARNResourceManager}} *but not 
> the Dispatcher and JobManagers*. Thus, the Dispatcher and JobManagers launched 
> in the new AM cannot be granted leadership properly. Visually,
> on previous AM: Dispatcher leader, JM leaders
> on new AM: ResourceManager leader
> Since on the client side or in per-job mode the JobManager address and port 
> are configured to point to the new AM, the whole cluster ends up in an 
> unrecoverable, inconsistent state: all client queries go to the Dispatcher on 
> the new AM, which is not the leader. In short, the Dispatcher and JobManagers 
> on the previous AM do not give up their leadership properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
