[jira] [Commented] (FLINK-14010) Dispatcher & JobManagers don't give up leadership when AM is shut down

Till Rohrmann (Jira) Tue, 17 Sep 2019 05:37:11 -0700


    [ https://issues.apache.org/jira/browse/FLINK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931386#comment-16931386 ]


Till Rohrmann commented on FLINK-14010:
---------------------------------------

{{#onFatalError}} could also be an option, but I would prefer to distinguish 
the two cases here. I would consider {{#onShutdownRequest}} a request and not 
an error case. Hence, I would suggest first trying to shut down gracefully. If 
this does not work, then we could fail fatally.
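
To make the suggestion concrete, here is a rough Java sketch of "graceful first, fatal as a fallback". It is not Flink code: {{ShutdownRequestSketch}}, {{Outcome}}, the {{gracefulShutdown}} supplier, the {{fatalErrorHandler}} consumer and the timeout parameter are all hypothetical names chosen for illustration, and the call blocks only to keep the example short.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names are actual Flink APIs. */
final class ShutdownRequestSketch {

    enum Outcome { GRACEFUL, FATAL }

    /**
     * Treats a shutdown request as a request rather than an error: first
     * attempt a graceful shutdown, and escalate to the fatal-error path only
     * if that attempt fails or does not finish within the timeout.
     */
    static Outcome onShutdownRequest(
            Supplier<CompletableFuture<Void>> gracefulShutdown, // assumed hook
            Consumer<Throwable> fatalErrorHandler,              // assumed hook
            long timeoutMillis) {
        try {
            // Wait for the graceful shutdown to complete (blocking for brevity).
            gracefulShutdown.get().get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.GRACEFUL;
        } catch (Exception e) {
            if (e instanceof InterruptedException) {
                // Preserve the interrupt flag for callers further up the stack.
                Thread.currentThread().interrupt();
            }
            // Graceful shutdown failed or timed out: fall back to failing fatally.
            fatalErrorHandler.accept(e);
            return Outcome.FATAL;
        }
    }
}
```

In a real component the wait would presumably be asynchronous rather than blocking, and the graceful path would have to cover the Dispatcher and JobManagers as well, so that they give up their leadership.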

> Dispatcher & JobManagers don't give up leadership when AM is shut down
> ----------------------------------------------------------------------
>
>                 Key: FLINK-14010
>                 URL: https://issues.apache.org/jira/browse/FLINK-14010
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Runtime / Coordination
>    Affects Versions: 1.7.2, 1.8.1, 1.9.0, 1.10.0
>            Reporter: TisonKun
>            Priority: Critical
>
> In the YARN deployment scenario, the YARN RM may launch a new AM for the job 
> even if the previous AM has not terminated, for example, when the AMRM 
> heartbeat times out. In this common case, the RM sends a shutdown request to 
> the previous AM and expects the AM to shut down properly.
> However, currently in {{YARNResourceManager}}, we handle this request in 
> {{onShutdownRequest}}, which simply closes the {{YARNResourceManager}} *but not 
> the Dispatcher and JobManagers*. Thus, the Dispatcher and JobManagers launched 
> in the new AM cannot be granted leadership properly. Visually,
> on previous AM: Dispatcher leader, JM leaders
> on new AM: ResourceManager leader
> Since on the client side or in per-job mode the JobManager address and port 
> are configured as those of the new AM, the whole cluster goes into an 
> unrecoverable, inconsistent state: all clients query the Dispatcher on the new 
> AM, which is not the leader. In short, the Dispatcher and JobManagers on the 
> previous AM do not give up their leadership properly.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
