[
https://issues.apache.org/jira/browse/FLINK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann updated FLINK-14010:
----------------------------------
Fix Version/s: 1.8.3
1.10.0
> Dispatcher & JobManagers don't give up leadership when AM is shut down
> ----------------------------------------------------------------------
>
> Key: FLINK-14010
> URL: https://issues.apache.org/jira/browse/FLINK-14010
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN, Runtime / Coordination
> Affects Versions: 1.7.2, 1.8.2, 1.9.0, 1.10.0
> Reporter: tison
> Assignee: tison
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.10.0, 1.9.1, 1.8.3
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In a YARN deployment, the YARN RM may launch a new AM for the job even if
> the previous AM has not terminated, for example when the AMRM heartbeat
> times out. In this common case the RM sends a shutdown request to the
> previous AM and expects the AM to shut down properly.
> However, {{YARNResourceManager}} currently handles this request in
> {{onShutdownRequest}}, which simply closes the {{YARNResourceManager}} *but
> not the Dispatcher and JobManagers*. Thus, the Dispatcher and JobManagers
> launched in the new AM cannot be granted leadership. Visually,
> on previous AM: Dispatcher leader, JM leaders
> on new AM: ResourceManager leader
> Since on the client side and in per-job mode the JobManager address and port
> are configured to point at the new AM, the whole cluster ends up in an
> unrecoverable, inconsistent state: all clients query the Dispatcher on the
> new AM, which is not the leader. In short, the Dispatcher and JobManagers on
> the previous AM do not give up their leadership properly.
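
The leadership split described above can be reproduced with a small toy model. This is only a sketch under stated assumptions: `AmShutdownSketch`, `LeaderElection`, `simulate`, and the flag `closeAllComponents` are hypothetical names made up for illustration and are not Flink or YARN APIs. The model only assumes that a leader slot is granted to a new contender once the current holder has released it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: every class and method here is hypothetical, not Flink's API.
final class AmShutdownSketch {

    /** One leader slot per component, standing in for a real leader election service. */
    static final class LeaderElection {
        private final Map<String, String> leaderByComponent = new LinkedHashMap<>();

        /** Grants leadership only if no one currently holds the slot. */
        boolean tryAcquire(String component, String am) {
            return leaderByComponent.putIfAbsent(component, am) == null;
        }

        /** Releases the slot, but only if the given AM is the current holder. */
        void release(String component, String am) {
            leaderByComponent.remove(component, am);
        }

        Map<String, String> snapshot() {
            return new LinkedHashMap<>(leaderByComponent);
        }
    }

    static final String[] COMPONENTS = {"ResourceManager", "Dispatcher", "JobManager"};

    /**
     * Simulates the shutdown request followed by the start of a new AM.
     * closeAllComponents == false models the behaviour reported in this issue;
     * true models a shutdown that also closes the Dispatcher and JobManagers.
     */
    static Map<String, String> simulate(boolean closeAllComponents) {
        LeaderElection election = new LeaderElection();
        for (String component : COMPONENTS) {
            election.tryAcquire(component, "previous AM");
        }

        // Shutdown request arrives at the previous AM.
        election.release("ResourceManager", "previous AM");
        if (closeAllComponents) {
            election.release("Dispatcher", "previous AM");
            election.release("JobManager", "previous AM");
        }

        // The new AM starts and contends for every leader slot.
        for (String component : COMPONENTS) {
            election.tryAcquire(component, "new AM");
        }
        return election.snapshot();
    }

    public static void main(String[] args) {
        System.out.println("only ResourceManager closed: " + simulate(false));
        System.out.println("all components closed:       " + simulate(true));
    }
}
```

In the first run the Dispatcher and JobManager slots stay with the previous AM while the new AM only wins the ResourceManager slot, which is the inconsistent state from the description. In the second run the new AM holds all three slots.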
--
This message was sent by Atlassian Jira
(v8.3.4#803005)