[
https://issues.apache.org/jira/browse/FLINK-14010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
TisonKun updated FLINK-14010:
-----------------------------
Affects Version/s: (was: 1.8.1)
1.8.2
> Dispatcher & JobManagers don't give up leadership when AM is shut down
> ----------------------------------------------------------------------
>
> Key: FLINK-14010
> URL: https://issues.apache.org/jira/browse/FLINK-14010
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN, Runtime / Coordination
> Affects Versions: 1.7.2, 1.8.2, 1.9.0, 1.10.0
> Reporter: TisonKun
> Assignee: TisonKun
> Priority: Critical
>
> In a YARN deployment, the YARN RM may launch a new AM for the job even if
> the previous AM has not terminated, for example when the AMRM heartbeat
> times out. In this common case, the RM sends a shutdown request to the
> previous AM and expects it to shut down properly.
> However, {{YARNResourceManager}} currently handles this request in
> {{onShutdownRequest}}, which simply closes the {{YARNResourceManager}} *but
> not the Dispatcher and JobManagers*. Thus, the Dispatcher and JobManagers
> launched in the new AM cannot be granted leadership properly. Visually,
> on previous AM: Dispatcher leader, JM leaders
> on new AM: ResourceManager leader
> Since on the client side, or in per-job mode, the JobManager address and
> port are configured as those of the new AM, the whole cluster goes into an
> unrecoverable, inconsistent state: all client queries go to the Dispatcher
> on the new AM, which is not the leader. In short, the Dispatcher and
> JobManagers on the previous AM do not give up their leadership properly.
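
The leadership split described above can be reproduced with a toy model. This is only a sketch of the reported behaviour, not Flink code: `LeadershipSketch`, `Component`, `OldAppMaster` and both shutdown methods are hypothetical names invented for illustration, and closing all components is just one possible remedy, not necessarily the fix that was committed.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are Flink or YARN classes.
final class LeadershipSketch {

    /** A component running on the previous AM that may hold leadership. */
    static final class Component {
        final String name;
        boolean leader = true; // assumption: everything on the previous AM starts as leader

        Component(String name) {
            this.name = name;
        }

        void close() {
            leader = false; // in this model, closing a component gives up its leadership
        }
    }

    /** Hypothetical stand-in for the previous AM process. */
    static final class OldAppMaster {
        final Component resourceManager = new Component("ResourceManager");
        final Component dispatcher = new Component("Dispatcher");
        final Component jobManager = new Component("JobManager");

        // Behaviour described in the report: only the ResourceManager is closed.
        void onShutdownRequestAsReported() {
            resourceManager.close();
        }

        // One possible remedy (an assumption): close every leadership holder.
        void onShutdownRequestClosingAll() {
            resourceManager.close();
            dispatcher.close();
            jobManager.close();
        }

        // Names of the components that still hold leadership on this AM.
        List<String> leaders() {
            List<String> result = new ArrayList<>();
            for (Component c : new Component[] {resourceManager, dispatcher, jobManager}) {
                if (c.leader) {
                    result.add(c.name);
                }
            }
            return result;
        }
    }

    /** Returns which components still hold leadership on the previous AM after shutdown. */
    static List<String> leadersStillOnOldAm(boolean closeAll) {
        OldAppMaster am = new OldAppMaster();
        if (closeAll) {
            am.onShutdownRequestClosingAll();
        } else {
            am.onShutdownRequestAsReported();
        }
        return am.leaders();
    }

    public static void main(String[] args) {
        System.out.println(leadersStillOnOldAm(false)); // prints [Dispatcher, JobManager]
        System.out.println(leadersStillOnOldAm(true));  // prints []
    }
}
```

In the first case the Dispatcher and JobManager leadership stays on the previous AM, so their counterparts on the new AM can never be elected, which matches the "on previous AM / on new AM" picture above.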
--
This message was sent by Atlassian Jira
(v8.3.4#803005)