Tao Yang created HDFS-14689:
-------------------------------
Summary: AM container might leak
Key: HDFS-14689
URL: https://issues.apache.org/jira/browse/HDFS-14689
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Tao Yang
Assignee: Tao Yang
There is a risk that the AM container leaks when the NM exits unexpectedly
while the AM container is still localizing, if the AM expiry interval
(conf-key: yarn.am.liveness-monitor.expiry-interval-ms) is less than the NM
expiry interval (conf-key: yarn.nm.liveness-monitor.expiry-interval-ms).
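To make the precondition above concrete, here is a minimal sketch that checks
whether a given set of settings falls into the risky window. The class
ExpiryIntervalCheck, its method name and the fallback value are assumptions
made for illustration and are not part of Hadoop; only the two conf-keys come
from the description.

```java
import java.util.Map;

// Hypothetical helper, not part of Hadoop.
final class ExpiryIntervalCheck {
    static final String AM_EXPIRY_KEY = "yarn.am.liveness-monitor.expiry-interval-ms";
    static final String NM_EXPIRY_KEY = "yarn.nm.liveness-monitor.expiry-interval-ms";

    // Assumed fallback (10 minutes) used by this sketch when a key is unset.
    static final long ASSUMED_DEFAULT_MS = 600_000L;

    // Returns true when the AM would be expired before the NM, which is the
    // window in which the leak described above can occur.
    static boolean amExpiresFirst(Map<String, String> conf) {
        long amExpiryMs = readMillis(conf, AM_EXPIRY_KEY);
        long nmExpiryMs = readMillis(conf, NM_EXPIRY_KEY);
        return amExpiryMs < nmExpiryMs;
    }

    private static long readMillis(Map<String, String> conf, String key) {
        String raw = conf.get(key);
        return raw == null ? ASSUMED_DEFAULT_MS : Long.parseLong(raw.trim());
    }
}
```

For example, an AM expiry of 300000 ms combined with an NM expiry of 600000 ms
makes amExpiresFirst return true.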
RMAppAttempt state changes as follows:
{noformat}
LAUNCHED/RUNNING – event:EXPIRED(FinalSavingTransition)
--> FINAL_SAVING – event:ATTEMPT_UPDATE_SAVED(FinalStateSavedTransition /
ExpiredTransition: send AMLauncherEventType.CLEANUP ) --> FAILED
{noformat}
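The transition chain above can be read as a small state machine. The sketch
below models only the two transitions listed here; the class
AttemptExpirySketch and its methods are hypothetical, and the real attempt
state machine has many more states and events than this simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified model of the two transitions listed above.
final class AttemptExpirySketch {
    enum State { LAUNCHED, RUNNING, FINAL_SAVING, FAILED }
    enum Event { EXPIRED, ATTEMPT_UPDATE_SAVED }

    private State state;
    // Records events sent to other components, as plain strings.
    private final List<String> sentEvents = new ArrayList<>();

    AttemptExpirySketch(State initial) {
        this.state = initial;
    }

    void handle(Event event) {
        if ((state == State.LAUNCHED || state == State.RUNNING) && event == Event.EXPIRED) {
            // FinalSavingTransition: persist the final state first.
            state = State.FINAL_SAVING;
        } else if (state == State.FINAL_SAVING && event == Event.ATTEMPT_UPDATE_SAVED) {
            // ExpiredTransition: this is where the CLEANUP event is sent.
            sentEvents.add("AMLauncherEventType.CLEANUP");
            state = State.FAILED;
        } else {
            throw new IllegalStateException("Unhandled " + event + " in state " + state);
        }
    }

    State state() {
        return state;
    }

    List<String> sentEvents() {
        return sentEvents;
    }
}
```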
AMLauncherEventType.CLEANUP is handled by AMLauncher#cleanup, which
internally calls ContainerManagementProtocol#stopContainer to stop the AM
container by communicating with the NM. If the NM cannot be reached, the
cleanup is simply skipped without any log output.
I think that in this case we can complete the AM container in the scheduler
when stopping it fails, so that it has a chance to be stopped when the NM
reconnects with the RM.
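A rough sketch of the proposed behaviour, under stated assumptions: the
interfaces NodeManagerClient and SchedulerHook and the class
CleanupFallbackSketch are hypothetical stand-ins and do not reflect the actual
AMLauncher or scheduler APIs. The sketch only shows the control flow: try to
stop the container, and on failure log the error and hand the container to the
scheduler instead of skipping silently.

```java
import java.io.IOException;

// Hypothetical sketch of the proposed fallback; not the real AMLauncher code.
final class CleanupFallbackSketch {
    /** Hypothetical stand-in for the stopContainer call made to the NM. */
    interface NodeManagerClient {
        void stopContainer(String containerId) throws IOException;
    }

    /** Hypothetical stand-in for marking a container complete in the scheduler. */
    interface SchedulerHook {
        void completeContainer(String containerId, String diagnostics);
    }

    // Returns true if the NM stopped the container, false if the fallback ran.
    static boolean cleanup(String containerId, NodeManagerClient nm, SchedulerHook scheduler) {
        try {
            nm.stopContainer(containerId);
            return true;
        } catch (IOException e) {
            // Log the failure instead of dropping it silently.
            String diagnostics = "Failed to stop AM container " + containerId
                    + ": " + e.getMessage();
            System.err.println(diagnostics);
            // Fallback: let the scheduler complete the container so that it
            // can still be cleaned up once the NM reconnects.
            scheduler.completeContainer(containerId, diagnostics);
            return false;
        }
    }
}
```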
I hope to hear your thoughts. Thank you!