[
https://issues.apache.org/jira/browse/HDFS-14689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tao Yang resolved HDFS-14689.
-----------------------------
Resolution: Invalid
Assignee: (was: Tao Yang)
This issue should have been filed in the YARN project, sorry about that!
> AM container might leak
> -----------------------
>
> Key: HDFS-14689
> URL: https://issues.apache.org/jira/browse/HDFS-14689
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Tao Yang
> Priority: Major
>
> There is a risk that the AM container leaks when the NM exits unexpectedly
> while the AM container is still localizing, if the AM expiry interval (conf-key:
> yarn.am.liveness-monitor.expiry-interval-ms) is less than the NM expiry interval
> (conf-key: yarn.nm.liveness-monitor.expiry-interval-ms).
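The precondition above can be sketched as a small check. This is only an illustration: `ExpiryIntervalCheck` and its methods are hypothetical names, not Hadoop classes, and the values are read from plain `java.util.Properties` rather than a Hadoop `Configuration`. Only the two configuration keys come from the report.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hadoop: flags the configuration described in the report.
final class ExpiryIntervalCheck {
    static final String AM_EXPIRY_KEY = "yarn.am.liveness-monitor.expiry-interval-ms";
    static final String NM_EXPIRY_KEY = "yarn.nm.liveness-monitor.expiry-interval-ms";

    private ExpiryIntervalCheck() {}

    // True when the AM attempt would expire before the RM considers the NM expired.
    static boolean amExpiresBeforeNm(long amExpiryMs, long nmExpiryMs) {
        return amExpiryMs < nmExpiryMs;
    }

    // Reads both intervals from plain Properties; both keys must be present.
    static boolean isAtRisk(Properties conf) {
        long amExpiryMs = Long.parseLong(conf.getProperty(AM_EXPIRY_KEY));
        long nmExpiryMs = Long.parseLong(conf.getProperty(NM_EXPIRY_KEY));
        return amExpiresBeforeNm(amExpiryMs, nmExpiryMs);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Example values chosen for illustration only.
        conf.setProperty(AM_EXPIRY_KEY, "300000");
        conf.setProperty(NM_EXPIRY_KEY, "600000");
        System.out.println("at risk: " + isAtRisk(conf));
    }
}
```

With the AM interval shorter than the NM interval, the attempt can expire and trigger cleanup while the RM still treats the unreachable NM as alive, which is the window the report describes.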
> RMAppAttempt state changes as follows:
> {noformat}
> LAUNCHED/RUNNING – event:EXPIRED(FinalSavingTransition)
> --> FINAL_SAVING – event:ATTEMPT_UPDATE_SAVED(FinalStateSavedTransition /
> ExpiredTransition: send AMLauncherEventType.CLEANUP ) --> FAILED
> {noformat}
> AMLauncherEventType.CLEANUP is handled by AMLauncher#cleanup, which
> internally calls ContainerManagementProtocol#stopContainer to stop the AM
> container by communicating with the NM. If the NM can't be reached, the
> cleanup is skipped without logging anything.
> I think in this case we can complete the AM container in the scheduler when
> stopping it fails, so that it has a chance to be stopped when the NM
> reconnects to the RM.
> I hope to hear your thoughts. Thank you!
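The proposal above could be modeled roughly as follows. This is a toy sketch under stated assumptions, not the actual AMLauncher or scheduler code: `AmCleanupSketch`, `NodeClient`, `cleanup`, and `onNodeReconnect` are all hypothetical names, and the sketch tracks a single node only. It shows the idea of logging a failed stop and remembering the container, so that it can be cleaned up once the NM reconnects.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Toy model of the proposal; none of these types are Hadoop classes.
final class AmCleanupSketch {
    // Stand-in for the RPC to the NM (the report names ContainerManagementProtocol#stopContainer).
    interface NodeClient {
        void stopContainer(String containerId) throws IOException;
    }

    // Containers marked complete on the RM side that the NM has not yet been told to stop.
    private final Set<String> pendingCleanup = new HashSet<>();

    // Tries to stop the AM container; on failure, logs and records it instead of silently skipping.
    boolean cleanup(NodeClient nm, String containerId) {
        try {
            nm.stopContainer(containerId);
            return true;
        } catch (IOException e) {
            System.err.println("Failed to stop " + containerId + ": " + e.getMessage());
            pendingCleanup.add(containerId);
            return false;
        }
    }

    // Called when the NM reconnects: returns the containers it should be asked to clean up.
    Set<String> onNodeReconnect() {
        Set<String> toCleanup = new HashSet<>(pendingCleanup);
        pendingCleanup.clear();
        return toCleanup;
    }
}
```

The design choice illustrated here is that a failed stop call leaves a record behind rather than being dropped, so the reconnect path has something to act on.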
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)