[ 
https://issues.apache.org/jira/browse/HDFS-14689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897707#comment-16897707
 ] 

Tao Yang commented on HDFS-14689:
---------------------------------

Thanks [~jojochuang], got it.

> AM container might leak
> -----------------------
>
>                 Key: HDFS-14689
>                 URL: https://issues.apache.org/jira/browse/HDFS-14689
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Tao Yang
>            Priority: Major
>
> There is a risk that the AM container might leak when the NM exits 
> unexpectedly while the AM container is localizing, if the AM expiry interval 
> (conf-key: yarn.am.liveness-monitor.expiry-interval-ms) is less than the NM 
> expiry interval (conf-key: yarn.nm.liveness-monitor.expiry-interval-ms).
> The RMAppAttempt state changes as follows:
> {noformat}
> LAUNCHED/RUNNING – event:EXPIRED(FinalSavingTransition) 
>  --> FINAL_SAVING – event:ATTEMPT_UPDATE_SAVED(FinalStateSavedTransition / 
> ExpiredTransition: send AMLauncherEventType.CLEANUP )  --> FAILED
> {noformat}
> AMLauncherEventType.CLEANUP is handled by AMLauncher#cleanup, which 
> internally calls ContainerManagementProtocol#stopContainer to stop the AM 
> container by communicating with the NM. If the NM can't be reached, the 
> cleanup is skipped without any log message.
> I think in this case we can complete the AM container in the scheduler when 
> stopping it fails, so that it has a chance to be stopped when the NM 
> reconnects with the RM.
> I hope to hear your thoughts. Thank you!
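
The fallback proposed in the description could look roughly like the sketch below. It is only an illustration of the idea, not the actual YARN code: AmCleanupSketch, NodeManagerClient, Scheduler, and completeContainer are hypothetical stand-ins, and the real AMLauncher and scheduler APIs differ. The sketch assumes that a failed stop request surfaces as an exception.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are real YARN classes.
class AmCleanupSketch {

    /** Stand-in for the RM-to-NM call that stops a container. */
    interface NodeManagerClient {
        void stopContainer(String containerId) throws Exception;
    }

    /** Stand-in for the scheduler; records containers it has marked completed. */
    static class Scheduler {
        final List<String> completed = new ArrayList<>();

        void completeContainer(String containerId) {
            completed.add(containerId);
        }
    }

    /**
     * Tries to stop the AM container on the NM. If the call fails, the failure
     * is logged (instead of being skipped silently) and the container is
     * completed in the scheduler. Returns true when the NM accepted the stop.
     */
    static boolean cleanup(String containerId, NodeManagerClient nm, Scheduler scheduler) {
        try {
            nm.stopContainer(containerId);
            return true;
        } catch (Exception e) {
            System.err.println("Failed to stop AM container " + containerId
                    + ": " + e.getMessage());
            scheduler.completeContainer(containerId);
            return false;
        }
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        // Simulate an NM that exited unexpectedly and cannot be reached.
        NodeManagerClient unreachable = id -> {
            throw new java.io.IOException("connection refused");
        };
        boolean stopped = cleanup("container_am_1", unreachable, scheduler);
        System.out.println("stopped=" + stopped
                + " completedInScheduler=" + scheduler.completed);
    }
}
```

The sketch covers only the failure path inside cleanup. How the completed container then gets stopped once the NM reconnects is left to the existing RM and NM logic and is not modeled here.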



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
