[
https://issues.apache.org/jira/browse/MAPREDUCE-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ming Ma resolved MAPREDUCE-6135.
--------------------------------
Resolution: Duplicate
Thanks, Jason. Resolving this as a duplicate. The discussion will continue over
at MAPREDUCE-5502. It looks like Robert also mentioned the approach of
rerunning the AM for cleanup in MAPREDUCE-4428.
> Job staging directory remains if MRAppMaster is OOM
> ---------------------------------------------------
>
> Key: MAPREDUCE-6135
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6135
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Ming Ma
>
> If MRAppMaster attempts run out of memory, they won't go through the normal
> job cleanup process that moves history files to the history server location.
> When customers try to find out why a job failed, the data won't be available
> on the history server web UI.
> The workaround is to extract the container ID and NodeManager ID from the
> jhist file in the job staging directory, then use the "yarn logs" command to
> get the AM logs.
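> A minimal sketch of that workaround, under the assumption that the leftover
> .jhist file was written far enough to contain the AM-started events.
> JobHistoryParser and its AMInfo records are existing Hadoop APIs; the class
> name and the way the staging path is passed in are illustrative only:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser;
> import org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.AMInfo;
> import org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.JobInfo;
>
> public class AmLogLocator {
>   public static void main(String[] args) throws Exception {
>     // args[0]: path to the .jhist file left in the job staging directory
>     Configuration conf = new Configuration();
>     Path jhist = new Path(args[0]);
>     FileSystem fs = jhist.getFileSystem(conf);
>
>     // Parse the history file; each AM attempt it recorded carries the
>     // container id and the NodeManager address needed to fetch the AM logs.
>     JobInfo jobInfo = new JobHistoryParser(fs, jhist).parse();
>     for (AMInfo am : jobInfo.getAMInfos()) {
>       System.out.println("attempt=" + am.getAppAttemptId()
>           + " containerId=" + am.getContainerId()
>           + " nodeAddress=" + am.getNodeManagerHost()
>           + ":" + am.getNodeManagerPort());
>     }
>   }
> }
>
> The printed values can then be fed to the log CLI, for example
> "yarn logs -applicationId <appId> -containerId <containerId>
> -nodeAddress <host:port>", to pull that AM attempt's logs.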
> It would be great if the platform could take care of this by moving these
> history files to the history server automatically when AM attempts don't
> exit properly.
> We discussed some ideas on how to address this and would like to get
> suggestions from others. We are not sure if the timeline server design
> covers this scenario.
> 1. Define a protocol for YARN to tell the AppMaster "you have exceeded the
> AM max attempts, please clean up". For example, YARN could launch the
> AppMaster one more time after the max attempts are exhausted, and
> MRAppMaster would treat that as the indication that this is a clean-up-only
> attempt (see the first sketch below).
> 2. Have a program periodically check job statuses and move files from the
> job staging directory to the history server for finished jobs (see the
> second sketch below).
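> A rough sketch of how the AM side of idea 1 could detect a clean-up-only
> attempt, assuming YARN were changed to launch one extra attempt past the
> configured maximum (it does not do this today). The environment variables
> read here are the ones YARN already provides to AM containers:
>
> import org.apache.hadoop.yarn.api.ApplicationConstants;
> import org.apache.hadoop.yarn.api.ApplicationConstants.Environment;
> import org.apache.hadoop.yarn.api.records.ContainerId;
> import org.apache.hadoop.yarn.util.ConverterUtils;
>
> public class CleanupAttemptCheck {
>   // True if this attempt is beyond the configured max AM attempts, which
>   // under the proposed protocol would mean "do cleanup only, run no tasks".
>   static boolean isCleanupOnlyAttempt() {
>     ContainerId cid = ConverterUtils.toContainerId(
>         System.getenv(Environment.CONTAINER_ID.name()));
>     int attemptId = cid.getApplicationAttemptId().getAttemptId();
>     int maxAttempts = Integer.parseInt(
>         System.getenv(ApplicationConstants.MAX_APP_ATTEMPTS_ENV));
>     return attemptId > maxAttempts;
>   }
> }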
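> And a minimal sketch of idea 2, a periodic sweeper, assuming the default
> <staging-dir>/<user>/.staging/<job-id> layout and that dropping the .jhist
> into the intermediate done directory is enough for the history server to
> pick it up (a real version would also need the job conf and summary files
> and the history server's file-naming conventions; the fallback paths below
> are illustrative):
>
> import java.util.EnumSet;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.yarn.api.records.ApplicationId;
> import org.apache.hadoop.yarn.api.records.YarnApplicationState;
> import org.apache.hadoop.yarn.client.api.YarnClient;
>
> public class StagingJhistSweeper {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     Path stagingRoot = new Path(conf.get(
>         "yarn.app.mapreduce.am.staging-dir", "/tmp/hadoop-yarn/staging"));
>     Path doneDir = new Path(conf.get(
>         "mapreduce.jobhistory.intermediate-done-dir", "/mr-history/tmp"));
>     FileSystem fs = stagingRoot.getFileSystem(conf);
>
>     YarnClient yarn = YarnClient.createYarnClient();
>     yarn.init(conf);
>     yarn.start();
>     try {
>       // Walk <staging>/<user>/.staging/<job-id> looking for leftover jobs.
>       for (FileStatus user : fs.listStatus(stagingRoot)) {
>         Path dotStaging = new Path(user.getPath(), ".staging");
>         if (!fs.exists(dotStaging)) continue;
>         for (FileStatus jobDir : fs.listStatus(dotStaging)) {
>           String name = jobDir.getPath().getName();
>           if (!name.startsWith("job_")) continue;
>           // job_<clusterTs>_<seq> maps to application_<clusterTs>_<seq>.
>           String[] p = name.split("_");
>           ApplicationId appId = ApplicationId.newInstance(
>               Long.parseLong(p[1]), Integer.parseInt(p[2]));
>           YarnApplicationState state =
>               yarn.getApplicationReport(appId).getYarnApplicationState();
>           if (!EnumSet.of(YarnApplicationState.FINISHED,
>               YarnApplicationState.FAILED,
>               YarnApplicationState.KILLED).contains(state)) continue;
>           // The application is done; move any leftover history file to
>           // where the history server scans for newly finished jobs.
>           for (FileStatus f : fs.listStatus(jobDir.getPath())) {
>             if (f.getPath().getName().endsWith(".jhist")) {
>               fs.rename(f.getPath(),
>                   new Path(doneDir, f.getPath().getName()));
>             }
>           }
>         }
>       }
>     } finally {
>       yarn.stop();
>     }
>   }
> }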
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)