[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated MAPREDUCE-5641:
-------------------------------------

    Description: Currently, the JHS has no information about jobs whose AMs 
have failed.  This is because the AM writes the History to the intermediate 
directory just before finishing, so when the AM fails for any reason, the 
History is never copied there.  However, it is not lost, as it is still in the 
AM's staging directory.  To make the History available in the JHS, all we need 
is another mechanism to move the History from the staging directory 
to the intermediate directory.  The AM also writes a "Summary" file before 
exiting normally, which is also unavailable when the AM fails.

I propose we solve this issue by doing the following:
The Resource Manager is aware when the AM fails; when an AM fails, the RM can 
write a flag file to a new "fail" directory. The JHS periodically scans the 
"fail" directory for these flag files. When it sees one, it looks for the 
History for that failed AM; if found, it copies or moves the History to the 
intermediate directory, where the JHS will process it normally. If not 
found, it does nothing. Once done, the JHS can then delete the flag file.
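
The scan described above could look roughly like the following. This is only a sketch under stated assumptions, not the actual JHS code: it uses Python and the local filesystem as a stand-in for HDFS, and the function name, the convention that a flag file is named after its job, and the staging layout (`<staging>/<job_id>/<job_id>.jhist`) are all hypothetical. It also assumes the flag is deleted whether or not a History file was found.

```python
import shutil
from pathlib import Path
from typing import List


def process_fail_flags(fail_dir: Path, staging_dir: Path,
                       intermediate_dir: Path) -> List[str]:
    """Run one scan pass over the "fail" directory.

    Returns the ids of the jobs whose History was moved.
    """
    moved = []
    intermediate_dir.mkdir(parents=True, exist_ok=True)
    for flag in sorted(fail_dir.iterdir()):
        if not flag.is_file():
            continue
        job_id = flag.name  # assumption: the flag file is named after the job
        # Assumption: hypothetical staging layout, one directory per job.
        history = staging_dir / job_id / f"{job_id}.jhist"
        if history.is_file():
            # Move the History to where the JHS already looks for it.
            shutil.move(str(history), str(intermediate_dir / history.name))
            moved.append(job_id)
        # Found or not, this flag has been handled, so remove it.
        flag.unlink()
    return moved
```

A periodic task in the JHS would call something like this once per scan interval. Because the History is moved before the flag is deleted, a crash between the two steps leaves the flag in place, and the next scan simply finds no History for it.
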
For the Summary file, most of it is static, so we can simply have the AM write 
that file out at startup (with 0 or "N/A" for dynamic fields) and then 
overwrite it at shutdown with the actual values for the dynamic fields, as it 
does now. If the AM fails, then the JHS will at least be able to pick up the 
first version of the Summary file.
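
The two-phase Summary write can be illustrated with a small sketch. The field names, the `key=value` line format, and the `write_summary` helper are assumptions made for illustration only; they are not taken from the AM's actual Summary format.

```python
from pathlib import Path
from typing import Dict, Optional

# Hypothetical dynamic fields and their startup placeholders.
DYNAMIC_DEFAULTS = {"finishTime": "0", "status": "N/A"}


def write_summary(path: Path, static: Dict[str, str],
                  dynamic: Optional[Dict[str, str]] = None) -> str:
    """Write the Summary file, overwriting any earlier version.

    At startup, call with dynamic=None so placeholders are written.
    At shutdown, call again with the real dynamic values.
    """
    fields = dict(static)
    fields.update(DYNAMIC_DEFAULTS)
    if dynamic:
        fields.update(dynamic)  # real values replace the placeholders
    line = ",".join(f"{key}={value}" for key, value in fields.items())
    path.write_text(line + "\n")
    return line
```

If the AM dies between the two calls, the file from the first call is what remains, so the JHS still sees the static fields with placeholder values for the rest.
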

> History for failed Application Masters should be made available to the Job 
> History Server
> -----------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5641
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5641
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: applicationmaster, jobhistoryserver
>    Affects Versions: 2.2.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>
> Currently, the JHS has no information about jobs whose AMs have failed.  This 
> is because the History is written by the AM to the intermediate folder just 
> before finishing, so when it fails for any reason, this information isn't 
> copied there.  However, it is not lost as it's in the AM's staging directory.  
> To make the History available in the JHS, all we need to do is have another 
> mechanism to move the History from the staging directory to the intermediate 
> directory.  The AM also writes a "Summary" file before exiting normally, 
> which is also unavailable when the AM fails.  



--
This message was sent by Atlassian JIRA
(v6.1#6144)
