[
https://issues.apache.org/jira/browse/MAPREDUCE-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Kanter updated MAPREDUCE-5641:
-------------------------------------
Description: Currently, the JHS has no information about jobs whose AMs
have failed. This is because the History is written by the AM to the
intermediate folder just before finishing, so when it fails for any reason,
this information isn't copied there. However, it is not lost, as it's in the
AM's staging directory. To make the History available in the JHS, all we need
to do is have another mechanism to move the History from the staging directory
to the intermediate directory. The AM also writes a "Summary" file before
exiting normally, which is also unavailable when the AM fails.
I propose we solve this issue by doing the following:
The Resource Manager is aware when the AM fails; when an AM fails, the RM can
write a flag file to a new "fail" directory. The JHS periodically scans the
"fail" dir for these flag files. When it sees one, it then looks for the
History for that failed AM; if found, it copies/moves the History to the
intermediate directory, where it will be processed by the JHS normally. If not
found, it does nothing. Once done, the JHS can then delete the flag file.
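
The scan described above can be sketched roughly as follows. This is only an illustration, not proposed patch code: it uses java.nio on a local filesystem instead of the Hadoop FileSystem API, and the class name, the method name, the "flag file is named after the application ID" convention, and the staging layout (stagingDir/<appId>/history.jhist) are all assumptions made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the JHS-side scan; none of these names exist in Hadoop. */
final class FailedAmHistoryScanner {

    /**
     * Runs one pass over failDir. Each flag file is assumed to be named after
     * the application ID, and that job's History is assumed to live at
     * stagingDir/<appId>/history.jhist (an invented layout for this example).
     *
     * @return the number of History files moved into intermediateDir
     */
    static int scanOnce(Path failDir, Path stagingDir, Path intermediateDir)
            throws IOException {
        // Snapshot the flag files first so we do not delete while listing.
        List<Path> flags;
        try (Stream<Path> listing = Files.list(failDir)) {
            flags = listing.collect(Collectors.toList());
        }

        int moved = 0;
        for (Path flag : flags) {
            String appId = flag.getFileName().toString();
            Path history = stagingDir.resolve(appId).resolve("history.jhist");
            if (Files.isRegularFile(history)) {
                // History found: move it to where the JHS normally picks it up.
                Files.createDirectories(intermediateDir);
                Files.move(history, intermediateDir.resolve(appId + ".jhist"),
                        StandardCopyOption.REPLACE_EXISTING);
                moved++;
            }
            // Found or not, this flag has been handled, so delete it.
            Files.delete(flag);
        }
        return moved;
    }
}
```

In this sketch the JHS would call scanOnce(failDir, stagingDir, intermediateDir) from whatever periodic thread it already uses for scanning. Deleting the flag only after the move means a JHS crash mid-pass leaves the flag in place to be retried on the next scan.
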
For the Summary file, most of it is static, so we can simply have the AM write
that file out at startup (with 0 or "N/A" for dynamic fields) and then
overwrite it at shutdown to get the values for the dynamic fields as it does
now. If the AM fails, then the JHS will at least be able to pick up the first
version of the Summary file.
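
A minimal sketch of the two-phase Summary write, again only as an illustration: the class, the method names, the one-line key=value format, and the choice of which fields are static (jobId, user) and which are dynamic (finishTime, status) are assumptions for the example, not the actual Summary file layout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of writing the Summary twice; names are illustrative. */
final class SummaryFileSketch {

    /** Renders one summary line; the field set here is invented for the example. */
    static String render(String jobId, String user, long finishTime, String status) {
        return "jobId=" + jobId + ",user=" + user
                + ",finishTime=" + finishTime + ",status=" + status + "\n";
    }

    /** At AM startup: static fields are real, dynamic fields are 0 or "N/A". */
    static void writeAtStartup(Path file, String jobId, String user)
            throws IOException {
        write(file, render(jobId, user, 0L, "N/A"));
    }

    /** At normal AM shutdown: overwrite with the final dynamic values. */
    static void overwriteAtShutdown(Path file, String jobId, String user,
                                    long finishTime, String status)
            throws IOException {
        write(file, render(jobId, user, finishTime, status));
    }

    private static void write(Path file, String content) throws IOException {
        // Files.write creates the file or truncates an existing one by default.
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    }
}
```

If the AM dies between the two calls, the file left behind is the startup version, so a reader sees finishTime=0 and status=N/A rather than no Summary at all.
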
> History for failed Application Masters should be made available to the Job
> History Server
> -----------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-5641
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5641
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: applicationmaster, jobhistoryserver
> Affects Versions: 2.2.0
> Reporter: Robert Kanter
> Assignee: Robert Kanter
>
> Currently, the JHS has no information about jobs whose AMs have failed. This
> is because the History is written by the AM to the intermediate folder just
> before finishing, so when it fails for any reason, this information isn't
> copied there. However, it is not lost, as it's in the AM's staging directory.
> To make the History available in the JHS, all we need to do is have another
> mechanism to move the History from the staging directory to the intermediate
> directory. The AM also writes a "Summary" file before exiting normally,
> which is also unavailable when the AM fails.
--
This message was sent by Atlassian JIRA
(v6.1#6144)