[
https://issues.apache.org/jira/browse/MAPREDUCE-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488741#comment-13488741
]
Jason Lowe commented on MAPREDUCE-4729:
---------------------------------------
I tried testing the patch with a sleep job using
-Dyarn.app.mapreduce.am.job.recovery.enable=false and manually killing the
ApplicationMaster with a kill -9, but it didn't work. The log showed this
exception:
{noformat}
2012-11-01 14:37:01,543 WARN [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Could not parse the old history file. Will not have old AMinfos
java.io.IOException: Incompatible event log version: null
    at org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:70)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.readJustAMInfos(MRAppMaster.java:915)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.start(MRAppMaster.java:846)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1143)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1378)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1139)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1098)
{noformat}
It looks like the AM is buffering the history file output, and the AMInfos from
previous runs had not been flushed when the AM was killed. When I used a normal
kill instead of kill -9, it worked. We will want to flush/sync the job history
file after writing the AMInfos to help guard against unclean teardowns losing
prior AM attempts in the history. This can be fixed in a separate JIRA if we
don't want to fix it here.
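To illustrate the suspected failure mode, here is a minimal sketch using plain java.io rather than the actual history writer. The class and method names (BufferedHistoryDemo, bytesVisibleAfterWrite) and the record text are made up for illustration, and it assumes the real writer sits behind a buffer in a similar way. On HDFS the corresponding fix would presumably be a flush/sync call on the output stream instead of a plain flush().

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the patch or of Hadoop.
final class BufferedHistoryDemo {

    // Writes one small record through a buffer and reports how many bytes
    // have reached the underlying sink. The stream is deliberately never
    // closed, which stands in for a kill -9 that skips any shutdown logic.
    static int bytesVisibleAfterWrite(boolean flushAfterWrite) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream out = new BufferedOutputStream(sink, 8192);

        // Made-up record standing in for a serialized AM-started event.
        out.write("AM_STARTED attempt=1\n".getBytes(StandardCharsets.UTF_8));

        if (flushAfterWrite) {
            // Push the buffered bytes down to the sink right away.
            out.flush();
        }
        return sink.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("visible without flush: " + bytesVisibleAfterWrite(false));
        System.out.println("visible with flush:    " + bytesVisibleAfterWrite(true));
    }
}
```

Without the flush, the record is still sitting in the buffer when the process dies, so a later reader sees an empty or truncated file. That would be consistent with the "Incompatible event log version: null" error above.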
A couple of other comments on the patch:
* Application attempts start from 1 instead of 0, so the first attempt tries to
recover AMInfos when it shouldn't, which leads to a large FileNotFoundException
stacktrace being logged.
* Nit: In RecoveryService.parse there's an extra space logged before a comma.
{{LOG.info("Got an error parsing job-history file "}} should be
{{LOG.info("Got an error parsing job-history file"}}
* Nit: The body of the while loop in readJustAMInfos could be a bit cleaner
with fewer conditionals. For example:
{code}
while ((event = jobHistoryEventReader.getNextEvent()) != null) {
  if (event.getEventType() == EventType.AM_STARTED) {
    amStartedEventsBegan = true;
    AMStartedEvent amStartedEvent = (AMStartedEvent) event;
    amInfos.add(MRBuilderUtils.newAMInfo(
        amStartedEvent.getAppAttemptId(), amStartedEvent.getStartTime(),
        amStartedEvent.getContainerId(),
        StringInterner.weakIntern(amStartedEvent.getNodeManagerHost()),
        amStartedEvent.getNodeManagerPort(),
        amStartedEvent.getNodeManagerHttpPort()));
  } else if (amStartedEventsBegan) {
    // This means AMStartedEvents began and this event is a
    // non-AMStarted event.
    // No need to continue reading all the other events.
    break;
  }
}
{code}
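On the first bullet above: since attempt IDs start at 1, a guard along these lines would skip the AMInfo recovery on the first attempt. This is only a sketch. AttemptGuardDemo, FIRST_ATTEMPT_ID and shouldReadPreviousAMInfos are hypothetical names and do not come from the patch.

```java
// Hypothetical helper; names are illustrative and not taken from the patch.
final class AttemptGuardDemo {

    // Application attempt IDs start from 1, not 0.
    static final int FIRST_ATTEMPT_ID = 1;

    // Only later attempts have a previous history file worth reading.
    static boolean shouldReadPreviousAMInfos(int attemptId) {
        return attemptId > FIRST_ATTEMPT_ID;
    }

    public static void main(String[] args) {
        for (int attemptId = 1; attemptId <= 3; attemptId++) {
            System.out.println("attempt " + attemptId + " reads old AMInfos: "
                + shouldReadPreviousAMInfos(attemptId));
        }
    }
}
```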
> job history UI not showing all job attempts
> -------------------------------------------
>
> Key: MAPREDUCE-4729
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4729
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Affects Versions: 0.23.3
> Reporter: Thomas Graves
> Assignee: Vinod Kumar Vavilapalli
> Attachments: MAPREDUCE-4729-20121031.txt
>
>
> We are seeing a case where a job runs but the AM is running out of memory in
> the first 3 attempts. The job eventually finishes on the 4th attempt. When
> you go to the job history UI for that job, it only shows the last attempt.
> This is bad since we want to see why the first 3 attempts failed.
> The RM web ui shows all 4 attempts.
> Also I tested this locally by running "kill" on the app master and in that
> case the history server UI does show all attempts.