[
https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201414#comment-13201414
]
Robert Joseph Evans commented on MAPREDUCE-3802:
------------------------------------------------
I have been comparing the jhist files from the first kill and the second one,
and I cannot find much difference as far as the content is concerned. So I
think the problem might have something to do with the order of the events.
In the second jhist file there are two AMStarted events. I assume that is so
the history server can show how many AMs have been started.
The mapFinishTime of all the new MAP_ATTEMPT_FINISH_EVENTS is 0. The finishTime
of these events is the same for all of them, which I think is the time the
attempt was recovered, not the original event finish time. The state of the
event changed from "map" (which seems like a bug) to "SUCCEEDED", which looks
more correct to me. And finally, the clockSplits counters are all 0 in the new
one as well.
Similarly, the TASK_FINISH_EVENTS have the finish time of the recovery, not the
actual finish time.
I could not find anything else that is significantly different.
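For anyone who wants to repeat the comparison, here is a rough sketch of the
checks described above. It assumes the jhist file stores one JSON event per
line, shaped like {"type": ..., "event": {<record name>: {...fields...}}}. The
event type strings (AM_STARTED, MAP_ATTEMPT_FINISHED) and the attemptId field
name are my assumptions about the file layout, and summarize() and
load_events() are hypothetical helper names; only mapFinishTime and finishTime
are taken from the observations above.

```python
import json
from collections import Counter


def load_events(lines):
    """Yield (event_type, fields) for each line that looks like an event record."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip the version header and blank lines
        try:
            record = json.loads(line)
        except ValueError:
            continue  # not valid JSON, ignore
        if not isinstance(record, dict) or "type" not in record or "event" not in record:
            continue  # e.g. the schema line, which has no "event" key
        body = record["event"]
        # Assumption: the event body is a single-key dict named after the record type.
        fields = next(iter(body.values())) if isinstance(body, dict) and body else {}
        if isinstance(fields, dict):
            yield record["type"], fields


def summarize(lines):
    """Count AM starts and flag the map attempt anomalies seen after recovery."""
    am_started = 0
    zero_map_finish = []       # attempts whose mapFinishTime is 0
    finish_times = Counter()   # how many attempts share each finishTime
    for event_type, fields in load_events(lines):
        if event_type == "AM_STARTED":  # assumed type string
            am_started += 1
        elif event_type == "MAP_ATTEMPT_FINISHED":  # assumed type string
            finish_times[fields.get("finishTime")] += 1
            if fields.get("mapFinishTime") == 0:
                zero_map_finish.append(fields.get("attemptId"))
    return {
        "am_started": am_started,
        "zero_map_finish": zero_map_finish,
        "finish_times": dict(finish_times),
    }


if __name__ == "__main__":
    # Synthetic lines for illustration only, not taken from a real jhist file.
    sample = [
        "Avro-Json",
        '{"type":"AM_STARTED","event":{"AMStarted":{"startTime":1}}}',
        '{"type":"AM_STARTED","event":{"AMStarted":{"startTime":2}}}',
        '{"type":"MAP_ATTEMPT_FINISHED","event":{"MapAttemptFinished":'
        '{"attemptId":"a1","mapFinishTime":0,"finishTime":500}}}',
    ]
    print(summarize(sample))
```

Running summarize() over both jhist files and comparing the results should
show whether the recovered attempts all share one finishTime and have a
mapFinishTime of 0. It says nothing about event ordering, which is what I
suspect is the real problem.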
> If an MR AM dies twice it looks like the process freezes
> ---------------------------------------------------------
>
> Key: MAPREDUCE-3802
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: applicationmaster, mrv2
> Affects Versions: 0.23.1, 0.24.0
> Reporter: Robert Joseph Evans
> Priority: Critical
> Attachments: syslog
>
>
> It looks like recovering from an MR AM dying works very well on a single
> failure. But if it fails multiple times we appear to get into a livelock
> situation.
> {noformat}
> yarn jar
> hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*-SNAPSHOT.jar
> wordcount -Dyarn.app.mapreduce.am.log.level=DEBUG -Dmapreduce.job.reduces=30
> input output
> 12/02/03 21:06:57 WARN conf.Configuration: fs.default.name is deprecated.
> Instead, use fs.defaultFS
> 12/02/03 21:06:57 WARN conf.Configuration: mapred.used.genericoptionsparser
> is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
> 12/02/03 21:06:57 INFO input.FileInputFormat: Total input paths to process :
> 17
> 12/02/03 21:06:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 12/02/03 21:06:57 WARN snappy.LoadSnappy: Snappy native library not loaded
> 12/02/03 21:06:57 INFO mapreduce.JobSubmitter: number of splits:17
> 12/02/03 21:06:57 INFO mapred.ResourceMgrDelegate: Submitted application
> application_1328302034486_0003 to ResourceManager at HOST/IP:8040
> 12/02/03 21:06:57 INFO mapreduce.Job: The url to track the job:
> http://HOST:8088/proxy/application_1328302034486_0003/
> 12/02/03 21:06:57 INFO mapreduce.Job: Running job: job_1328302034486_0003
> 12/02/03 21:07:03 INFO mapreduce.Job: Job job_1328302034486_0003 running in
> uber mode : false
> 12/02/03 21:07:03 INFO mapreduce.Job: map 0% reduce 0%
> 12/02/03 21:07:09 INFO mapreduce.Job: map 5% reduce 0%
> 12/02/03 21:07:10 INFO mapreduce.Job: map 17% reduce 0%
> #KILLED AM with kill -9 here
> 12/02/03 21:07:16 INFO mapreduce.Job: map 29% reduce 0%
> 12/02/03 21:07:17 INFO mapreduce.Job: map 35% reduce 0%
> 12/02/03 21:07:30 INFO mapreduce.Job: map 52% reduce 0%
> 12/02/03 21:07:35 INFO mapreduce.Job: map 58% reduce 0%
> 12/02/03 21:07:37 INFO mapreduce.Job: map 70% reduce 0%
> 12/02/03 21:07:41 INFO mapreduce.Job: map 76% reduce 0%
> 12/02/03 21:07:43 INFO mapreduce.Job: map 82% reduce 0%
> 12/02/03 21:07:44 INFO mapreduce.Job: map 88% reduce 0%
> 12/02/03 21:07:47 INFO mapreduce.Job: map 94% reduce 0%
> 12/02/03 21:07:49 INFO mapreduce.Job: map 100% reduce 0%
> 12/02/03 21:07:53 INFO mapreduce.Job: map 100% reduce 3%
> 12/02/03 21:08:00 INFO mapreduce.Job: map 100% reduce 6%
> 12/02/03 21:08:06 INFO mapreduce.Job: map 100% reduce 10%
> 12/02/03 21:08:12 INFO mapreduce.Job: map 100% reduce 13%
> 12/02/03 21:08:18 INFO mapreduce.Job: map 100% reduce 16%
> #killed AM with kill -9 here
> 12/02/03 21:08:20 INFO ipc.Client: Retrying connect to server: HOST/IP:44223.
> Already tried 0 time(s).
> 12/02/03 21:08:21 INFO ipc.Client: Retrying connect to server: HOST/IP:44223.
> Already tried 1 time(s).
> 12/02/03 21:08:22 INFO ipc.Client: Retrying connect to server: HOST/IP:44223.
> Already tried 2 time(s).
> 12/02/03 21:08:26 INFO mapreduce.Job: map 64% reduce 16%
> #It never makes any more progress...
> {noformat}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira