[ http://issues.apache.org/jira/browse/HADOOP-750?page=all ]
Doug Cutting updated HADOOP-750:
--------------------------------
Status: Resolved (was: Patch Available)
Resolution: Fixed
I just fixed this. Thanks, Owen!
> race condition on stalled map output fetches
> --------------------------------------------
>
> Key: HADOOP-750
> URL: http://issues.apache.org/jira/browse/HADOOP-750
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.8.0
> Reporter: Owen O'Malley
> Assigned To: Owen O'Malley
> Fix For: 0.9.0
>
> Attachments: fetch-no-lease.patch
>
>
> I've seen reduces getting killed because of a race condition in the
> ReduceTaskRunner. In the logs it looks like:
> 2006-11-27 08:40:44,795 WARN org.apache.hadoop.mapred.TaskRunner: Map output copy stalled on http://kry2296.inktomisearch.com:7030/mapOutput?map=task_0001_m_015626_0
> ...
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Need 52 map output(s)
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Got 39 known map output location(s); scheduling...
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Scheduled 0 of 39 known outputs (0 slow hosts and 39 dup hosts)
> ...
> 2006-11-27 09:16:47,071 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0 0.3328575% reduce > copy (28679 of 28720 at 0.76 MB/s) >
> ...
> 2006-11-27 09:16:47,338 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 done copying task_0001_m_015462_0 output from node1
> ...
> 2006-11-27 09:36:51,398 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0: Task failed to report status for 1204 seconds. Killing.
>
> Basically, the handling of the stall has a race condition that leaves the
> fetcher in a bad state. At the end of the fetch, all of the tasks finish but
> their results never get handled. When the thread times out, all of the map
> output copiers are waiting for things to fetch and the prepare thread is
> waiting for results.
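
The description above says that copy results can go unhandled, leaving the copiers idle and the prepare thread waiting forever. The following is only a minimal sketch of one way to avoid that kind of hang, not the contents of fetch-no-lease.patch (which are not shown here) and not the actual ReduceTaskRunner code. It assumes a design where every scheduled copy reports exactly one result, including stalled copies, and where the consumer counts in-flight copies. The names StallSafeFetch, CopyResult, and fetchAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these types exist in Hadoop.
final class StallSafeFetch {

    /** Outcome of one copy attempt. */
    static final class CopyResult {
        final String mapId;
        final boolean success;

        CopyResult(String mapId, boolean success) {
            this.mapId = mapId;
            this.success = success;
        }
    }

    /**
     * Schedules one copy per map id and waits until every copy has reported.
     * Ids in stalledIds simulate stalled fetches. Returns the sorted list of
     * ids that failed and would need to be rescheduled.
     */
    static List<String> fetchAll(List<String> mapIds, Set<String> stalledIds) {
        BlockingQueue<CopyResult> results = new LinkedBlockingQueue<>();
        ExecutorService copiers = Executors.newFixedThreadPool(2);
        List<String> retry = new ArrayList<>();
        try {
            int inFlight = 0;
            for (String id : mapIds) {
                copiers.execute(() -> {
                    // Stand-in for the real HTTP copy of a map output.
                    boolean ok = !stalledIds.contains(id);
                    // Always report, even for a stall, so the consumer's
                    // in-flight count can return to zero.
                    results.add(new CopyResult(id, ok));
                });
                inFlight++;
            }
            while (inFlight > 0) {
                // Bounded wait: a missing result surfaces as an error
                // instead of a silent hang.
                CopyResult r = results.poll(5, TimeUnit.SECONDS);
                if (r == null) {
                    throw new IllegalStateException("lost a copy result");
                }
                inFlight--;
                if (!r.success) {
                    retry.add(r.mapId);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            copiers.shutdown();
        }
        Collections.sort(retry);
        return retry;
    }

    public static void main(String[] args) {
        Set<String> stalled = new HashSet<>(Arrays.asList("m2"));
        System.out.println(fetchAll(Arrays.asList("m1", "m2", "m3"), stalled));
    }
}
```

The point of the sketch is the invariant, not the mechanism: if a stalled copy is dropped without producing a result, the consumer waits on a count that can never reach zero, which resembles the hang in the log above.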