[jira] Created: (HADOOP-750) race condition on stalled map output fetches

Owen O'Malley (JIRA) Mon, 27 Nov 2006 16:30:42 -0800

race condition on stalled map output fetches
--------------------------------------------


                 Key: HADOOP-750
                 URL: http://issues.apache.org/jira/browse/HADOOP-750
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.8.0
            Reporter: Owen O'Malley
         Assigned To: Owen O'Malley
             Fix For: 0.9.0


I've seen reduces getting killed because of a race condition in the 
ReduceTaskRunner.  In the logs it looks like:

2006-11-27 08:40:44,795 WARN org.apache.hadoop.mapred.TaskRunner: Map output 
copy stalled on 
http://kry2296.inktomisearch.com:7030/mapOutput?map=task_0001_m_015626_0
...
2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: 
task_0001_r_000658_0 Need 52 map output(s)
2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: 
task_0001_r_000658_0 Got 39 known map output location(s); scheduling...
2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: 
task_0001_r_000658_0 Scheduled 0 of 39 known outputs (0 slow hosts and 39 dup 
hosts)
...
2006-11-27 09:16:47,071 INFO org.apache.hadoop.mapred.TaskTracker: 
task_0001_r_000658_0 0.3328575% reduce > copy (28679 of 28720 at 0.76 MB/s) >
...
2006-11-27 09:16:47,338 INFO org.apache.hadoop.mapred.TaskRunner: 
task_0001_r_000658_0 done copying task_0001_m_015462_0 output from node1
...
2006-11-27 09:36:51,398 INFO org.apache.hadoop.mapred.TaskTracker: 
task_0001_r_000658_0: Task failed to report status for 1204 seconds. Killing.

The stall handling has a race condition that leaves the fetcher in a bad 
state. Near the end of the fetch, the outstanding copies all finish, but their 
results never get handled. By the time the task is killed for failing to 
report status, all of the map output copiers are waiting for something to 
fetch and the prepare thread is waiting for results.
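
To make the failure mode concrete, here is a minimal sketch of the bookkeeping 
that avoids this kind of hang. It is not the ReduceTaskRunner code: 
FetchSketch, Result, fetchAll, submit and the stallOnce set are all 
hypothetical names, and the "stall" is simulated. The idea it illustrates is 
that every scheduled copy, stalled or not, puts exactly one result on the 
queue, and the collecting thread polls with a timeout so it never blocks 
forever on results that will not arrive.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from the Hadoop source.
final class FetchSketch {
    // Outcome of one copy attempt.
    static final class Result {
        final int mapId;
        final boolean ok;
        Result(int mapId, boolean ok) { this.mapId = mapId; this.ok = ok; }
    }

    // Copies numMaps outputs; ids in stallOnce fail on their first attempt
    // (a simulated stall). Returns the number of outputs copied.
    static int fetchAll(int numMaps, Set<Integer> stallOnce) {
        BlockingQueue<Result> results = new LinkedBlockingQueue<>();
        Set<Integer> pendingStalls = ConcurrentHashMap.newKeySet();
        pendingStalls.addAll(stallOnce);
        ExecutorService copiers = Executors.newFixedThreadPool(3);
        int inFlight = 0;
        int copied = 0;
        try {
            for (int id = 0; id < numMaps; id++) {
                submit(copiers, results, pendingStalls, id);
                inFlight++;
            }
            // Each submitted copy yields exactly one Result, so inFlight
            // reaches zero only after every outcome has been handled.
            while (inFlight > 0) {
                Result r = results.poll(100, TimeUnit.MILLISECONDS);
                if (r == null) {
                    continue; // nothing yet; a real task would report status here
                }
                inFlight--;
                if (r.ok) {
                    copied++;
                } else {
                    // Stalled copy: reschedule it instead of dropping it.
                    submit(copiers, results, pendingStalls, r.mapId);
                    inFlight++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            copiers.shutdownNow();
        }
        return copied;
    }

    // Queues one copy attempt; the outcome is always added to results.
    static void submit(ExecutorService copiers, BlockingQueue<Result> results,
                       Set<Integer> pendingStalls, int mapId) {
        copiers.execute(() -> {
            // remove() returns true only once per id, so each id stalls
            // at most one time.
            boolean stalled = pendingStalls.remove(mapId);
            results.add(new Result(mapId, !stalled));
        });
    }
}
```

With five map outputs of which two stall once, fetchAll returns 5, because 
the two failed results are still counted and rescheduled. The hang described 
above corresponds to the opposite case, where a finished or stalled copy 
never reaches the loop that is waiting on it.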
