Reduce tasks get stuck because of over-estimated task size (regression from 0.18)
---------------------------------------------------------------------------------
Key: HADOOP-5241
URL: https://issues.apache.org/jira/browse/HADOOP-5241
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.19.0
Environment: Red Hat Enterprise Linux Server release 5.2
JDK 1.6.0_11
Hadoop 0.19.0
Reporter: Andy Pavlo
Priority: Blocker
I have a simple MR benchmark job that computes PageRank on about 600 GB of HTML
files using a 100-node cluster. For some reason, my reduce tasks get stuck in a
pending state. The JobTracker's log fills up with the following messages:
2009-02-12 15:47:29,839 WARN org.apache.hadoop.mapred.JobInProgress: No room
for reduce task. Node tracker_d-59.cs.wisc.edu:localhost/127.0.0.1:33227 has
110125027328 bytes free; but we expect reduce input to take 399642198235
2009-02-12 15:47:29,852 WARN org.apache.hadoop.mapred.JobInProgress: No room
for reduce task. Node tracker_d-67.cs.wisc.edu:localhost/127.0.0.1:48626 has
107537776640 bytes free; but we expect reduce input to take 399642198235
2009-02-12 15:47:29,885 WARN org.apache.hadoop.mapred.JobInProgress: No room
for reduce task. Node tracker_d-73.cs.wisc.edu:localhost/127.0.0.1:58849 has
113631690752 bytes free; but we expect reduce input to take 399642198235
<SNIP>
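
For context, the warning above comes from the JobTracker's per-node disk check: before handing out a reduce task it compares the tracker's reported free space against an estimate of the reduce input size, extrapolated from the map output recorded so far. The sketch below only illustrates that kind of check; it is not the actual 0.19 source, and the class and method names (ReduceInputEstimator, recordCompletedMap, hasRoomForReduce) are made up for the example.

/**
 * Illustrative sketch only -- not the actual Hadoop 0.19 code.
 * Models the JobTracker-side check that produces the
 * "No room for reduce task" warning: the total reduce input is
 * extrapolated from the map output seen so far and then compared
 * against the tracker's reported free disk space.
 */
public class ReduceInputEstimator {

    private final int totalMaps;          // total map tasks in the job
    private final int numReduces;         // configured number of reducers
    private long completedMaps = 0;       // maps whose output size has been recorded
    private long completedMapsOutput = 0; // sum of recorded map output bytes

    public ReduceInputEstimator(int totalMaps, int numReduces) {
        this.totalMaps = totalMaps;
        this.numReduces = numReduces;
    }

    /**
     * Record the output size of a finished map. If the same map's
     * completion is counted more than once (e.g. duplicate status
     * updates), completedMapsOutput is inflated and the estimate
     * below grows far beyond the real input -- the kind of
     * over-estimation this report describes.
     */
    public void recordCompletedMap(long mapOutputBytes) {
        completedMaps++;
        completedMapsOutput += mapOutputBytes;
    }

    /** Extrapolate total map output from the maps finished so far. */
    public long estimatedTotalMapOutput() {
        if (completedMaps == 0) {
            return 0;
        }
        return (completedMapsOutput / completedMaps) * totalMaps;
    }

    /** Rough estimate of the input a single reduce task will pull. */
    public long estimatedReduceInputSize() {
        return numReduces == 0 ? 0 : estimatedTotalMapOutput() / numReduces;
    }

    /**
     * The scheduling gate: refuse the reduce if the tracker's free
     * space is smaller than the estimate. An inflated estimate means
     * every tracker fails this test and the reduces stay pending.
     */
    public boolean hasRoomForReduce(long trackerFreeBytes) {
        return trackerFreeBytes >= estimatedReduceInputSize();
    }
}

With the numbers in the log above, the estimated reduce input (~400 GB) is larger than the ~100-110 GB reported free on every tracker, so no node ever passes the check and the remaining reduces sit in pending indefinitely.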
The weird thing is that about 70 reduce tasks complete before it hangs. If I
reduce the amount of input data on 100 nodes down to 200 GB, then it seems to
work. If I scale the amount of input with the number of nodes, it works some of
the time on 50 nodes and without any problems on 25 nodes or fewer.
Note that it worked without any problems on Hadoop 0.18 late last year, without
changing any of the input data or the actual MR code.