Hi George, This is interesting. Does the configuration of the 8 node cluster match the configuration of the 4 or 16 node cluster? There are some configs that can directly affect the shuffle phase of reduce. They would most likely be set in $HADOOPDIR/conf/hadoop-default.xml.
The parameters to check out, in part, would be io.sort.factor, mapred.inmem.merge.threshold, mapred.job.shuffle.merge.percent, mapred.job.shuffle.input.buffer.percent, mapred.job.reduce.input.buffer.percent. Check out the "shuffle/reduce parameters section" on http://hadoop.apache.org/common/docs/current/mapred_tutorial.html if that does seem to be the problem. Chad -----Original Message----- From: George Porter [mailto:[email protected]] Sent: Monday, August 17, 2009 11:53 AM To: [email protected] Subject: Shuffle phase not starting until 100% of maps are done? Hi everyone, I've come across a problem running Map/Reduce on an EC2 cluster, and was wondering if anyone here had any thoughts to what the issue was. I'm running a simple 'sort' M/R job on 40GB from the examples JAR on Hadoop 19.0 (using the Hadoop 19.0 AMI for Amazon EC2 on Extra-large images). When I run the sort job on a 4 or 16 node cluster, things work fine, and I notice that the shuffle phase begins when approx 45-50% of the maps are completed. However, when I run the sort job on an 8-node cluster, the shuffle doesn't begin until 100% of the maps are done. This causes the 8 node cluster to run much slower than I would have thought. There are over 2000 map tasks, and 16 map slots across those 8 nodes, and so a lot of map tasks have finished before the shuffle starts. Any thoughts on what would be delaying the start of the shuffle phase? Thanks, George
