Hi guys, I am running a Fetch task on an EC2 cluster. The Map part is reasonably fast, but the Reduce is taking forever. I see no explicit Reducer specified for the Job, so I assume that the output of the reduce is simply copied to HDFS. Since all the DataNodes are on EC2, I imagine that the cost of duplicating the data is not too high.
I had a look at the EC2 instance doing the reduction: the CPU is at around 40 percent and there is no free RAM (most of it is used by the TaskTracker and DataNode). Any idea why it is so slow? Are there any parameters that could influence the performance?

Thanks for your help,
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
