On Tue, Oct 28, 2008 at 12:12 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:
> Hi guys,
>
> I am running a Fetch task on an EC2 cluster. The Map part is reasonably fast
> but the Reduce is taking forever. I see no explicit Reducer specified for
> the Job so I assume that the output of the reduce is simply copied to HDFS.
> Since all the DataNodes are on EC2 I imagine that the cost of duplicating
> the data is not too high.
>

Are you parsing during fetching? If you are, ParseOutputFormat runs during
the reduce, and that may be the slow part (without parsing, the fetch reduce
is just an identity reduce).

> I had a look at the EC2 instance doing the reduction : the CPU is at 40
> something percent and there is no RAM available (most of it being used by
> the TaskTracker and DataNode).
>
> Any idea of why it is so slow? Are there any parameters which could
> influence the performance?
>
> Thanks for your help
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>

--
Doğacan Güney
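
For reference, a minimal sketch of how parsing during the fetch could be switched off to test this. It assumes a Nutch release of this era, where the property is named fetcher.parse and is overridden in conf/nutch-site.xml; check nutch-default.xml in your own version before relying on the name or its default.

```xml
<!-- conf/nutch-site.xml (assumed location): overrides nutch-default.xml -->
<configuration>
  <property>
    <name>fetcher.parse</name>
    <!-- false: the fetcher only stores raw content, so the reduce
         no longer runs the parsing work in ParseOutputFormat -->
    <value>false</value>
  </property>
</configuration>
```

With parsing disabled in the fetcher, the segment would then need to be parsed as a separate job (in releases of this period, bin/nutch parse followed by the segment path), which also makes it easier to see how much of the reduce time was parsing.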
