To get some more details: could you please let us know how long the job
takes with 8 reduces, and with 4 reduces? The copy rate that you see is
not a good indicator of the actual transfer rate for small jobs. The rate
is simply #bytes-copied/t, where t = total time elapsed since the copy
began. In a small job, where there is little data to shuffle, the overhead
of scheduling the copies starts to dominate, and more so as the number of
reduces grows.
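
To illustrate why the reported rate can look low, here is a rough sketch
of the #bytes-copied/t arithmetic. It is not Hadoop code: the function
name, the fixed per-copy overhead, the assumption that a reduce fetches
its map outputs one after another, and all of the numbers are made up
for illustration.

```python
def apparent_copy_rate(shuffle_bytes, num_maps, num_reduces,
                       wire_rate, overhead_per_copy):
    """Apparent rate (#bytes-copied / t) seen by one reduce.

    Hypothetical model: each reduce fetches one segment from every map,
    one after another, paying a fixed scheduling overhead per fetch.
    """
    bytes_copied = shuffle_bytes / num_reduces      # share of one reduce
    transfer_time = bytes_copied / wire_rate        # time actually on the wire
    t = transfer_time + num_maps * overhead_per_copy
    return bytes_copied / t


# Made-up numbers, for illustration only.
WIRE_RATE = 10e6  # bytes/sec
for reduces in (4, 8):
    rate = apparent_copy_rate(1e9, 80, reduces, WIRE_RATE, 0.5)
    print(f"{reduces} reduces: {rate / 1e6:.2f} MB/s apparent vs "
          f"{WIRE_RATE / 1e6:.2f} MB/s on the wire")
```

Under these assumptions, doubling the reduces halves the bytes each one
copies while the per-copy overhead stays the same, so the apparent rate
drops even though the network is no slower.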

> -----Original Message-----
> From: Ross Boucher [mailto:[EMAIL PROTECTED] 
> Sent: Friday, September 21, 2007 1:33 PM
> To: [email protected]
> Subject: Re: Reduce Performance
> 
> The input data was about 5GB, the total map processing time 
> was about 10 minutes.  Then, there was 5 minutes of reduce 
> time on top of that spent moving the files around.
> 
> On Sep 21, 2007, at 12:20 PM, Doug Cutting wrote:
> 
> > Ross Boucher wrote:
> >> My cluster has 4 machines on it, so based on the recommendations on
> >> the wiki, I set my reduce count to 8.  Unfortunately, the performance
> >> was less than ideal.  Specifically, when the map functions had
> >> finished, I had to wait an additional 40% of the total job time just
> >> for copying/sorting the files.  I know for a fact that the sort is
> >> very fast, so the only remaining question is why moving the files
> >> around takes so long.
> >
> > How much data was there to copy?  How long was the total job time?
> > If there are only small amounts of data, and the total job time is
> > short, then copy scheduling overhead might be significant.
> >
> > Doug
> >
> >
> 
> 
