Just to get some details: could you please let us know how long the job takes with 8 reduces, and with 4 reduces? The copy rate that you see is not very indicative of the actual transfer rate for small jobs. The rate is simply #bytes-copied/t, where t = total time elapsed since the copy began. In a small job, where there is little data to shuffle, the overhead of scheduling the copies starts to dominate, more so when the number of reduces is larger.
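
To make the arithmetic concrete, here is a rough sketch with made-up numbers. It is not the actual Hadoop code: the function names, the fixed per-copy scheduling delay, the bandwidth figure, and the assumption that a reduce fetches its map outputs one after another are all hypothetical and only there to show the shape of the effect.

```python
MB = 1024 * 1024


def reported_rate(bytes_copied, elapsed_seconds):
    """Rate as described above: bytes copied / time elapsed since the copy began."""
    return bytes_copied / elapsed_seconds


def per_reduce_rate(total_bytes, num_maps, num_reduces,
                    bandwidth=10 * MB, overhead_per_copy=0.5):
    """Toy model of the rate one reduce would report.

    Assumptions (hypothetical, for illustration only):
      - map output is split evenly across the reduces
      - each reduce fetches one piece from every map, one at a time
      - every fetch pays a fixed scheduling delay before data moves
    """
    bytes_per_reduce = total_bytes / num_reduces
    scheduling_time = num_maps * overhead_per_copy
    transfer_time = bytes_per_reduce / bandwidth
    return reported_rate(bytes_per_reduce, scheduling_time + transfer_time)


if __name__ == "__main__":
    # Small job: 100 MB of map output from 40 maps (made-up figures).
    for reduces in (4, 8):
        rate = per_reduce_rate(100 * MB, num_maps=40, num_reduces=reduces)
        print(f"{reduces} reduces: {rate / MB:.2f} MB/s reported per reduce")
```

In this model the scheduling time per reduce stays the same while the data per reduce shrinks, so the reported rate drops as reduces are added, even though the wire speed has not changed.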

> -----Original Message-----
> From: Ross Boucher [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 21, 2007 1:33 PM
> To: [email protected]
> Subject: Re: Reduce Performance
>
> The input data was about 5GB, the total map processing time
> was about 10 minutes. Then, there was 5 minutes of reduce
> time on top of that spent moving the files around.
>
> On Sep 21, 2007, at 12:20 PM, Doug Cutting wrote:
>
> > Ross Boucher wrote:
> >> My cluster has 4 machines on it, so based on the recommendations on
> >> the wiki, I set my reduce count to 8. Unfortunately, the performance
> >> was less than ideal. Specifically, when the map functions had
> >> finished, I had to wait an additional 40% of the total job time just
> >> for copying/sorting the files. I know for a fact that the sort is
> >> very fast, so the only remaining question is why moving the files
> >> around takes so long.
> >
> > How much data was there to copy? How long was the total job time?
> > If there are only small amounts of data, and the total job time is
> > short, then copy scheduling overhead might be significant.
> >
> > Doug
