It is a bit of a cheat relative what you are doing, but you can avoid the
problem in counting situations by using a combiner function.  Combiners are
useful when your can do the reduce in pieces.  Counting is a great example
of this.  The practical effect is typically that the combiner results in
10-1000x decrease in size of the data being passed to the reduce so you get
better performance.

As I mentioned, this doesn't address your question about why things move
slowly.  It just avoids the problem.


On 9/21/07 12:01 PM, "Ross Boucher" <[EMAIL PROTECTED]> wrote:

> I've been running a program to count search terms in log files, which
> is basically a small modification of the wordcount program.  This
> doesn't have a reduce phase, so the only tasks for the reduce jobs to
> perform is sorting the output files of the map jobs.
> 
> My cluster has 4 machines on it, so based on the recommendations on
> the wiki, I set my reduce count to 8.  Unfortunately, the performance
> was less than ideal.  Specifically, when the map functions had
> finished, I had to wait an additional 40% of the total job time just
> for copying/sorting the files.  I know for a fact that the sort is
> very fast, so the only remaining question is why moving the files
> around takes so long.
> 
> Looking at the jobtracker webapp, I noticed that the reduce->copying
> phase listed under the job showed a transfer speed of 0.01MB/s, which
> is fairly slow.  The machines are connected on a gigabit switch, and
> uploading 5GB of files to the hdfs system (hadoop dfs copyFromLocal)
> only takes about a minute.
> 
> Finally, when I set the reduce count to the number of machines,
> performance is good, since one reduce task will start up right away,
> and the slow transfers will continue throughout the map phase, and be
> ready almost immediately at the end of the map phase.
> 
> If anyone has some suggestions on how I might be able to increase
> performance, or what might be going on in this scenario, I would
> appreciate the tips.  I'd be happy to provide some more details about
> the setup if needed, for the moment its more of a testing ground to
> see what my options are.
> 
> Thanks.
> 
> Ross Boucher
> [EMAIL PROTECTED]

Reply via email to