It is a bit of a cheat relative to what you are doing, but you can avoid the problem in counting situations by using a combiner function. Combiners are useful when you can do the reduce in pieces, and counting is a great example of this. The practical effect is typically that the combiner cuts the size of the data passed to the reduce by 10-1000x, so you get better performance.
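To make the idea concrete, here is a rough sketch in plain Python. It is not the Hadoop API: the function names and the sample log lines are made up for illustration, and it only simulates what happens to the record counts. Each map task emits one (term, 1) record per term, the combiner sums those records locally before anything leaves the map side, and the reduce sums whatever partial counts arrive.

```python
from collections import Counter
from itertools import chain

def map_task(lines):
    """Emit one (term, 1) record per term, as a wordcount-style mapper does."""
    return [(term, 1) for line in lines for term in line.split()]

def combine(records):
    """Sum counts locally on the map side; same logic as the reduce."""
    partial = Counter()
    for term, count in records:
        partial[term] += count
    return list(partial.items())

def reduce_all(records):
    """Final reduce: sum whatever counts arrive for each term."""
    total = Counter()
    for term, count in records:
        total[term] += count
    return dict(total)

# Hypothetical input: two map tasks, each with its own slice of a query log.
splits = [
    ["hadoop hadoop dfs", "hadoop dfs"],
    ["dfs hadoop", "hadoop hadoop"],
]

map_outputs = [map_task(s) for s in splits]

# Records that would be copied to the reduce side in each case.
without_combiner = list(chain.from_iterable(map_outputs))
with_combiner = list(chain.from_iterable(combine(m) for m in map_outputs))

# The final counts are identical either way; only the shuffle volume changes.
assert reduce_all(without_combiner) == reduce_all(with_combiner)
print(len(without_combiner), "records shuffled without a combiner")
print(len(with_combiner), "records shuffled with a combiner")
```

This works for counting because addition is associative and commutative, so summing partial sums gives the same answer as summing the raw 1s. In an actual Hadoop job the usual approach for a wordcount-style program is to register the reducer class as the combiner as well (the stock wordcount example does this with setCombinerClass on the JobConf), but check that against the version you are running.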
As I mentioned, this doesn't address your question about why things move slowly. It just avoids the problem.

On 9/21/07 12:01 PM, "Ross Boucher" <[EMAIL PROTECTED]> wrote:

> I've been running a program to count search terms in log files, which
> is basically a small modification of the wordcount program. This
> doesn't have a reduce phase, so the only tasks for the reduce jobs to
> perform is sorting the output files of the map jobs.
>
> My cluster has 4 machines on it, so based on the recommendations on
> the wiki, I set my reduce count to 8. Unfortunately, the performance
> was less than ideal. Specifically, when the map functions had
> finished, I had to wait an additional 40% of the total job time just
> for copying/sorting the files. I know for a fact that the sort is
> very fast, so the only remaining question is why moving the files
> around takes so long.
>
> Looking at the jobtracker webapp, I noticed that the reduce->copying
> phase listed under the job showed a transfer speed of 0.01MB/s, which
> is fairly slow. The machines are connected on a gigabit switch, and
> uploading 5GB of files to the hdfs system (hadoop dfs copyFromLocal)
> only takes about a minute.
>
> Finally, when I set the reduce count to the number of machines,
> performance is good, since one reduce task will start up right away,
> and the slow transfers will continue throughout the map phase, and be
> ready almost immediately at the end of the map phase.
>
> If anyone has some suggestions on how I might be able to increase
> performance, or what might be going on in this scenario, I would
> appreciate the tips. I'd be happy to provide some more details about
> the setup if needed, for the moment its more of a testing ground to
> see what my options are.
>
> Thanks.
>
> Ross Boucher
> [EMAIL PROTECTED]
