I've been running a program to count search terms in log files, which is basically a small modification of the wordcount program. The job doesn't do any real reduce work, so the only thing the reduce tasks have to do is sort the output files of the map tasks.
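In case it's useful, the map side is essentially wordcount with a filter. A rough sketch (the class name and the term-matching logic below are simplified stand-ins, not the actual code):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Essentially the wordcount mapper, except only tokens that look like
    // search terms get emitted.
    public class TermCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text term = new Text();

      public void map(LongWritable offset, Text logLine,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        for (String token : logLine.toString().split("\\s+")) {
          if (isSearchTerm(token)) {        // placeholder for the real matching logic
            term.set(token);
            output.collect(term, ONE);      // emit (term, 1), just like wordcount
          }
        }
      }

      private boolean isSearchTerm(String token) {
        // Stand-in predicate; the real job matches against the query field of the log line.
        return token.startsWith("q=");
      }
    }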

My cluster has 4 machines, so based on the recommendations on the wiki, I set the reduce count to 8. Unfortunately, the performance was less than ideal. Specifically, after the map tasks had finished, I had to wait an additional 40% of the total job time just for copying and sorting the files. I know for a fact that the sort is very fast, so the only remaining question is why moving the files around takes so long.
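The reduce count is set through the normal job configuration. Roughly (the driver is simplified, input/output paths are omitted, and the class names are the same stand-ins as above):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TermCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TermCountDriver.class);
        conf.setJobName("term-count");
        conf.setMapperClass(TermCountMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // 4 machines x 2, per the wiki's recommendation -- this is the slow case.
        // Dropping it to 4 (one per machine) is the fast case described below;
        // setting mapred.reduce.tasks in the config does the same thing.
        conf.setNumReduceTasks(8);

        // No reducer class is set, so the reduce tasks just write out the
        // sorted map output (identity reduce).

        // Input/output paths and formats omitted here for brevity.
        JobClient.runJob(conf);
      }
    }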

Looking at the jobtracker webapp, I noticed that the reduce copy phase listed under the job showed a transfer speed of 0.01 MB/s, which is fairly slow. The machines are connected to a gigabit switch, and uploading 5GB of files to HDFS (hadoop dfs -copyFromLocal) only takes about a minute.
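For comparison, 5GB in about a minute works out to roughly 85 MB/s, so the shuffle copies are running three to four orders of magnitude slower than what the same network and disks manage for a straight HDFS upload.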

Finally, when I set the reduce count to the number of machines, performance is good: a reduce task can start up on each machine right away, the slow transfers happen throughout the map phase, and the data is ready almost immediately once the maps finish.

If anyone has suggestions on how I might be able to increase performance, or on what might be going on in this scenario, I would appreciate the tips. I'd be happy to provide more details about the setup if needed; for the moment, it's more of a testing ground to see what my options are.

Thanks.

Ross Boucher
[EMAIL PROTECTED]
