On Dec 26, 2007, at 2:50 PM, jag123 wrote:
> A strange thing happens when I scale this problem:
> 1 million records, 4 map + 4 reduce ==> 30 seconds per map process
> 5 million records, 20 map + 20 reduce ==> 1 minute per map process
> 50 million records, 200 map + 200 reduce ==> 3 minutes per map process
> 500 million records, 2000 map + 2000 reduce ==> 45 minutes! per map process
Thanks for sharing your problem. With a 70-node cluster, I'd strongly
suggest that you cut back to 140 reduces. You should save a lot of
time in the shuffle.
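
If it's useful, here's a minimal sketch of capping the reduce count
via the JobConf API; ScaledJob and the job name are placeholders, and
140 just encodes the two-reduces-per-node rule of thumb for your 70
nodes:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ScaledJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ScaledJob.class);
    conf.setJobName("scaled-run");

    // Roughly two reduces per node lets every reduce run in a
    // single wave instead of queueing behind earlier waves.
    conf.setNumReduceTasks(140);

    // ... set mapper, reducer, input and output paths as before ...

    JobClient.runJob(conf);
  }
}

The equivalent config knob is mapred.reduce.tasks, if you'd rather set
it in your job's configuration than in code.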
> The task setup in all the above cases takes 30 seconds or so. But
> then the map process practically crawls.
Look at the web UI and find the execution times on the maps without
the shuffle. Are they similar between the 50m and 500m runs? If not,
it would help to get in there with a profiler. Either use one of your
own or try out my HADOOP-2367 patch. *smile*
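
If the patch doesn't apply cleanly for you, a rough alternative is to
profile the task JVMs with the stock hprof agent; the helper below is
a made-up name, and note that overriding mapred.child.java.opts
replaces the default -Xmx200m, so repeat it:

import org.apache.hadoop.mapred.JobConf;

public class HprofSetup {
  // Hypothetical helper: pass the JDK's built-in hprof agent to
  // every task child JVM. Each attempt writes java.hprof.txt into
  // its own working directory.
  public static void enableCpuSampling(JobConf conf) {
    conf.set("mapred.child.java.opts",
        "-Xmx200m -agentlib:hprof=cpu=samples,depth=6,interval=10,thread=y");
  }
}

Try it on a small slice of the 500m input first; hprof's sampling
overhead is modest, but the dump files add up fast across 2000 maps.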
-- Owen