On Dec 26, 2007, at 2:50 PM, jag123 wrote:
> A strange thing happens when I scale this problem:
> 1 million records, 4 map + 4 reduce ==> 30 seconds per map process
> 5 million records, 20 map + 20 reduce ==> 1 minute per map process
> 50 million records, 200 map + 200 reduce ==> 3 minutes per map process
> 500 million records, 2000 map + 2000 reduce ==> 45 minutes! per map process
Thanks for sharing your problem. With a 70-node cluster, I'd strongly
suggest that you cut back to 140 reduces. You should save a lot of
time in the shuffle.
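
If it's useful, here's a minimal sketch of capping the reduce count
via the JobConf API; ScaledJob and the job name are placeholders, and
140 just encodes the two-reduces-per-node rule of thumb for your 70
nodes:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ScaledJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ScaledJob.class);
    conf.setJobName("scaled-run");

    // Roughly two reduces per node lets every reduce run in a
    // single wave instead of queueing behind earlier waves.
    conf.setNumReduceTasks(140);

    // ... set mapper, reducer, input and output paths as before ...

    JobClient.runJob(conf);
  }
}

The equivalent config knob is mapred.reduce.tasks, if you'd rather set
it in your job's configuration than in code.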
> The task setup in all the above cases takes 30 seconds or so. But
> then the map process practically crawls.
Look at the web UI and find the execution times on the maps without
the shuffle. Are they similar between the 50m and 500m runs? If not,
it would help to get in there with a profiler. Either use one of your
own or try out my HADOOP-2367 patch. *smile*
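
If the patch doesn't apply cleanly for you, a rough alternative is to
profile the task JVMs with the stock hprof agent; the helper below is
a made-up name, and note that overriding mapred.child.java.opts
replaces the default -Xmx200m, so repeat it:

import org.apache.hadoop.mapred.JobConf;

public class HprofSetup {
  // Hypothetical helper: pass the JDK's built-in hprof agent to
  // every task child JVM. Each attempt writes java.hprof.txt into
  // its own working directory.
  public static void enableCpuSampling(JobConf conf) {
    conf.set("mapred.child.java.opts",
        "-Xmx200m -agentlib:hprof=cpu=samples,depth=6,interval=10,thread=y");
  }
}

Try it on a small slice of the 500m input first; hprof's sampling
overhead is modest, but the dump files add up fast across 2000 maps.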
-- Owen