Hi,

I am running a map/reduce job on a large cluster (70+ machines). I use a
single input file and a sufficient number of map/reduce tasks so that each
map process gets 250k records. That is, if my input file contains 1
million records, I use 4 map and 4 reduce processes so that each map process
gets 250k records. Each map usually takes about 30 seconds to complete.
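
For reference, the job is set up roughly like this (old mapred API; the paths
are placeholders and the identity classes stand in for my real mapper/reducer):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class ScalingTest {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(ScalingTest.class);
      conf.setJobName("scaling-test");

      // One large input file; Hadoop splits it into ~16 MB chunks (~250k records each).
      FileInputFormat.setInputPaths(conf, new Path("/user/me/input/records.txt"));
      FileOutputFormat.setOutputPath(conf, new Path("/user/me/output"));
      conf.setInputFormat(TextInputFormat.class);

      // Scaled with the input: 4/4 for 1M records, up to 2000/2000 for 500M records.
      conf.setNumMapTasks(4);     // a hint; the input format decides the actual split count
      conf.setNumReduceTasks(4);

      // Identity classes stand in for my real map/reduce classes here.
      conf.setMapperClass(IdentityMapper.class);
      conf.setReducerClass(IdentityReducer.class);
      conf.setOutputKeyClass(LongWritable.class);  // TextInputFormat keys are byte offsets
      conf.setOutputValueClass(Text.class);

      JobClient.runJob(conf);
    }
  }

Nothing in this setup changes between runs except the input size and the two
task counts.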

A strange thing happens when I scale this problem:

1 million records, 4 map + 4 reduce ==> 30 seconds per map process
5 million records, 20 map + 20 reduce ==> 1 minute per map process
50 million records, 200 map + 200 reduce ==> 3 minutes per map process
500 million records, 2000 map + 2000 reduce ==> 45 minutes (!) per map process

Note that in all of the above cases, each map process performs the same amount
of work (250k records).

In all cases I use a single large input file. Hadoop breaks the file
into ~16 MB chunks (about 250k records each). The input format is
TextInputFormat.class. I cannot think of any reason why this is happening.
Task setup in all the above cases takes 30 seconds or so, but after that
the map process practically crawls.
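
For what it's worth, here is my understanding of how the ~16 MB split size
falls out of the numbers above (this paraphrases FileInputFormat's split
sizing, so treat the formula and the byte figures as assumptions):

  public class SplitSizeEstimate {
    public static void main(String[] args) {
      // Paraphrase of how FileInputFormat sizes splits (an assumption, not the exact source):
      //   goalSize  = totalInputBytes / requestedNumMaps
      //   splitSize = max(minSplitSize, min(goalSize, dfsBlockSize))
      // The byte figures are back-calculated from ~16 MB per 250k records, so they are estimates.
      long totalInputBytes = 2000L * 16 * 1024 * 1024;   // ~31 GB for the 500M-record case
      long requestedNumMaps = 2000;
      long minSplitSize = 1;                              // mapred.min.split.size default
      long dfsBlockSize = 64L * 1024 * 1024;              // default DFS block size

      long goalSize = totalInputBytes / requestedNumMaps;
      long splitSize = Math.max(minSplitSize, Math.min(goalSize, dfsBlockSize));
      System.out.println("estimated split size: " + (splitSize / (1024 * 1024)) + " MB per map");
    }
  }

So with the task counts above, the split size stays at ~16 MB at every scale,
which is why each map always sees about 250k records.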