Hi, I am running a map/reduce job on a large cluster (70+ machines). I use a single input file and enough map/reduce tasks that each map process gets 250k records. That is, if my input file contains 1 million records, I use 4 map and 4 reduce processes so that each map process gets 250k records. Each map process usually takes about 30 seconds to complete.
A strange thing happens when I scale this problem:

- 1 million records, 4 maps + 4 reduces ==> 30 seconds per map process
- 5 million records, 20 maps + 20 reduces ==> 1 minute per map process
- 50 million records, 200 maps + 200 reduces ==> 3 minutes per map process
- 500 million records, 2000 maps + 2000 reduces ==> 45 minutes (!) per map process

Note that in all of the above cases, each map process performs the same amount of work (250k records). In every case I use a single large input file, and Hadoop breaks it into ~16 MB chunks (about 250k records each). The input format is TextInputFormat.class. I cannot think of any reason why this is happening. Task setup takes 30 seconds or so in every case, but then the map process practically crawls.
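For reference, the driver is set up roughly like the sketch below (old org.apache.hadoop.mapred API; the class name, the command-line paths/task counts, and the identity mapper/reducer are placeholders standing in for my real job logic):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ScalingTest {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ScalingTest.class);
        conf.setJobName("scaling-test");

        // Single large input file with line-oriented records
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Scale task counts with the input so each map sees ~250k records,
        // e.g. 4/4 for 1M records, 2000/2000 for 500M records.
        // setNumMapTasks is only a hint; the InputFormat decides the actual split count.
        conf.setNumMapTasks(Integer.parseInt(args[2]));
        conf.setNumReduceTasks(Integer.parseInt(args[3]));

        // Identity mapper/reducer stand in for the real map/reduce classes.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);
    }
}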