Hi,

The number of mappers launched depends largely on your input format (specifically, on its getSplits() method). Almost all input formats shipped with Hadoop derive from FileInputFormat, hence the "one mapper per file block" notion; strictly speaking it is one mapper per split. You say that you have too many small files. In general, each of these small files (< 64 MB) will be processed by its own mapper. I would suggest looking at CombineFileInputFormat, which packs many small files into a single split while taking data locality into account, giving better performance (task initialization time is a significant factor in Hadoop's overall performance). On the other hand, keep in mind that many small files will also hurt your NameNode, since file metadata is held in memory, and will limit its overall capacity with respect to the number of files.
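As a rough sketch of that suggestion (assuming a Hadoop release with the newer mapreduce API that ships CombineTextInputFormat, a concrete subclass of CombineFileInputFormat; the driver class and job names below are just illustrative), a driver that packs small text files into ~64 MB combined splits might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class, for illustration only.
public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "combine-small-files");
    job.setJarByClass(SmallFilesDriver.class);

    // Pack many small files into each split instead of one split per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at ~64 MB; tune this to control how many mappers run.
    CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

    // Identity map-only job, just to show where your own mapper would plug in.
    job.setMapperClass(Mapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

For a streaming job the same idea should apply by passing the old-API class (org.apache.hadoop.mapred.lib.CombineTextInputFormat, where your version provides it) via the -inputformat option, but check what your release actually ships before relying on it.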
Amogh

On 2/25/10 11:15 PM, "Michael Kintzer" <[email protected]> wrote:

Hi,

We are using the streaming API. We are trying to understand what Hadoop uses as a threshold or trigger to involve more TaskTracker nodes in a given Map-Reduce execution. With default settings (64 MB chunk size in HDFS), if the input file is less than 64 MB, will the data processing only occur on a single TaskTracker node, even if our cluster size is greater than 1?

For example, we are trying to figure out if Hadoop is more efficient at processing:
a) a single input file which is just an index file that refers to a jar archive of 100K or 1M individual small files, where the jar file is passed as the "-archives" argument, or
b) a single input file containing all the raw data represented by the 100K or 1M small files.

With (a), our input file is < 64 MB. With (b), our input file is very large.

Thanks for any insight,

-Michael
