Amogh,

Thank you for the detailed information. Our initial prototyping agrees with your statements below, i.e., a single large input file performs better than an index file plus an archive of small files. I will take a look at CombineFileInputFormat as you suggested.
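For the archives, here is the rough shape of what I am planning to try. This is only an untested sketch: it assumes a Hadoop release that ships the new-API org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat (on 0.20 the equivalent lives in org.apache.hadoop.mapred.lib), and the class names CombinedLineInputFormat and PerFileLineReader, along with the 64 MB cap, are my own placeholders:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Concrete CombineFileInputFormat that packs many small text files
    // into each split and reads every packed file line by line.
    public class CombinedLineInputFormat
            extends CombineFileInputFormat<LongWritable, Text> {

        public CombinedLineInputFormat() {
            // Cap each combined split at roughly one HDFS block (64 MB)
            // so a split's data stays local to one node.
            setMaxSplitSize(64L * 1024 * 1024);
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context)
                throws IOException {
            // CombineFileRecordReader instantiates one PerFileLineReader
            // per file packed into the combined split.
            return new CombineFileRecordReader<LongWritable, Text>(
                    (CombineFileSplit) split, context, PerFileLineReader.class);
        }

        // Adapter with the (CombineFileSplit, TaskAttemptContext, Integer)
        // constructor that CombineFileRecordReader requires; it points a
        // plain LineRecordReader at the index-th file of the combined split.
        public static class PerFileLineReader
                extends RecordReader<LongWritable, Text> {

            private final LineRecordReader reader = new LineRecordReader();
            private final CombineFileSplit split;
            private final int index;

            public PerFileLineReader(CombineFileSplit split,
                    TaskAttemptContext context, Integer index) {
                this.split = split;
                this.index = index;
            }

            @Override
            public void initialize(InputSplit ignored,
                    TaskAttemptContext context)
                    throws IOException, InterruptedException {
                // Re-point the wrapped reader at one file of the split.
                FileSplit fileSplit = new FileSplit(
                        split.getPath(index), split.getOffset(index),
                        split.getLength(index), split.getLocations());
                reader.initialize(fileSplit, context);
            }

            @Override
            public boolean nextKeyValue()
                    throws IOException, InterruptedException {
                return reader.nextKeyValue();
            }

            @Override
            public LongWritable getCurrentKey() {
                return reader.getCurrentKey();
            }

            @Override
            public Text getCurrentValue() {
                return reader.getCurrentValue();
            }

            @Override
            public float getProgress() throws IOException {
                return reader.getProgress();
            }

            @Override
            public void close() throws IOException {
                reader.close();
            }
        }
    }
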
One question. Since the many small input files are all in a single jar archive managed by the name node, does that still hamper name node performance? I was under the impression these archives are only unpacked into the temporary map-reduce file space (and I'm assuming cleaned up after map-reduce completes). Does the name node need to store the metadata of each individual file during the unpacking for this case?

-Michael

On Feb 25, 2010, at 10:31 PM, Amogh Vasekar wrote:

> Hi,
> The number of mappers initialized depends largely on your input format (the
> getSplits of your input format). (Almost all) input formats available in
> Hadoop derive from FileInputFormat, hence the 1-mapper-per-file-block notion
> (this is actually 1 mapper per split).
> You say that you have too many small files. In general, each of these small
> files (< 64 MB) will be processed by a single mapper. However, I would
> suggest looking at CombineFileInputFormat, which packages many small files
> together based on data locality for better performance (initialization time
> is a significant factor in Hadoop's performance).
> On the other side, many small files will hamper your namenode's performance,
> since file metadata is stored in memory, and will limit its overall capacity
> with respect to the number of files.
>
> Amogh
>
>
> On 2/25/10 11:15 PM, "Michael Kintzer" <[email protected]> wrote:
>
> Hi,
>
> We are using the streaming API. We are trying to understand what Hadoop
> uses as a threshold or trigger to involve more TaskTracker nodes in a given
> map-reduce execution.
>
> With default settings (64 MB chunk size in HDFS), if the input file is less
> than 64 MB, will the data processing occur on only a single TaskTracker node,
> even if our cluster size is greater than 1?
>
> For example, we are trying to figure out whether Hadoop is more efficient at
> processing:
> a) a single input file which is just an index file referring to a jar
> archive of 100K or 1M individual small files, where the jar file is passed as
> the "-archives" argument, or
> b) a single input file containing all the raw data represented by the 100K or
> 1M small files.
>
> With (a), our input file is < 64 MB. With (b), our input file is very large.
>
> Thanks for any insight,
>
> -Michael
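P.S. If we end up moving from streaming to a custom jar, the driver wiring I have in mind looks roughly like the below. Again an untested sketch: it relies on the CombinedLineInputFormat placeholder from my earlier snippet, and PackSmallFilesDriver plus the arg handling are placeholders as well. With the streaming API I believe the format would instead be named via the -inputformat option, though I have not verified that with a combining format.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PackSmallFilesDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "pack-small-files"); // placeholder name
            job.setJarByClass(PackSmallFilesDriver.class);
            // Many small files under the input dir are combined into a few
            // splits, so far fewer mappers start than with 1 mapper per file.
            job.setInputFormatClass(CombinedLineInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }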
