Amogh,

Thank you for the detailed information. Our initial prototyping seems to 
agree with your statements below, i.e. a single large input file performs 
better than an index file plus an archive of small files. I will take a 
look at CombineFileInputFormat as you suggested.

One question: since the many small input files are all in a single jar 
archive managed by the name node, does that still hamper name node 
performance? I was under the impression these archives are only unpacked 
into the temporary map-reduce file space (and, I'm assuming, cleaned up 
after the map-reduce job completes). Does the name node need to store the 
metadata of each individual file during the unpacking in this case?

-Michael

On Feb 25, 2010, at 10:31 PM, Amogh Vasekar wrote:

> Hi,
> The number of mappers initialized depends largely on your input format 
> (specifically, the getSplits method of your input format). Almost all 
> input formats available in Hadoop derive from FileInputFormat, hence the 
> "1 mapper per file block" notion (it is actually 1 mapper per split).
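> 
> To make that concrete, here is a sketch of the per-file sizing logic 
> FileInputFormat applies (paraphrased rather than copied from the Hadoop 
> source; minSize/maxSize correspond to the mapred.min.split.size and 
> mapred.max.split.size settings):
> 
>     public class SplitSizeSketch {
>         // Split size = block size, clamped by the configured
>         // min/max split sizes.
>         static long computeSplitSize(long blockSize, long minSize, long maxSize) {
>             return Math.max(minSize, Math.min(maxSize, blockSize));
>         }
> 
>         public static void main(String[] args) {
>             long blockSize = 64L << 20; // 64 MB HDFS block
>             // With default min/max, the split size equals the block
>             // size, so a file smaller than one block yields exactly
>             // one split, i.e. one mapper.
>             System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
>         }
>     }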
> You say that you have too many small files. In general, each of these 
> small files (< 64 MB) will be processed by its own mapper. However, I 
> would suggest looking at CombineFileInputFormat, which packages many 
> small files together based on data locality for better performance 
> (task initialization time is a significant factor in Hadoop's 
> performance).
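> 
> A minimal sketch of that wiring, using the newer mapreduce API. Note 
> that LineRecordReaderWrapper is a placeholder name for a small per-file 
> RecordReader you would write yourself (it is not a class shipped in the 
> Hadoop jar):
> 
>     import java.io.IOException;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.InputSplit;
>     import org.apache.hadoop.mapreduce.RecordReader;
>     import org.apache.hadoop.mapreduce.TaskAttemptContext;
>     import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
>     import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
>     import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
> 
>     // Packs many small files into fewer, locality-aware splits so one
>     // mapper processes a batch of files instead of a single tiny file.
>     public class CombinedTextInputFormat
>             extends CombineFileInputFormat<LongWritable, Text> {
> 
>         public CombinedTextInputFormat() {
>             // Cap each combined split at roughly one HDFS block.
>             setMaxSplitSize(64L << 20);
>         }
> 
>         @Override
>         public RecordReader<LongWritable, Text> createRecordReader(
>                 InputSplit split, TaskAttemptContext context)
>                 throws IOException {
>             // Delegates to one per-file reader (the hypothetical
>             // LineRecordReaderWrapper) for each file in the split.
>             return new CombineFileRecordReader<LongWritable, Text>(
>                     (CombineFileSplit) split, context,
>                     LineRecordReaderWrapper.class);
>         }
>     }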
> On the other hand, many small files will hamper your namenode's 
> performance, since file metadata is stored in memory, and will limit 
> the cluster's overall capacity with respect to the number of files.
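> 
> As a rough, back-of-the-envelope illustration (the ~150 bytes per 
> namespace object figure is a commonly cited rule of thumb, not an 
> exact number):
> 
>     1,000,000 small files ~= 1,000,000 inodes + 1,000,000 blocks
>                            = 2,000,000 objects x ~150 bytes
>                           ~= 300 MB of namenode heap
> 
> whereas a single large file costs one inode plus one block entry per 
> 64 MB of data.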
> 
> Amogh
> 
> 
> On 2/25/10 11:15 PM, "Michael Kintzer" <[email protected]> wrote:
> 
> Hi,
> 
> We are using the streaming API. We are trying to understand what Hadoop 
> uses as a threshold or trigger to involve more TaskTracker nodes in a 
> given Map-Reduce execution.
> 
> With default settings (64 MB block size in HDFS), if the input file is 
> less than 64 MB, will the data processing occur on only a single 
> TaskTracker node, even if our cluster size is greater than 1?
> 
> For example, we are trying to figure out whether Hadoop is more 
> efficient at processing:
> a) a single input file which is just an index file referring to a jar 
> archive of 100K or 1M individual small files, where the jar file is 
> passed as the "-archives" argument, or
> b) a single input file containing all the raw data represented by the 
> 100K or 1M small files.
> 
> With (a), our input file is < 64 MB. With (b), our input file is very large.
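> 
> For concreteness, the two invocations would look something like the 
> following (all paths and script names are made up; generic options 
> such as -archives must come before the streaming-specific ones):
> 
>     # (a) Small index file as the job input; the jar of small files is
>     #     shipped through the distributed cache and unpacked into each
>     #     task's local working directory under the "data" symlink.
>     hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
>         -archives hdfs:///user/mk/small-files.jar#data \
>         -input /user/mk/index.txt \
>         -output /user/mk/out-a \
>         -mapper map.sh -reducer reduce.sh \
>         -file map.sh -file reduce.sh
> 
>     # (b) One large concatenated file as the input itself.
>     hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
>         -input /user/mk/all-data.txt \
>         -output /user/mk/out-b \
>         -mapper map.sh -reducer reduce.sh \
>         -file map.sh -file reduce.sh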
> 
> Thanks for any insight,
> 
> -Michael
> 
