Hi,
You mentioned that you pass the files packed together using the -archives
option. This will unpack the archive on the compute node itself, so the
namenode won't be hampered in this case. However, cleaning up the distributed
cache is a tricky scenario (the user doesn't have explicit control over it);
you may search this list for the many discussions pertaining to this. And
while on the topic of archives: Hadoop Archives (har) provide similar
functionality, though they may not be practical for you as of now.
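If you do get around to trying har at some point, creating and reading one is
just a couple of commands. The paths and archive name below are placeholders,
and the exact options differ a little between releases (check the Hadoop
Archives guide for your version):

    # pack a directory of small files into a single archive
    hadoop archive -archiveName files.har /user/foo/smallfiles /user/foo/archives

    # the archived files are then addressable through the har:// scheme
    hadoop fs -ls har:///user/foo/archives/files.har

The nice property is that the namenode then tracks only the handful of index
and part files making up the .har rather than every small file inside it
(though the archive is immutable once written).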
Hope this helps.

Amogh


On 2/27/10 12:53 AM, "Michael Kintzer" <[email protected]> wrote:

Amogh,

Thank you for the detailed information. Our initial prototyping seems to agree
with your statements below, i.e. a single large input file performs better
than an index file + an archive of small files. I will take a look at
CombineFileInputFormat as you suggested.

One question. Since the many small input files are all in a single jar archive
managed by the namenode, does that still hamper namenode performance? I was
under the impression these archives are only unpacked into the temporary
map-reduce file space (and I'm assuming cleaned up after the map-reduce
completes). Does the namenode need to store the metadata of each individual
file during the unpacking in this case?

-Michael

On Feb 25, 2010, at 10:31 PM, Amogh Vasekar wrote:

> Hi,
> The number of mappers initialized depends largely on your input format (the
> getSplits of your input format). (Almost all) input formats available in
> Hadoop derive from FileInputFormat, hence the 1-mapper-per-file-block notion
> (this is actually 1 mapper per split).
> You say that you have too many small files. In general, each of these small
> files (< 64 MB) will be processed by its own mapper. However, I would suggest
> looking at CombineFileInputFormat, which does the job of packaging many small
> files together while accounting for data locality, for better performance
> (task initialization time is a significant factor in Hadoop's performance).
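> As a rough sketch of what that can look like (this assumes the newer
> "mapreduce" API, plain line-oriented records, and made-up class names, so
> treat it as a starting point rather than drop-in code):
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.InputSplit;
>     import org.apache.hadoop.mapreduce.RecordReader;
>     import org.apache.hadoop.mapreduce.TaskAttemptContext;
>     import org.apache.hadoop.mapreduce.lib.input.*;
>
>     // Illustrative only: packs many small text files into ~64 MB combined splits.
>     public class SmallFilesInputFormat
>         extends CombineFileInputFormat<LongWritable, Text> {
>
>       public SmallFilesInputFormat() {
>         setMaxSplitSize(64L * 1024 * 1024);  // upper bound on a combined split
>       }
>
>       @Override
>       public RecordReader<LongWritable, Text> createRecordReader(
>           InputSplit split, TaskAttemptContext context) throws IOException {
>         // One CombineFileRecordReader drives a per-file reader for each file
>         // packed into the combined split.
>         return new CombineFileRecordReader<LongWritable, Text>(
>             (CombineFileSplit) split, context, PerFileLineReader.class);
>       }
>
>       // Thin wrapper so LineRecordReader can read the idx-th file of a
>       // combined split (the 3-arg constructor is what the framework expects).
>       public static class PerFileLineReader
>           extends RecordReader<LongWritable, Text> {
>         private final LineRecordReader reader = new LineRecordReader();
>         private final CombineFileSplit split;
>         private final int idx;
>
>         public PerFileLineReader(CombineFileSplit split,
>                                  TaskAttemptContext ctx, Integer idx) {
>           this.split = split;
>           this.idx = idx;
>         }
>
>         @Override
>         public void initialize(InputSplit ignored, TaskAttemptContext ctx)
>             throws IOException, InterruptedException {
>           // Expose just one file of the combined split to the wrapped reader.
>           reader.initialize(new FileSplit(split.getPath(idx),
>               split.getOffset(idx), split.getLength(idx),
>               split.getLocations()), ctx);
>         }
>
>         @Override public boolean nextKeyValue() throws IOException {
>           return reader.nextKeyValue();
>         }
>         @Override public LongWritable getCurrentKey() { return reader.getCurrentKey(); }
>         @Override public Text getCurrentValue() { return reader.getCurrentValue(); }
>         @Override public float getProgress() throws IOException { return reader.getProgress(); }
>         @Override public void close() throws IOException { reader.close(); }
>       }
>     }
>
> Since you are on streaming you would additionally have to compile this into a
> jar and ship it with -libjars / -inputformat, and depending on your release
> streaming may insist on the older "mapred" flavour of these classes, so do
> check what your version supports.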
> On the other hand, many small files will hamper your namenode's performance,
> since file metadata is stored in its memory, and will limit its overall
> capacity with respect to the number of files.
>
> Amogh
>
>
> On 2/25/10 11:15 PM, "Michael Kintzer" <[email protected]> wrote:
>
> Hi,
>
> We are using the streaming API. We are trying to understand what Hadoop uses
> as a threshold or trigger to involve more TaskTracker nodes in a given
> Map-Reduce execution.
>
> With default settings (64 MB block size in HDFS), if the input file is less
> than 64 MB, will the data processing only occur on a single TaskTracker node,
> even if our cluster size is greater than 1?
>
> For example, we are trying to figure out whether Hadoop is more efficient at
> processing:
> a) a single input file which is just an index file that refers to a jar
> archive of 100K or 1M individual small files, where the jar file is passed as
> the "-archives" argument, or
> b) a single input file containing all the raw data represented by the 100K or
> 1M small files.
>
> With (a), our input file is < 64 MB. With (b), our input file is very large.
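>
> For concreteness, the sort of invocation we have in mind for (a) is roughly
> the following (names and paths are just placeholders):
>
>     hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
>         -archives 'hdfs:///user/foo/smallfiles.jar#smallfiles' \
>         -input /user/foo/index.txt \
>         -output /user/foo/out \
>         -mapper mapper.pl \
>         -reducer reducer.pl \
>         -file mapper.pl \
>         -file reducer.pl
>
> where each line of index.txt names an entry that the mapper then reads out of
> the unpacked "smallfiles" directory in its working directory.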
>
> Thanks for any insight,
>
> -Michael
>

