Re: on number of input files and split size

Ted Dunning Fri, 04 Apr 2008 13:07:54 -0700

The splits depend entirely on the input format that you use and the files
that you have.  The standard file-based input formats never put more than
one file into a split, and you have lots of very small files, so the
limiting factor will almost certainly be the number of files.  Thus, you
will have 1000 splits (one per file), whatever mapred.map.tasks is set to.
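
To make the arithmetic concrete, here is a rough sketch of how a file-based
input format might arrive at a split count.  It is a simplified model, not
Hadoop's actual code: the function name estimate_splits, the 64MB block size,
and the assumption that all files are the same size are illustrative only.

```python
MB = 1024 * 1024

def estimate_splits(file_sizes, requested_maps, block_size=64 * MB, min_split=1):
    """Estimate the split count under a simplified per-file splitting model.

    Assumption: a split never spans more than one file, and the target split
    size is max(min_split, min(total / requested_maps, block_size)).
    """
    total = sum(file_sizes)
    goal = max(total // max(requested_maps, 1), 1)
    split_size = max(min_split, min(goal, block_size))
    # Every non-empty file yields at least one split; larger files are cut
    # into chunks of split_size (integer ceiling division).
    return sum(-(-size // split_size) for size in file_sizes if size > 0)

# 1000 equal files totalling roughly 100MB, with 10 requested map tasks
many_small = [100 * MB // 1000] * 1000
# 10 files of 10MB each, same total and same requested map tasks
few_large = [10 * MB] * 10

print(estimate_splits(many_small, 10))
print(estimate_splits(few_large, 10))
```

Under this model the requested number of map tasks only sets a target split
size; it cannot merge small files, so the file count sets the floor.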

Your performance, btw, will likely be pretty poor with so many small files.
Can you consolidate them?  100MB of data should probably be in no more than
a few files if you want good performance.  Even then, most kinds of
processing will be completely dominated by job startup time.  If your jobs
are I/O bound, they will be able to read 100MB of data in just a few seconds
at most.  Startup time for a Hadoop job is typically 10 seconds or more.
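
A back-of-the-envelope calculation shows how lopsided this is.  The sketch
below is only illustrative: the 50 MB/s aggregate read rate is an assumed
figure, not a measurement, and startup_fraction is a made-up helper name.

```python
def startup_fraction(data_mb, read_mb_per_s, startup_s):
    """Return the share of total job time spent on startup.

    Assumes the job is purely I/O bound, so run time is startup plus read.
    """
    read_s = data_mb / read_mb_per_s
    return startup_s / (startup_s + read_s)

# 100MB of input, an assumed 50 MB/s read rate, 10 seconds of job startup
share = startup_fraction(100, 50, 10)
print(f"startup share of total time: {share:.0%}")
```

With these assumed numbers the read takes 2 seconds against 10 seconds of
startup, so most of the wall-clock time goes to starting the job.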

On 4/4/08 12:58 PM, "Prasan Ary" <[EMAIL PROTECTED]> wrote:

> I have a question on how input files are split before they are given out to
> Map functions.
>   Say I have an input directory containing  1000 files whose total size is 100
> MB, and I have 10 machines in my cluster and I have configured 10
> mapred.map.tasks in hadoop-site.xml.
>    
>   1. With this configuration, do we have a way to know what size each split
> will be?
>   2. Does split size depend on how many files there are in the input
> directory? What if I have only 10 files in input directory, but the total size
> of all these files is still 100 MB? Will it affect split size?
>    
>   Thanks.
