Re: on number of input files and split size

Ted Dunning Sun, 06 Apr 2008 21:30:58 -0700


Even though it is embarrassing, I should re-iterate my own point here.


These speedups will apply to a conventional implementation on a single
machine as well.  Improving disk access patterns is just plain good.
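
A minimal sketch of that kind of consolidation, assuming line-oriented log
files sitting on a local disk. The function name consolidate and the
part-NNNNN output naming are made up for illustration; none of this is part
of Hadoop or distcp.

```python
import os
import shutil


def consolidate(paths, out_dir, target_bytes=250 * 1024 * 1024):
    """Concatenate small files into outputs of roughly target_bytes each.

    A new output file is started once the current one has reached
    target_bytes, so an output can overshoot by up to one input file.
    Returns the list of output paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    outputs = []
    out = None
    written = 0
    for path in paths:
        if out is None or written >= target_bytes:
            if out is not None:
                out.close()
            name = os.path.join(out_dir, "part-%05d" % len(outputs))
            out = open(name, "wb")
            outputs.append(name)
            written = 0
        size = os.path.getsize(path)
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)
            # Assumption: records are newline-terminated lines. If an input
            # lacks a final newline, add one so its last record does not
            # run into the first record of the next file.
            if size > 0:
                src.seek(-1, os.SEEK_END)
                if src.read(1) != b"\n":
                    out.write(b"\n")
                    size += 1
        written += size
    if out is not None:
        out.close()
    return outputs
```

Calling consolidate(sorted_log_paths, "merged") with the default target
would give files of about 250 MB, the size Colin reports below.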


On 4/6/08 6:16 PM, "Colin Freas" <[EMAIL PROTECTED]> wrote:

> i just wanted to reiterate ted's point here.
> 
> my first run through with hadoop i used our log files as they are, which
> are designed as small input files for a mysql database instance.  the files
> were at most a few megabytes in size, and we had something like 10,000
> of them.  performance was atrocious.  it was really disheartening.
> 
> but then i strung them together into files of about 250mb, and performance
> was fantastic.  then compressing those 250mb files increased performance
> again.  increased performance as in jobs that were taking hours (on 5
> machines) were now taking 20 minutes.
> 
> so, you know, if you're wondering is it really worth the trouble to get the
> input into larger chunks?  my experience, though limited, is that it
> absolutely is.
> 
> -colin
> 
> 
> On Fri, Apr 4, 2008 at 5:20 PM, Prasan Ary <[EMAIL PROTECTED]> wrote:
> 
>> So it seems best for my application if I can somehow consolidate smaller
>> files into a couple of large files.
>> 
>>  All of my files reside on S3, and I am using 'distcp' command to copy
>> them to hdfs on EC2 before running a MR job. I was thinking it would be nice
>> if I could modify distcp such that each EC2 image running 'distcp' on the
>> EC2 cluster will concatenate its input files into a single file, so that at
>> the end of the copy process, we will have as many files as there are machines
>> in the cluster.
>> 
>>  Any thoughts on how I should proceed on this? Or if this is a good idea
>> at all?
>> 
>> 
>> 
>> Ted Dunning <[EMAIL PROTECTED]> wrote:
>> 
>> The split will depend entirely on the input format that you use and the
>> files that you have. In your case, you have lots of very small files, so the
>> limiting factor will almost certainly be the number of files. Thus, you
>> will have 1000 splits (one per file).
>> 
>> Your performance, btw, will likely be pretty poor with so many small files.
>> Can you consolidate them? 100MB of data should probably be in no more than
>> a few files if you want good performance. At that size, most kinds of
>> processing will be completely dominated by job startup time. If your jobs
>> are I/O bound, they will be able to read 100MB of data in just a few
>> seconds at most. Startup time for a hadoop job is typically 10 seconds or
>> more.
>> 
>> 
>> On 4/4/08 12:58 PM, "Prasan Ary" wrote:
>> 
>>> I have a question on how input files are split before they are given out
>>> to Map functions.
>>> Say I have an input directory containing 1000 files whose total size is
>>> 100 MB, and I have 10 machines in my cluster and I have set
>>> mapred.map.tasks to 10 in hadoop-site.xml.
>>> 
>>> 1. With this configuration, do we have a way to know what size each
>>> split will be?
>>> 2. Does split size depend on how many files there are in the input
>>> directory? What if I have only 10 files in the input directory, but the
>>> total size of all these files is still 100 MB? Will it affect split size?
>>> 
>>> Thanks.
>>> 
>>
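
As a rough illustration of the reasoning quoted above, here is a back-of-
the-envelope sketch. It is a simplified model and not Hadoop's actual split
computation: it assumes every file yields at least one split and that larger
files are cut into pieces of an assumed split_bytes. The 64 MB split size and
the 50 MB/s read rate are assumptions chosen for the example, and the function
names are hypothetical.

```python
import math

MB = 1024 * 1024


def estimate_splits(file_sizes, split_bytes):
    """Simplified model: one split per file at minimum, more for big files."""
    return sum(max(1, math.ceil(size / split_bytes)) for size in file_sizes)


def read_seconds(total_bytes, bytes_per_second):
    """Time to stream total_bytes at a steady assumed read rate."""
    return total_bytes / bytes_per_second


# 1000 files of about 100 KB each (roughly 100 MB in total): one split per file.
many_small = estimate_splits([100 * 1024] * 1000, 64 * MB)

# The same data volume in 10 files of 10 MB each: one split per file again,
# but only 10 of them.
few_large = estimate_splits([10 * MB] * 10, 64 * MB)

# At an assumed 50 MB/s, reading 100 MB takes a couple of seconds, well under
# the 10 seconds or more of job startup time mentioned in the thread.
scan_time = read_seconds(100 * MB, 50 * MB)
```

Under these assumptions the file count, not the data volume, drives the
number of splits, which is the point made in the quoted reply.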
