So it seems best for my application if I can somehow consolidate smaller files
into a couple of large files.
All of my files reside on S3, and I am using the 'distcp' command to copy them
to HDFS on EC2 before running an MR job. I was thinking it would be nice if I
could modify distcp so that each EC2 instance running 'distcp' on the cluster
concatenates its input files into a single file, so that at the end of the copy
process we will have as many files as there are machines in the cluster.
Any thoughts on how I should proceed with this, or on whether it is a good idea
at all?
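
To make the idea concrete, here is a minimal sketch of the grouping and
concatenation step on its own. It is not distcp and does not touch S3 or HDFS:
it works on local paths only, and the function names (assign_round_robin,
concatenate, consolidate) and the "bundle-NNNNN" output naming are made up for
illustration. It also assumes line-oriented text records, so that files can be
joined safely as long as each one ends with a newline.

```python
import os
import tempfile


def assign_round_robin(paths, num_bundles):
    """Deal the sorted file list out to num_bundles groups, one per machine."""
    groups = [[] for _ in range(num_bundles)]
    for i, path in enumerate(sorted(paths)):
        groups[i % num_bundles].append(path)
    return groups


def concatenate(paths, out_path):
    """Append each (small) input file to out_path, keeping records line-aligned."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                data = src.read()
            out.write(data)
            # Make sure the last record of one file does not run into
            # the first record of the next.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")


def consolidate(paths, out_dir, num_bundles):
    """Write one bundle per non-empty group and return the bundle paths."""
    bundles = []
    for n, group in enumerate(assign_round_robin(paths, num_bundles)):
        if not group:
            continue
        out_path = os.path.join(out_dir, "bundle-%05d" % n)
        concatenate(group, out_path)
        bundles.append(out_path)
    return bundles


if __name__ == "__main__":
    # Demo: 1000 tiny files consolidated into one bundle per "machine".
    with tempfile.TemporaryDirectory() as src_dir, \
            tempfile.TemporaryDirectory() as dst_dir:
        inputs = []
        for i in range(1000):
            p = os.path.join(src_dir, "part-%04d.txt" % i)
            with open(p, "w") as f:
                f.write("record %d\n" % i)
            inputs.append(p)
        print(len(consolidate(inputs, dst_dir, 10)))
```

Round-robin assignment balances the number of files per bundle, not the number
of bytes. If file sizes vary a lot, assigning each file to the currently
smallest bundle would probably give more even output.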
Ted Dunning <[EMAIL PROTECTED]> wrote:
The split will depend entirely on the input format that you use and the
files that you have. In your case, you have lots of very small files so the
limiting factor will almost certainly be the number of files. Thus, you
will have 1000 splits (one per file).
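
A rough way to see why the file count dominates is sketched below. This is a
simplified model, not Hadoop's actual code: it assumes a file-based input
format that never lets a split span two files, targets a split size of
max(min_split, min(total / requested_maps, block_size)), and uses a 64 MB block
size. The formula, the block size, and the function names are assumptions made
for illustration, and details of the real implementation are ignored.

```python
import math


def split_size(total_bytes, requested_maps, block_size, min_split=1):
    """Target split size: the per-map goal size, capped by the block size."""
    goal = total_bytes // max(requested_maps, 1)
    return max(min_split, min(goal, block_size))


def count_splits(file_sizes, requested_maps, block_size=64 * 1024 * 1024):
    """Estimate the split count; empty files are ignored in this model."""
    size = split_size(sum(file_sizes), requested_maps, block_size)
    # Each file is cut up independently, so every non-empty file
    # contributes at least one split no matter how small it is.
    return sum(math.ceil(s / size) for s in file_sizes if s > 0)


if __name__ == "__main__":
    kb, mb = 1024, 1024 * 1024
    print(count_splits([100 * kb] * 1000, requested_maps=10))  # many tiny files
    print(count_splits([10 * mb] * 10, requested_maps=10))     # ten 10 MB files
    print(count_splits([100 * mb], requested_maps=10))         # one 100 MB file
```

Under these assumptions the first case gives 1000 splits and the other two give
10 each, so the requested number of map tasks only matters once the files are
large enough to be cut up.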
Your performance, btw, will likely be pretty poor with so many small files.
Can you consolidate them? 100MB of data should probably be in no more than
a few files if you want good performance. Even then, most kinds of processing
will be completely dominated by job startup time. If your jobs are I/O
bound, they will be able to read 100MB of data in just a few seconds at
most. Startup time for a Hadoop job is typically 10 seconds or more.
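
The arithmetic behind that point can be sketched as follows. The 10 second
startup figure comes from the paragraph above; the 50 MB/s aggregate read rate
is only an assumed number chosen so that 100 MB takes "a few seconds", and the
function name is made up for illustration.

```python
def startup_fraction(data_mb, read_mb_per_s, startup_s):
    """Fraction of total wall-clock time spent on job startup."""
    read_s = data_mb / read_mb_per_s
    return startup_s / (startup_s + read_s)


if __name__ == "__main__":
    # 100 MB at an assumed 50 MB/s is 2 s of reading against 10 s of startup.
    print(round(startup_fraction(100, 50, 10), 2))
```

With those numbers startup accounts for 10 of the 12 seconds, so tuning the
read path would barely change the total run time.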
On 4/4/08 12:58 PM, "Prasan Ary" wrote:
> I have a question on how input files are split before they are given out to
> Map functions.
> Say I have an input directory containing 1000 files whose total size is 100
> MB, and I have 10 machines in my cluster and I have configured 10
> mapred.map.tasks in hadoop-site.xml.
>
> 1. With this configuration, do we have a way to know what size each split
> will be?
> 2. Does split size depend on how many files there are in the input
> directory? What if I have only 10 files in input directory, but the total size
> of all these files is still 100 MB? Will it affect split size?
>
> Thanks.