Even though it is embarrassing, I should re-iterate my own point here.
These speedups will apply to a conventional implementation on a single
machine as well. Improving disk access patterns is just plain good.

On 4/6/08 6:16 PM, "Colin Freas" <[EMAIL PROTECTED]> wrote:

> i just wanted to reiterate ted's point here.
>
> my first run through with hadoop i used our log files as they are, which
> are designed as small input files for a mysql database instance. the files
> were at most a few megabytes in size, and we had something like 10,000
> of them. performance was atrocious. it was really disheartening.
>
> but then i strung them together into files of about 250mb and performance
> was fantastic. then compressing those 250mb files increased performance
> again. increased performance as in jobs that were taking hours (on 5
> machines) were now taking 20 minutes.
>
> so, you know, if you're wondering is it really worth the trouble to get
> the input into larger chunks? my experience, though limited, is that it
> absolutely is.
>
> -colin
>
>
> On Fri, Apr 4, 2008 at 5:20 PM, Prasan Ary <[EMAIL PROTECTED]> wrote:
>
>> So it seems best for my application if I can somehow consolidate smaller
>> files into a couple of large files.
>>
>> All of my files reside on S3, and I am using the 'distcp' command to copy
>> them to hdfs on EC2 before running a MR job. I was thinking it would be
>> nice if I could modify distcp such that each EC2 image running 'distcp'
>> on the EC2 cluster will concatenate input files into a single file, so
>> that at the end of the copy process, we will have as many files as there
>> are machines in the cluster.
>>
>> Any thoughts on how I should proceed on this? Or if this is a good idea
>> at all?
>>
>>
>> Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>> The split will depend entirely on the input format that you use and the
>> files that you have. In your case, you have lots of very small files so
>> the limiting factor will almost certainly be the number of files. Thus,
>> you will have 1000 splits (one per file).
>>
>> Your performance, btw, will likely be pretty poor with so many small
>> files. Can you consolidate them? 100MB of data should probably be in no
>> more than a few files if you want good performance. At that, most kinds
>> of processing will be completely dominated by job startup time. If your
>> jobs are I/O bound, they will be able to read 100MB of data in just a
>> few seconds at most. Startup time for a hadoop job is typically 10
>> seconds or more.
>>
>>
>> On 4/4/08 12:58 PM, "Prasan Ary" wrote:
>>
>>> I have a question on how input files are split before they are given
>>> out to Map functions.
>>> Say I have an input directory containing 1000 files whose total size is
>>> 100 MB, and I have 10 machines in my cluster and I have configured 10
>>> mapred.map.tasks in hadoop-site.xml.
>>>
>>> 1. With this configuration, do we have a way to know what size each
>>> split will be?
>>> 2. Does split size depend on how many files there are in the input
>>> directory? What if I have only 10 files in the input directory, but the
>>> total size of all these files is still 100 MB? Will it affect split
>>> size?
>>>
>>> Thanks.
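
For anyone wanting to try the consolidation discussed in this thread before
loading data into HDFS, here is a minimal sketch in Python. It is not how
distcp works and not anything from Hadoop itself; the names plan_batches,
consolidate and TARGET_BYTES are made up for illustration. It assumes the
inputs are line-oriented files (such as logs) where every record ends in a
newline, so plain concatenation does not merge records across file
boundaries. The 250 MB target is simply the size Colin reported above.

```python
import os
import shutil

# Assumed target size per consolidated file (~250 MB, from Colin's report).
TARGET_BYTES = 250 * 1024 * 1024


def plan_batches(sizes, target_bytes=TARGET_BYTES):
    """Group file indices into consecutive batches of roughly target_bytes."""
    batches, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Close the current batch once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        batches.append(current)
    return batches


def consolidate(paths, out_dir, target_bytes=TARGET_BYTES):
    """Concatenate the files in paths into larger files under out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    sizes = [os.path.getsize(p) for p in paths]
    outputs = []
    for n, batch in enumerate(plan_batches(sizes, target_bytes)):
        # Hypothetical output naming; any scheme works.
        out_path = os.path.join(out_dir, "part-%05d" % n)
        with open(out_path, "wb") as out:
            for index in batch:
                with open(paths[index], "rb") as src:
                    shutil.copyfileobj(src, out)
        outputs.append(out_path)
    return outputs
```

With one split per file, as Ted describes, the number of splits drops from
the number of small inputs to the number of files that consolidate()
returns. A single input larger than the target simply ends up in a batch of
its own.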
