In principle I agree with you, Ted. However, in many cases we have multiple large jobs, each generating outputs that are not that big. As a result we end up with a large number of small files (significantly smaller than a Hadoop block), and the default splitting logic then triggers jobs with a very large number of tasks, which clogs the cluster inefficiently.
The MultiFileInputFormat helps to avoid that, but it has a problem: if the file set is a mix of small and large files, the splits are uneven, and it does not split single large files. To address this we've written our own InputFormat (for Text and SequenceFiles) that collapses small files into splits of up to the block size and cuts big files into block-sized splits. It has a twist: you can specify either the max number of MAPs that you want or the BLOCK size you want to use for the splits. When a particular split contains multiple small files, the suggested hosts for the split are ordered by how much of the data for those files each host holds. We still have to do some cleanup on the code, and then we'll submit it to Hadoop.

A

On Sat, Mar 29, 2008 at 10:20 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Small files are a bad idea for high throughput no matter what technology you
> use. The issue is that you need a larger file in order to avoid disk seeks.
>
> On 3/28/08 7:34 PM, "Jason Curtes" <[EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > I have been trying to run Hadoop on a set of small text files, not larger
> > than 10k each. The total input size is 15MB. If I try to run the example
> > word count application, it takes about 2000 seconds, more than half an hour
> > to complete. However, if I merge all the files into one large file, it takes
> > much less than a minute. I think using MultiInputFileFormat can be helpful
> > at this point. However, the API documentation is not really helpful. I
> > wonder if MultiInputFileFormat can really solve my problem, and if so, can
> > you suggest me a reference on how to use it, or a few lines to be added to
> > the word count example to make things more clear?
> >
> > Thanks in advance.
> >
> > Regards,
> >
> > Jason Curtes
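
To make the splitting idea from the reply at the top of this message more concrete, here is a minimal sketch in plain Python. It is not the InputFormat mentioned there and it does not use the Hadoop API. The names FileInfo, Chunk, target_split_size, build_splits and rank_hosts are made up for illustration, and the sketch assumes that every host listed for a file holds a full copy of that file. It also never cuts a small file across two splits, so with max_maps the number of splits is a target and may come out slightly higher.

```python
import math
from collections import Counter, namedtuple

# Hypothetical records for this sketch; none of these names come from Hadoop.
FileInfo = namedtuple("FileInfo", "path size hosts")   # hosts: nodes holding a copy
Chunk = namedtuple("Chunk", "path offset length hosts")


def target_split_size(files, block_size=None, max_maps=None):
    """Use the requested block size, or derive a size from a cap on map tasks."""
    if block_size is not None:
        return block_size
    if max_maps is None:
        raise ValueError("give either block_size or max_maps")
    total = sum(f.size for f in files)
    return max(1, math.ceil(total / max_maps))


def build_splits(files, block_size=None, max_maps=None):
    """Return a list of splits; each split is a list of Chunk records."""
    limit = target_split_size(files, block_size, max_maps)
    splits, current, current_bytes = [], [], 0
    for f in files:
        offset = 0
        # Cut full-size pieces off the front of any file that reaches the limit.
        while f.size - offset >= limit:
            splits.append([Chunk(f.path, offset, limit, f.hosts)])
            offset += limit
        tail = f.size - offset
        if tail == 0:
            continue
        # Pack the small file (or the tail of a big one) into the open split,
        # closing that split first if the piece would push it past the limit.
        if current_bytes + tail > limit:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(Chunk(f.path, offset, tail, f.hosts))
        current_bytes += tail
    if current:
        splits.append(current)
    return splits


def rank_hosts(split):
    """Order hosts by how many bytes of the split they hold, most first."""
    local_bytes = Counter()
    for chunk in split:
        for host in chunk.hosts:
            local_bytes[host] += chunk.length
    ranked = sorted(local_bytes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [host for host, _ in ranked]


# Example input: three small files and one file larger than the split size.
files = [
    FileInfo("a.txt", 10, ["h1", "h2"]),
    FileInfo("b.txt", 20, ["h2", "h3"]),
    FileInfo("big.seq", 150, ["h1", "h3"]),
    FileInfo("c.txt", 30, ["h2"]),
]
splits = build_splits(files, block_size=64)
for split in splits:
    print([(c.path, c.offset, c.length) for c in split], rank_hosts(split))
```

Calling build_splits(files, max_maps=2) instead derives the split size from the total input size, which is one way to read the "max number of MAPs" option. A real implementation would rank hosts from per-block locations rather than per-file host lists.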
