Small files are a bad idea for high throughput no matter what technology you use. The issue is that reading many small files costs a disk seek for nearly every file, so you need larger files in order to avoid spending most of the time on seeks.
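
As a rough illustration of the merging approach, here is a minimal sketch that concatenates a directory of small text files into one larger file before handing it to a job. It is plain Python rather than anything Hadoop-specific, and the function name merge_small_files and the paths in the usage note are hypothetical. It assumes the inputs are line-oriented text, so it adds a newline after any file that lacks a trailing one to keep the last line of one file from running into the first line of the next.

```python
import os


def merge_small_files(input_dir, output_path):
    """Concatenate every regular file in input_dir into output_path.

    Files are processed in sorted name order. Returns the number of
    files merged. (Hypothetical helper, not part of any Hadoop API.)
    """
    merged = 0
    out_abs = os.path.abspath(output_path)
    with open(output_path, "wb") as out:
        for name in sorted(os.listdir(input_dir)):
            path = os.path.join(input_dir, name)
            # Skip subdirectories, and the output file itself if it
            # happens to live inside input_dir.
            if not os.path.isfile(path) or os.path.abspath(path) == out_abs:
                continue
            with open(path, "rb") as src:
                data = src.read()  # inputs are small, so read each whole
            out.write(data)
            # Keep records from adjacent files on separate lines.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
            merged += 1
    return merged
```

For example, merge_small_files("small_inputs", "merged.txt") would produce a single file that can then be copied into HDFS and used as the job input in place of the original directory.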

On 3/28/08 7:34 PM, "Jason Curtes" <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I have been trying to run Hadoop on a set of small text files, not larger
> than 10k each. The total input size is 15MB. If I try to run the example
> word count application, it takes about 2000 seconds, more than half an hour
> to complete. However, if I merge all the files into one large file, it takes
> much less than a minute. I think using MultiInputFileFormat can be helpful
> at this point. However, the API documentation is not really helpful. I
> wonder if MultiInputFileFormat can really solve my problem, and if so, can
> you suggest me a reference on how to use it, or a few lines to be added to
> the word count example to make things more clear?
>
> Thanks in advance.
>
> Regards,
>
> Jason Curtes
