Small files are a bad idea for high throughput no matter what technology you
use.  The issue is that every file costs at least one disk seek, so you need
larger files that can be read sequentially in order to avoid most of those seeks.
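
If you want to automate the merge you already tried by hand, here is a rough
sketch in Python.  It is not Hadoop-specific and is only one way to do it: the
function name merge_small_files, the "*.txt" pattern, and the paths in the
usage note are placeholders of mine, and it assumes the inputs are plain text
files sitting in one local directory.

```python
from pathlib import Path


def merge_small_files(input_dir, output_path, pattern="*.txt"):
    """Concatenate every file matching `pattern` in `input_dir` into one file.

    Returns the number of files merged. Assumes the files are small enough
    to read into memory one at a time (true for ~10k inputs).
    """
    output_path = Path(output_path)
    merged = 0
    with open(output_path, "wb") as out:
        # Sort so the merged file has a stable, repeatable order.
        for path in sorted(Path(input_dir).glob(pattern)):
            # Skip the output file itself if it lives in the input directory.
            if path.resolve() == output_path.resolve():
                continue
            data = path.read_bytes()
            out.write(data)
            # Keep the last line of one file from running into the first
            # line of the next, which would glue two words together.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
            merged += 1
    return merged
```

You would call it as something like merge_small_files("input", "merged.txt"),
then copy merged.txt into HDFS and point the word count job at that one file
instead of the original directory.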


On 3/28/08 7:34 PM, "Jason Curtes" <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> I have been trying to run Hadoop on a set of small text files, not larger
> than 10k each. The total input size is 15MB. If I try to run the example
> word count application, it takes about 2000 seconds, more than half an hour
> to complete. However, if I merge all the files into one large file, it takes
> much less than a minute. I think using MultiInputFileFormat can be helpful
> at this point. However, the API documentation is not really helpful. I
> wonder if MultiInputFileFormat can really solve my problem, and if so, can
> you suggest a reference on how to use it, or a few lines to be added to
> the word count example to make things more clear?
> 
> Thanks in advance.
> 
> Regards,
> 
> Jason Curtes
