In principle I agree with you, Ted.

However, in many cases we have multiple large jobs whose individual
outputs are not that big. As a result we end up with a large number of
small files (significantly smaller than a Hadoop block). With the
default splitting logic, jobs that read those files are given a large
number of tasks, which clogs the cluster inefficiently.

MultiFileInputFormat helps to avoid that, but it has a problem: if the
file set is a mix of small and large files, the splits are uneven, and
it does not split a single large file.

To address this we've written our own InputFormat (for Text and
SequenceFiles) that collapses small files into splits of up to the
block size and splits big files into block-size chunks.

It has a twist: you can specify either the maximum number of maps you
want or the block size you want to use for the splits.
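
To illustrate the packing idea, here is a simplified sketch. It is not
the actual InputFormat: it leaves out the Hadoop API entirely, and the
names SplitPlanner, Chunk, plan and targetSizeForMaxMaps are made up for
the example. Deriving the target size from a maximum number of maps by
dividing the total input size is my assumption about one reasonable way
to do it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, independent of Hadoop: plans which byte ranges go into which split.
final class SplitPlanner {

    /** One contiguous byte range of a file. */
    static final class Chunk {
        final String path;
        final long offset;
        final long length;

        Chunk(String path, long offset, long length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Each inner list is one split, i.e. the input of one map task. */
    static List<List<Chunk>> plan(String[] paths, long[] sizes, long targetSize) {
        if (targetSize <= 0) {
            throw new IllegalArgumentException("targetSize must be positive");
        }
        List<List<Chunk>> splits = new ArrayList<>();
        List<Chunk> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < paths.length; i++) {
            long offset = 0;
            // Big file: every full target-size range becomes a split of its own.
            while (sizes[i] - offset >= targetSize) {
                List<Chunk> whole = new ArrayList<>();
                whole.add(new Chunk(paths[i], offset, targetSize));
                splits.add(whole);
                offset += targetSize;
            }
            long rest = sizes[i] - offset;
            if (rest == 0) {
                continue;
            }
            // Small file (or tail of a big one): pack it into the open split,
            // closing that split first if adding the range would overflow it.
            if (currentBytes + rest > targetSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(new Chunk(paths[i], offset, rest));
            currentBytes += rest;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    /** Assumed policy: target size = total bytes / maxMaps, rounded up. */
    static long targetSizeForMaxMaps(long[] sizes, int maxMaps) {
        long total = 0;
        for (long s : sizes) {
            total += s;
        }
        return Math.max(1, (total + maxMaps - 1) / maxMaps);
    }
}
```

Because the packing is greedy, a target size derived from a maximum
number of maps keeps the split count close to that maximum but does not
strictly guarantee it.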

When a particular split contains multiple small files, the suggested
hosts for the split are ordered by how much of those files' data each
host holds, with the host holding the most listed first.
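
The host ordering can be sketched the same way. Again this is only an
illustration with made-up names (SplitHostRanker, rankHosts) and plain
arrays in place of Hadoop's block location data; breaking ties by host
name is an assumption added to make the order deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, independent of Hadoop: ranks hosts for one combined split.
final class SplitHostRanker {

    /**
     * hostsPerFile[i] lists the hosts holding file i of the split,
     * bytesPerFile[i] is the size of that file in bytes.
     */
    static List<String> rankHosts(String[][] hostsPerFile, long[] bytesPerFile) {
        // Sum, per host, the bytes of every file in the split stored on it.
        Map<String, Long> bytesByHost = new HashMap<>();
        for (int i = 0; i < hostsPerFile.length; i++) {
            for (String host : hostsPerFile[i]) {
                bytesByHost.merge(host, bytesPerFile[i], Long::sum);
            }
        }
        // Most local bytes first; ties broken by host name (assumption).
        List<String> hosts = new ArrayList<>(bytesByHost.keySet());
        hosts.sort((a, b) -> {
            int byBytes = Long.compare(bytesByHost.get(b), bytesByHost.get(a));
            return byBytes != 0 ? byBytes : a.compareTo(b);
        });
        return hosts;
    }
}
```

The ranked list is what the split would then report as its preferred
locations, so the scheduler tries the host with the most local data
first.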

We still have to do some cleanup on the code, and then we'll submit
it to Hadoop.

A

On Sat, Mar 29, 2008 at 10:20 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>  Small files are a bad idea for high throughput no matter what technology you
>  use.  The issue is that you need a larger file in order to avoid disk seeks.
>
>
>
>
>  On 3/28/08 7:34 PM, "Jason Curtes" <[EMAIL PROTECTED]> wrote:
>
>  > Hello,
>  >
>  > I have been trying to run Hadoop on a set of small text files, not larger
>  > than 10k each. The total input size is 15MB. If I try to run the example
>  > word count application, it takes about 2000 seconds, more than half an hour
>  > to complete. However, if I merge all the files into one large file, it takes
>  > much less than a minute. I think using MultiInputFileFormat can be helpful
>  > at this point. However, the API documentation is not really helpful. I
>  > wonder if MultiInputFileFormat can really solve my problem, and if so, can
>  > you suggest me a reference on how to use it, or a few lines to be added to
>  > the word count example to make things more clear?
>  >
>  > Thanks in advance.
>  >
>  > Regards,
>  >
>  > Jason Curtes
>
>
