On 3/21/08 8:29 AM, "Dan Tamowski" <[EMAIL PROTECTED]> wrote:

> -Does Hadoop/MR offer a clean abstraction for both consuming and producing a
> large number of files? (I know it can handily consume a large number of
> files, but all examples of output seem to form a single file)

Yes.

It works very well if "a large number" means fewer than hundreds of
thousands of files and each file is reasonably large (>> 1MB).  If either
condition fails, pasting your files together with a synchronization string
between them that you can scan for quickly works pretty well.
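
A minimal sketch of the pasting idea, in plain Python rather than any Hadoop
API.  The marker value and the function names (paste, pull_apart) are made up
for illustration, and it assumes the marker never occurs inside a file:

```python
# Hypothetical marker. Its first byte (\x00) appears nowhere else in it,
# so a match can never straddle the boundary between a file and a marker.
SYNC = b"\x00SYNC-3f9a1c\xff"

def paste(blobs, sync=SYNC):
    """Concatenate byte strings, writing the marker before each one."""
    for blob in blobs:
        if sync in blob:
            raise ValueError("marker occurs inside a file; pick another marker")
    return b"".join(sync + blob for blob in blobs)

def pull_apart(data, sync=SYNC):
    """Yield the original byte strings by scanning for the marker."""
    pos = data.find(sync)
    while pos != -1:
        start = pos + len(sync)
        pos = data.find(sync, start)
        yield data[start:] if pos == -1 else data[start:pos]

files = [b"first file", b"", b"third\nfile"]
combined = paste(files)
restored = list(pull_apart(combined))
```

Because every file is preceded by the marker, a reader dropped at an arbitrary
offset can scan forward to the next marker and start from there, which is what
makes the pasted file usable in pieces.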

> -Does Hadoop provide the input/output formats relevant to this or would I
> have to create my own? (e.g. non-splittable binary input, and binary output)

It has input formats for multiple input files (with an obvious name that I
am blanking on at the moment).  Building a glue factory to paste otherwise
unsplittable files together and pull them apart at map time would be pretty
easy.

> -Is this issue even well-suited to Hadoop in the first place? This type of
> job may only need the map phase, and not the reduce phase, so maybe I'm
> looking in the wrong place.

Hadoop is surprisingly beneficial in these cases, and you may well find a
reduce phase useful after all, if only to concatenate and/or summarize your
results.
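
As a toy illustration of the "summarize" case, again in plain Python and not
the Hadoop API: map_file, reduce_sizes and run are hypothetical names, and run
merely stands in for the grouping the framework would do between map and
reduce.

```python
from collections import defaultdict

def map_file(name, data):
    """Hypothetical per-file map step: emit (key, value) pairs."""
    # Key each file by its extension and emit its size in bytes.
    ext = name.rsplit(".", 1)[-1] if "." in name else ""
    yield ext, len(data)

def reduce_sizes(key, values):
    """Hypothetical reduce step: summarize all values seen for one key."""
    values = list(values)
    return key, {"files": len(values), "bytes": sum(values)}

def run(inputs):
    """Stand-in for the framework: group map output by key, then reduce."""
    groups = defaultdict(list)
    for name, data in inputs:
        for key, value in map_file(name, data):
            groups[key].append(value)
    return dict(reduce_sizes(k, v) for k, v in sorted(groups.items()))

summary = run([("a.bin", b"1234"), ("b.bin", b"12"), ("c.txt", b"x")])
```

Even when the real work is all in the map step, a small reduce like this
gives you one summary per key instead of one output per input file.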

The benefit in map-only jobs comes from moving the computation near the data
without explicit management by you.  You can come close to full disk
bandwidth without having to know where your data is, how to spawn jobs, how
many nodes there are, how failures are handled, or many other things.


