Ted Dunning wrote:
Yes.  I am recommending a pre-processing step before the map-reduce program.

And yes. They do get split up again.  They also get copied to multiple nodes
so that the reads can proceed in parallel.  The most important effects of
concatenating the files and importing them into HDFS are the parallelism and
the reading of sequential disk blocks during processing.
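For concreteness, below is a minimal sketch of that kind of pre-processing
step: pack a directory of small local files into one SequenceFile in HDFS so
that the maps later read long sequential runs of blocks.  The paths and the
filename-as-key / raw-bytes-as-value layout are illustrative assumptions, not
something prescribed in this thread.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);      // HDFS, per the cluster config

        File localDir = new File(args[0]);         // directory of small local files
        Path target = new Path(args[1]);           // e.g. /data/packed.seq (made up)

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, target, Text.class, BytesWritable.class);
        try {
            for (File f : localDir.listFiles()) {
                byte[] buf = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    int off = 0;
                    while (off < buf.length) {
                        off += in.read(buf, off, buf.length - off);
                    }
                } finally {
                    in.close();
                }
                // key = original filename, value = file contents
                writer.append(new Text(f.getName()), new BytesWritable(buf));
            }
        } finally {
            writer.close();
        }
    }
}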

Actually, Hadoop's map-reduce usually works on 'logical' splits, i.e. each map works only on its 'logical' split (a <filename, offset, length> triplet).

One end of the spectrum is for each map to work on a whole input file (e.g. compressed files such as gzip/zlib/lzo), since compressed files cannot be logically split without decompressing them first.

Arun
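As an illustration of the whole-file end of that spectrum, here is a minimal
sketch (not anything from this thread) of an input format that refuses to
split its files, so each map gets one split covering the entire file, i.e. the
<filename, 0, length> triplet.  Hadoop's own TextInputFormat does the
equivalent check for compressed input; the class name here is made up.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        // Never split: each map receives a single split spanning the whole file.
        return false;
    }
}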

The number of replicas, the number of large files, and the size of the splits
determine how many map functions you can run in parallel without getting IO
bound.
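As a rough, hedged illustration of those knobs (the path and values are made
up; the split-size property name is the one used by the old mapred API, so
check it against your release):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ParallelismKnobs {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ParallelismKnobs.class);

        // More replicas of each block means more nodes that can serve a
        // node-local read for the same split.
        FileSystem fs = FileSystem.get(conf);
        fs.setReplication(new Path("/data/packed.seq"), (short) 3);

        // A larger minimum split size keeps each map reading long sequential
        // runs instead of many tiny chunks.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);

        // A hint only; the real map count comes from the computed splits.
        conf.setNumMapTasks(100);
    }
}

Replication can also be set when the data is first copied into HDFS rather
than after the fact.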

If you are working on a small problem, then running Hadoop on a single node
and accessing the local file system works just fine, but if you can do that,
you might as well just write a sequential program in the first place.  If you
have a large problem that requires parallelism, then reading from a local file
system is likely to be a serious bottleneck.  This is particularly true if you
are processing your data repeatedly, as is relatively common when, say, doing
log processing of various kinds at multiple time scales.


On 8/26/07 5:45 PM, "mfc" <[EMAIL PROTECTED]> wrote:


[concatenation .. Compression] ...but then the map/reduce job in Hadoop breaks
the large files back down into small chunks.  This is what prompted the
question in the first place about running Map/Reduce directly on the small
files in the local file system.

I'm wondering if doing the conversion to large files and copying them into
HDFS would introduce a lot of overhead that would not be necessary if
map/reduce could be run directly on the local file system on the small files.


