Plain cat would work if you don't care about total storage.

Often the inputs to map-reduce programs are line- or record-oriented data that
exhibit a lot of redundancy and thus could be compressed significantly.  Log
files are a concrete example.

Thus, you might consider cat | gzip.  That might not be good enough if you
care about preserving the original file boundaries, perhaps for audit-trail
purposes, since concatenation discards them.

Plain gzip is also not a particularly good choice, since you can't seek to an
arbitrary point in the resulting file (which matters if you want to split up
the work on any given file).  Hadoop provides block-oriented compression
that avoids this problem and allows the processing of a single file to be
distributed.
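
For what it's worth, a rough sketch of one way to do this with the SequenceFile
API and CompressionType.BLOCK follows; the PackSmallFiles class name and the
filename-as-key / contents-as-value layout are just illustrative choices, not
something prescribed by Hadoop:

// Sketch only: pack many small local files into one block-compressed
// SequenceFile in HDFS.  Key = original file name, value = file contents,
// so per-file boundaries survive and the output stays splittable.
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);               // destination in HDFS, e.g. /logs/packed.seq

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class, CompressionType.BLOCK);
    try {
      for (int i = 1; i < args.length; i++) {   // remaining args: small local files to pack
        byte[] data = Files.readAllBytes(new File(args[i]).toPath());
        writer.append(new Text(args[i]), new BytesWritable(data));
      }
    } finally {
      writer.close();
    }
  }
}

Keeping the original file name as the key preserves the per-file boundaries
mentioned above, while BLOCK compression keeps the packed file splittable
across map tasks.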


On 8/26/07 12:35 PM, "mfc" <[EMAIL PROTECTED]> wrote:

> When you talk about packaging lots of small files together before putting
> them into HDFS what are you talking about? Something as simple as cat?
