Plain cat would work if you don't care about total storage. Often the input to map-reduce programs is line- or record-oriented data that exhibits lots of redundancy and thus could be compressed significantly. Log files are a concrete example.
Thus, you might consider cat | gzip. That might not be good enough if you are worried about preserving the original file boundaries, perhaps for audit-trail purposes. Plain gzip is also not a particularly good choice, since you can't just seek to some point in the resulting file (which is helpful if you want to split up the work on any given file). Hadoop provides a block-oriented compression scheme that avoids this problem and allows the processing of a single file to be distributed; a rough sketch follows below the quoted message.

On 8/26/07 12:35 PM, "mfc" <[EMAIL PROTECTED]> wrote:

> When you talk about packaging lots of small files together before putting
> them into HDFS what are you talking about? Something as simple as cat?
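For the block-oriented route, one way to pack lots of small files is a block-compressed SequenceFile, with the original file name as the key and the file contents as the value. The sketch below is only illustrative: the paths, the Text key/value choice (assumes UTF-8 text such as logs), and the PackSmallFiles class name are my assumptions, not anything from this thread.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Block compression groups many records per compressed block, so the
    // resulting file stays splittable for map tasks.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/data/packed.seq"),   // HDFS target path (placeholder)
        Text.class, Text.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      // Assumes this local directory (placeholder) holds only small regular files.
      for (File f : new File("/var/log/myapp").listFiles()) {
        byte[] buf = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(buf);                       // files are small, so read whole contents
        } finally {
          in.close();
        }
        // Key = original file name, so per-file boundaries survive for audit purposes.
        writer.append(new Text(f.getName()), new Text(buf));
      }
    } finally {
      writer.close();
    }
  }
}

Because each key carries the original file name, you keep the boundaries you'd lose with a simple cat, and the block compression gets you most of the space savings of gzip without giving up splittability.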
