Michael,

On Tue, Nov 13, 2007 at 08:56:36AM -0800, Michael Harris wrote:
> I have a question about file compression in Hadoop. When I set
> io.seqfile.compression.type=BLOCK does this also compress actual files I
> load into the DFS, or does this only control the map/reduce file
> compression? If it doesn't compress the files on the file system, is there
> any way to compress a file when it's loaded? The concern here is that I am
> just getting started with Pig/Hadoop and have a very small cluster of
> around 5 nodes. I want to limit IO wait by compressing the actual data. As
> a test, when I compressed our 4GB log file using rar it was only 280 MB.
io.seqfile.compression.type applies only to SequenceFiles; HDFS itself does not transparently compress ordinary files you copy in. If you load your data into HDFS as a SequenceFile and set io.seqfile.compression.type=BLOCK (or RECORD), the records in the file will be compressed. Equivalently, you can use one of the many SequenceFile.createWriter methods (see http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/SequenceFile.html) to specify the compression type, compression codec, etc.
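
Something along these lines (a minimal sketch; the /logs/app.log.seq destination path and the line-number keys are just placeholders) would load a local log file into HDFS as a block-compressed SequenceFile:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class CompressedLoad {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Block-compressed SequenceFile of <line number, line text>
        // records; the destination path is hypothetical.
        Path out = new Path("/logs/app.log.seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK, new DefaultCodec());

        // Read the local log file named on the command line and append
        // each line as a record; with BLOCK compression the records are
        // buffered and compressed a block at a time.
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        LongWritable key = new LongWritable();
        Text value = new Text();
        long lineNo = 0;
        String line;
        while ((line = in.readLine()) != null) {
          key.set(lineNo++);
          value.set(line);
          writer.append(key, value);
        }
        writer.close();
        in.close();
      }
    }

BLOCK compression generally gets better ratios than RECORD on text like logs, since the codec sees many records at once.

Arun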
