Yes, io.seqfile.compression.type controls compression of the sequence files used by map/reduce only. To compress files on the DFS independently of map/reduce, you can wrap the OutputStream that DistributedFileSystem.create() returns with a class from the java.util.zip package. For example, pass the org.apache.hadoop.fs.FSDataOutputStream that org.apache.hadoop.dfs.DistributedFileSystem.create() returns as an argument to the java.util.zip.GZIPOutputStream constructor.
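Here is a minimal sketch of what that looks like (the path and payload below are just placeholders):

    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GzipToDfs {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // With fs.default.name pointing at your namenode, this returns
        // the DistributedFileSystem.
        FileSystem fs = FileSystem.get(conf);

        // create() gives back an FSDataOutputStream; wrapping it in
        // GZIPOutputStream compresses everything written through it.
        FSDataOutputStream raw = fs.create(new Path("/logs/access.log.gz"));
        GZIPOutputStream out = new GZIPOutputStream(raw);
        try {
          out.write("some log data\n".getBytes());
        } finally {
          // close() writes the gzip trailer and closes the
          // underlying DFS stream.
          out.close();
        }
      }
    }

Reading the file back is the mirror image: wrap the stream from fs.open() in a java.util.zip.GZIPInputStream.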
> -----Original Message-----
> From: Michael Harris [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 13, 2007 10:27 PM
> To: [email protected]
> Subject: File Compression
>
> I have a question about file compression in Hadoop. When I set
> io.seqfile.compression.type=BLOCK, does this also compress the actual
> files I load into the DFS, or does it only control the map/reduce file
> compression? If it doesn't compress the files on the file system, is
> there any way to compress a file when it's loaded? The concern here is
> that I am just getting started with Pig/Hadoop and have a very small
> cluster of around 5 nodes. I want to limit IO wait by compressing the
> actual data. As a test, when I compressed our 4GB log file using rar,
> it was only 280 MB.
>
> Thanks,
> Michael
