Yes, io.seqfile.compression.type controls compression only for the
sequence files that map/reduce writes. To compress files on the DFS
independently of map/reduce, you can layer the java.util.zip package over
the OutputStream that DistributedFileSystem.create() returns. For example,
pass the org.apache.hadoop.fs.FSDataOutputStream returned by
org.apache.hadoop.dfs.DistributedFileSystem.create() to the
java.util.zip.GZIPOutputStream constructor.
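
Roughly, the pattern looks like this (an untested sketch; the destination
path, buffer size, and class name are placeholders, and FileSystem.get()
is used here as the usual way to obtain the configured DFS rather than
constructing DistributedFileSystem directly):

import java.io.FileInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);  // the configured DFS

    // create() hands back an FSDataOutputStream; wrapping it in a
    // GZIPOutputStream compresses the bytes before they reach the DFS.
    FSDataOutputStream raw = fs.create(new Path("/logs/app.log.gz"));
    GZIPOutputStream out = new GZIPOutputStream(raw);

    // Copy a local file (first command-line argument) into the
    // compressed stream.
    FileInputStream in = new FileInputStream(args[0]);
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = in.read(buf)) > 0) {
      out.write(buf, 0, n);
    }
    in.close();
    out.close();  // writes the gzip trailer and closes the DFS file
  }
}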

> -----Original Message-----
> From: Michael Harris [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, November 13, 2007 10:27 PM
> To: [email protected]
> Subject: File Compression
> 
> I have a question about file compression in Hadoop. When I 
> set io.seqfile.compression.type=BLOCK, does this also 
> compress the actual files I load into the DFS, or does it only 
> control map/reduce file compression? If it doesn't 
> compress the files on the file system, is there any way to 
> compress a file when it's loaded? The concern here is that I 
> am just getting started with Pig/Hadoop and have a very small 
> cluster of around 5 nodes. I want to limit IO wait by 
> compressing the actual data. As a test, when I compressed our 
> 4GB log file using rar it was only 280MB.
> 
> Thanks,
> Michael
> 
