Re: [proposal] RAgzip: multiple map tasks for a large gzipped file

Johan Oskarsson Tue, 11 Nov 2008 03:28:19 -0800

I just finished writing a similar input format for lzo files compressed
with lzop. It also requires an index file to be created.


The results were surprisingly good: a very basic mapred job over 288 GB of
data (47 GB when compressed) ran 30% faster on the compressed input than
on the uncompressed input.

I'm going to try to clean up the code and make a patch from it.

/Johan

김대현[로그모델링] wrote:
> Hello,
> 
> I’m new to this mailing list, and this is my first attempt at contributing.
> 
>  
> 
> We have made a patch that enables multiple map tasks for one large *gzipped*
> file. We call the patch RAgzip, short for Random Access gzip. It is similar
> to HADOOP-3646, which supports large bzip2 files, and is an alternative to
> the approach of PIG-42, which requires re-compression.
> 
>  
> 
> RAgzip uses zlib's inflatePrime function, which makes random access into a
> gzipped file possible. Since inflatePrime was introduced in zlib 1.2.2.4,
> RAgzip requires zlib 1.2.2.4 or higher. (We tested with zlib 1.2.3.)
> 
>  
> 
> RAgzip requires a preprocessing step that creates an access point (.ap)
> file, which serves as an index of the gzipped file's chunks. (Unfortunately,
> the preprocessing step appears to be sequential; we have not found a way to
> parallelize it.)
> 
>  
> 
> RAgzip splits the gzipped file using the .ap file. More specifically, when a
> map task starts, RAgzip reads the .ap file, gets the start position and the
> compression information for one partition of the gzipped file, decompresses
> that partition, and feeds it to the map task as input.
> 
>  
> 
> In short, you may use RAgzip by just changing InputFormat to 
> RAGZIPInputFormat.
> 
>  
> 
> We have packaged RAgzip in two forms:
> 
> 1. jar
> 
> - does not touch the Hadoop core
> 
>   - solves the zlib version conflict problem by statically linking zlib 1.2.3.
> 
> 2. hadoop patch
> 
> - integrated into Hadoop core
> 
> - patches ZlibDecompressor.{c,java}: libhadoop.so changes
> 
>   - the zlib version installed on the system must be 1.2.2.4 or higher.
> 
>  
> 
> What I want to ask is:
> 
> How should I contribute RAgzip to Hadoop? May I just submit the Hadoop patch
> (package 2) to JIRA?
> 
> I have read http://wiki.apache.org/hadoop/HowToContribute and changed our 
> source code to meet the coding style.
> 
>  
> 
> Any comments will be appreciated.
> 
> Thank you.
> 
>  
> 
> - Daehyun Kim
> 
>  
> 
>
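
The quoted message describes splitting one gzipped file by consulting an
index of access points. As a rough illustration only, the planning step might
look like the sketch below. The actual .ap layout is not given in this
thread, so the AccessPoint and Split types, their fields, and plan_splits are
all hypothetical names; the "compression information" the message mentions is
left out, and nothing here touches zlib or Hadoop.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessPoint:
    """Hypothetical index entry: a place where decompression can resume."""
    compressed_offset: int    # byte offset of the entry in the .gz file
    uncompressed_offset: int  # offset of the same point in the original data


@dataclass(frozen=True)
class Split:
    """Hypothetical unit of work handed to one map task."""
    start: AccessPoint   # where this task begins decompressing
    compressed_end: int  # exclusive end offset in the .gz file


def plan_splits(points: List[AccessPoint],
                compressed_size: int,
                target_bytes: int) -> List[Split]:
    """Group consecutive access points into splits of roughly target_bytes
    of compressed data. Each split starts exactly on an access point."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    ordered = sorted(points, key=lambda p: p.compressed_offset)
    if not ordered:
        return []

    splits = []
    start = ordered[0]
    for point in ordered[1:]:
        # Close the current split once it covers at least target_bytes.
        if point.compressed_offset - start.compressed_offset >= target_bytes:
            splits.append(Split(start, point.compressed_offset))
            start = point
    # The last split runs to the end of the compressed file.
    splits.append(Split(start, compressed_size))
    return splits
```

With access points every 40 compressed bytes in a 200-byte file and a target
of 80 bytes, this yields three splits covering [0, 80), [80, 160) and
[160, 200). Because every split begins on an access point, each map task can
start decompressing without reading the bytes that belong to earlier splits.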
