I just finished writing a similar input format for lzo files compressed with lzop. It also requires an index file to be created.
The results were surprisingly good: a very basic mapred job over 288 GB of uncompressed and 47 GB of compressed data ran 30% faster on the compressed data. I'm going to try to clean up the code and make a patch from it.

/Johan

김대현[로그모델링] wrote:
> Hello,
>
> I’m new to this mailing list, and this is my first attempt at contributing.
>
> We have made a patch that enables multiple map tasks for one large *gzipped*
> file. We call the patch RAgzip, which is the abbreviation of Random Access
> gzip. It is like HADOOP-3646, which supports a big bzip2 file, and is an
> alternative to the approach of PIG-42, which requires re-compression.
>
> RAgzip uses zlib's inflatePrime function, which supports random access on a
> gzipped file. Since inflatePrime is available from version 1.2.2.4 onward,
> RAgzip requires zlib 1.2.2.4 or higher. (We tested on zlib 1.2.3.)
>
> RAgzip requires a preprocessing step that creates an access point (.ap)
> file, which is like an index of the gzipped file chunks. (Unfortunately, the
> preprocessing step seems to be sequential; that is, we have not found a way
> to parallelize it.)
>
> RAgzip splits the gzipped file using the .ap file. To be more specific,
> when a map task starts, RAgzip reads the .ap file, gets the start position
> and the compression information of a partition of the gzipped file,
> decompresses the partition, and feeds it to the map task as input.
>
> In short, you may use RAgzip by just changing InputFormat to
> RAGZIPInputFormat.
>
> We have made RAgzip in two package types as follows:
>
> 1. jar
>    - does not touch the Hadoop core
>    - solves the zlib version conflict problem by statically linking zlib 1.2.3.
>
> 2. hadoop patch
>    - integrated into Hadoop core
>    - patches ZlibDecompressor.{c,java}: libhadoop.so changes
>    - the version of zlib on the system should be 1.2.2.4 or higher.
>
> What I want to ask is:
>
> How to contribute RAgzip to Hadoop? May I just submit the hadoop patch
> (package 2) to JIRA?
>
> I have read http://wiki.apache.org/hadoop/HowToContribute and changed our
> source code to meet the coding style.
>
> Any comments will be appreciated.
>
> Thank you.
>
> - Daehyun Kim
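
As a rough illustration of the index-driven splitting described above (an index file of restart points, from which the splits for map tasks are derived), here is a minimal Python sketch. Everything named in it is hypothetical: the `AccessPoint` fields, the `plan_splits` helper, and the target split size are assumptions for illustration only. The thread does not describe the actual .ap layout or the internals of RAGZIPInputFormat, so this should not be read as how either patch works.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessPoint:
    """One hypothetical index entry: a position where decoding can restart."""
    compressed_offset: int    # byte offset into the compressed file
    uncompressed_offset: int  # matching offset in the decompressed stream


@dataclass(frozen=True)
class Split:
    """A contiguous range of the compressed file handed to one map task."""
    start: AccessPoint
    compressed_length: int


def plan_splits(points: List[AccessPoint], file_size: int,
                target_split_size: int) -> List[Split]:
    """Group consecutive access points into splits of at least
    target_split_size compressed bytes (the final split may be smaller)."""
    if not points:
        return []
    splits = []
    start = points[0]
    for nxt in points[1:]:
        span = nxt.compressed_offset - start.compressed_offset
        if span >= target_split_size:
            # Close the current split at this access point and begin a new one.
            splits.append(Split(start, span))
            start = nxt
    # Whatever remains up to end-of-file becomes the last split.
    splits.append(Split(start, file_size - start.compressed_offset))
    return splits


if __name__ == "__main__":
    # Made-up index: an access point every 100 compressed bytes.
    index = [AccessPoint(off, off * 6) for off in (0, 100, 200, 300, 400)]
    for split in plan_splits(index, file_size=450, target_split_size=200):
        print(split.start.compressed_offset, split.compressed_length)
```

Because every split begins at an access point, each map task can begin decoding at its own start offset without reading the earlier part of the file. That is also why the index has to be built in a sequential pass beforehand.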