Daehyun,

Is there a RAgzip output format that produces the .ap (access points) file while writing out the gzipped file on HDFS? That would eliminate the preprocessing stage for gzipped files.

Aside from that, I think it will be a great addition to Hadoop.

- milind

On 11/5/08 5:06 AM, "김대현[로그모델링]" <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I'm new to this mailing list, and this is my first attempt at contributing.
>
> We have made a patch that enables multiple map tasks for one large *gzipped*
> file. We call the patch RAgzip, which is short for Random Access gzip. It is
> similar to HADOOP-3646, which supports large bzip2 files, and is an
> alternative to the approach in PIG-42, which requires re-compression.
>
> RAgzip uses zlib's inflatePrime function, which supports random access on a
> gzipped file. Since inflatePrime is available from zlib 1.2.2.4 onward,
> RAgzip requires zlib 1.2.2.4 or higher. (We tested with zlib 1.2.3.)
>
> RAgzip requires a preprocessing step that creates an access point (.ap)
> file, which acts as an index of the gzipped file's chunks. (Unfortunately,
> the preprocessing step appears to be sequential; we could not find a way to
> parallelize it.)
>
> RAgzip splits the gzipped file using the .ap file. More specifically, when a
> map task starts, RAgzip reads the .ap file, gets the start position and the
> compression information for one partition of the gzipped file, decompresses
> that partition, and feeds it to the map task as input.
>
> In short, you can use RAgzip just by changing the InputFormat to
> RAGZIPInputFormat.
>
> We have made RAgzip available in two package types:
>
> 1. jar
>    - does not touch the Hadoop core
>    - solves the zlib version conflict problem by statically linking zlib 1.2.3
>
> 2. hadoop patch
>    - integrated into the Hadoop core
>    - patches ZlibDecompressor.{c,java}, so libhadoop.so changes
>    - the version of zlib on the system must be 1.2.2.4 or higher
>
> What I want to ask is:
>
> How do I contribute RAgzip to Hadoop? May I just submit the hadoop patch
> (package 2) to JIRA?
>
> I have read http://wiki.apache.org/hadoop/HowToContribute and changed our
> source code to meet the coding style.
>
> Any comments will be appreciated.
>
> Thank you.
>
> - Daehyun Kim

--
Milind Bhandarkar
Y!IM: GridSolutions
408-349-2136
([EMAIL PROTECTED])
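
The question at the top of this thread, whether access points could be recorded while the gzipped file is being written, can be illustrated with a small sketch. It is not RAgzip's method: RAgzip indexes existing gzip files with zlib's inflatePrime, which Python's zlib module does not expose. The sketch swaps in a different technique, a Z_FULL_FLUSH at each chunk boundary, which byte-aligns the stream and resets the compression dictionary so that a raw inflater can start there. The function names and the (compressed offset, uncompressed offset) tuple layout are hypothetical; the real .ap format is not described in this thread.

```python
import zlib


def write_gzip_with_access_points(chunks):
    """Compress chunks into one gzip stream and record access points.

    Returns (gzip_bytes, access_points). Each access point is a
    hypothetical (compressed_offset, uncompressed_offset) pair marking
    where a chunk after the first begins.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)  # wbits=31 selects the gzip wrapper
    out = bytearray()
    access_points = []
    uncompressed = 0
    for i, chunk in enumerate(chunks):
        if i > 0:
            # A full flush ends the current block on a byte boundary and
            # resets the dictionary, so decoding can restart right here.
            out += comp.flush(zlib.Z_FULL_FLUSH)
            access_points.append((len(out), uncompressed))
        out += comp.compress(chunk)
        uncompressed += len(chunk)
    out += comp.flush()  # finish the stream and write the gzip trailer
    return bytes(out), access_points


def read_from_access_point(gz_bytes, compressed_offset, length):
    """Decompress up to `length` bytes starting at an access point."""
    inflater = zlib.decompressobj(-15)  # raw deflate: no gzip header expected
    return inflater.decompress(gz_bytes[compressed_offset:], length)


if __name__ == "__main__":
    parts = [b"alpha " * 500, b"beta " * 500, b"gamma " * 500]
    gz, points = write_gzip_with_access_points(parts)
    # The output is still an ordinary gzip stream that any reader can decode.
    assert zlib.decompress(gz, 31) == b"".join(parts)
    for (c_off, u_off), part in zip(points, parts[1:]):
        assert read_from_access_point(gz, c_off, len(part)) == part
        print("access point at compressed byte", c_off, "uncompressed byte", u_off)
```

The trade-off in this variant is that each full flush costs a little compression ratio, and it only helps for files written this way; indexing gzip files that already exist still needs a sequential pass like the preprocessing step described in the quoted message.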