[ https://issues.apache.org/jira/browse/MAPREDUCE-491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daehyun Kim updated MAPREDUCE-491:
----------------------------------

    Status: Patch Available  (was: In Progress)

> RAgzip: multiple map tasks for a large gzipped file
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-491
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.21.0
>            Reporter: Daehyun Kim
>            Assignee: Daehyun Kim
>            Priority: Minor
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-4652-v2.patch, HADOOP-4652-v3.patch, HADOOP-4652.path, MAPREDUCE-491.patch
>
>
> Currently, Hadoop processes a gzipped file with only one map task.
> We have made a patch that enables multiple map tasks for one large gzipped file. We call the patch RAgzip.
> To run multiple map tasks over a gzipped file, change the InputFormat to RAGZIPInputFormat.
> The options that RAGZIPInputFormat accepts are described in its javadoc.
> RAgzip uses zlib's inflatePrime function, which makes random access into a gzipped file possible.
> Since inflatePrime is available starting with zlib 1.2.2.4, RAgzip requires zlib 1.2.2.4 or higher. (We tested on zlib 1.2.3.)
> RAgzip requires a preprocessing step that creates an access point (.ap) file, which serves as an index of the chunks within the gzipped file.
> The .ap file is stored in the same directory as the gzipped file.
> For example, for "/user/hadoop/test.gz" the .ap file is created as "/user/hadoop/test.gz.ap".
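To illustrate the access point idea described in the issue, here is a minimal Java sketch. It is not code from the RAgzip patch: the class name AccessPointSketch, the method names, and the assumption that the index can be reduced to a sorted list of compressed byte offsets are all hypothetical. The actual .ap file layout and the RAGZIPInputFormat options are defined by the patch and its javadoc. The sketch only shows how the .ap path follows from the gzip path, and how split boundaries could be restricted to access points so that each map task starts at a position where decompression can resume.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not code from the RAgzip patch.
class AccessPointSketch {

    // Derives the index path by appending ".ap", matching the example in the issue.
    static String accessPointPath(String gzipPath) {
        return gzipPath + ".ap";
    }

    // Groups access points into at most desiredSplits contiguous byte ranges.
    // Each range is {start, end} with end exclusive, and every start is an access point.
    // ASSUMPTION: accessOffsets holds compressed byte offsets sorted in ascending order;
    // the real .ap file may store more than this (for example bit offsets or window data).
    static List<long[]> planSplits(long[] accessOffsets, long fileLength, int desiredSplits) {
        List<long[]> splits = new ArrayList<>();
        int n = accessOffsets.length;
        if (n == 0 || desiredSplits <= 0) {
            // Without an index the whole file is a single split, i.e. one map task.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        int count = Math.min(desiredSplits, n);
        for (int i = 0; i < count; i++) {
            // Pick evenly spaced access points as split starts (long math avoids overflow).
            int startIndex = (int) ((long) i * n / count);
            int nextIndex = (int) ((long) (i + 1) * n / count);
            long start = accessOffsets[startIndex];
            long end = (i + 1 == count) ? fileLength : accessOffsets[nextIndex];
            splits.add(new long[] {start, end});
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(accessPointPath("/user/hadoop/test.gz"));
        long[] offsets = {0L, 100L, 250L, 400L, 900L}; // made-up offsets for illustration
        for (long[] split : planSplits(offsets, 1000L, 2)) {
            System.out.println(split[0] + " .. " + split[1]);
        }
    }
}
```

Splitting only at access points is the design constraint that matters here: a gzip stream cannot be decoded from an arbitrary byte, so a split that began between access points would have no usable starting state.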