[jira] [Updated] (MAPREDUCE-491) RAgzip: multiple map tasks for a large gzipped file

Arun C Murthy (JIRA) Wed, 07 Sep 2011 01:39:52 -0700

     [ https://issues.apache.org/jira/browse/MAPREDUCE-491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Arun C Murthy updated MAPREDUCE-491:
------------------------------------

    Affects Version/s:     (was: 0.21.0)
               Status: Open  (was: Patch Available)

Sorry to come in late; the patch has gone stale. Can you please rebase? Thanks.


> RAgzip: multiple map tasks for a large gzipped file
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-491
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Daehyun Kim
>            Assignee: Daehyun Kim
>            Priority: Minor
>         Attachments: HADOOP-4652-v2.patch, HADOOP-4652-v3.patch, 
> HADOOP-4652.path, MAPREDUCE-491.patch
>
>
> Currently, Hadoop processes a gzipped file with only one map task.
> We have made a patch, which we call RAgzip, that enables multiple map tasks 
> for one large gzipped file.
> To process a gzipped file with multiple map tasks, just change the 
> InputFormat to RAGZIPInputFormat.
> The options used by RAGZIPInputFormat are described in its javadoc.
> RAgzip uses zlib's inflatePrime function, which supports random access on a 
> gzipped file.
> Since inflatePrime has been available since zlib 1.2.2.4, RAgzip requires 
> zlib 1.2.2.4 or higher. (We tested with zlib 1.2.3.)
> RAgzip requires a preprocessing step that creates an access point (.ap) 
> file, which acts as an index of the gzipped file's chunks.
> The access point (.ap) file is located in the same directory as the 
> gzipped file.
> For example, for "/user/hadoop/test.gz", the .ap file is created as 
> "/user/hadoop/test.gz.ap".
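
To illustrate the description above, here is a minimal Python sketch of two ideas it mentions: the naming of the access point file, and how an index of access points could let one gzipped file be divided among several map tasks. This is not code from the RAgzip patch. The function names, the representation of access points as sorted compressed-stream byte offsets, and the even grouping strategy are all assumptions made for illustration; the actual .ap file format and the RAGZIPInputFormat split logic are not described in this message.

```python
from typing import List, Tuple


def access_point_path(gz_path: str) -> str:
    """Return the path of the .ap file that sits next to the gzipped file.

    Follows the naming given in the description:
    "/user/hadoop/test.gz" -> "/user/hadoop/test.gz.ap".
    """
    return gz_path + ".ap"


def plan_splits(access_points: List[int],
                file_length: int,
                num_splits: int) -> List[Tuple[int, int]]:
    """Group access points into contiguous (start, end) byte ranges.

    access_points: sorted offsets in the compressed file at which
        decompression can be restarted (hypothetical representation,
        not the real .ap layout).
    file_length: total length of the compressed file in bytes.
    num_splits: desired number of splits; fewer are returned when there
        are fewer access points than requested splits.
    """
    if not access_points or num_splits < 1:
        return []
    n = len(access_points)
    k = min(num_splits, n)
    # Pick k evenly spaced access points as split starts. Because n >= k,
    # the chosen indices are strictly increasing, so no split is empty.
    starts = [access_points[(i * n) // k] for i in range(k)]
    # Each split ends where the next one starts; the last ends at EOF.
    ends = starts[1:] + [file_length]
    return list(zip(starts, ends))
```

Under these assumptions, each returned range begins at an access point, so a map task could start decompressing at its own range without reading the file from the beginning. That is the property the description attributes to the index.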

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
