[jira] Commented: (HADOOP-1824) want InputFormat for zip files

Ankur (JIRA) Tue, 05 Feb 2008 04:14:29 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565723#action_12565723
 ]


Ankur commented on HADOOP-1824:
-------------------------------

Or, as another option, we can have our own implementation of ZipInputStream purely in 
Java (no native code), based on Sun's java.util.zip.ZipInputStream with 
some additions and modifications to:

1. Work with a Seekable stream (like FSDataInputStream).
2. Read only the central directory structure to obtain file information, instead of 
   sequentially reading the whole archive (as Sun's implementation does).
3. Make sure Zip64 headers are processed correctly.
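
To make points 1 and 2 concrete, here is a minimal sketch (not the proposed 
patch) of listing entries by seeking to the end of the archive and reading only 
the central directory. The class name CentralDirectoryReader and its Entry type 
are hypothetical, java.io.RandomAccessFile stands in for a seekable stream such 
as FSDataInputStream, and Zip64 records (point 3) are detected but deliberately 
not parsed. File names are decoded as UTF-8, which is an assumption that holds 
for ASCII names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists zip entries from the central directory only. */
class CentralDirectoryReader {
    static final int EOCD_SIG = 0x06054b50; // end of central directory record
    static final int CEN_SIG = 0x02014b50;  // central directory file header
    static final int EOCD_MIN = 22;         // EOCD size with an empty comment
    static final int CEN_FIXED = 46;        // fixed-length part of a central header

    /** The fields a split planner would need for one archived file. */
    static final class Entry {
        final String name;
        final long compressedSize;
        final long localHeaderOffset;

        Entry(String name, long compressedSize, long localHeaderOffset) {
            this.name = name;
            this.compressedSize = compressedSize;
            this.localHeaderOffset = localHeaderOffset;
        }
    }

    static List<Entry> list(RandomAccessFile in) throws IOException {
        long len = in.length();
        // The EOCD record sits within the last 22 + 65535 bytes of the archive.
        int tailLen = (int) Math.min(len, EOCD_MIN + 0xFFFF);
        if (tailLen < EOCD_MIN) {
            throw new IOException("too short to be a zip archive");
        }
        byte[] tail = new byte[tailLen];
        in.seek(len - tailLen);
        in.readFully(tail);
        ByteBuffer t = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);

        // Scan backwards for the EOCD signature.
        int eocd = -1;
        for (int i = tailLen - EOCD_MIN; i >= 0; i--) {
            if (t.getInt(i) == EOCD_SIG) {
                eocd = i;
                break;
            }
        }
        if (eocd < 0) {
            throw new IOException("end of central directory record not found");
        }
        int count = t.getShort(eocd + 10) & 0xFFFF;         // total entries
        long cenSize = t.getInt(eocd + 12) & 0xFFFFFFFFL;   // directory size
        long cenOffset = t.getInt(eocd + 16) & 0xFFFFFFFFL; // directory offset
        if (count == 0xFFFF || cenSize == 0xFFFFFFFFL || cenOffset == 0xFFFFFFFFL) {
            throw new IOException("Zip64 archive: not handled by this sketch");
        }
        if (cenSize > Integer.MAX_VALUE) {
            throw new IOException("central directory too large for this sketch");
        }

        // One seek, one read: the whole central directory, no entry data.
        byte[] cen = new byte[(int) cenSize];
        in.seek(cenOffset);
        in.readFully(cen);
        ByteBuffer c = ByteBuffer.wrap(cen).order(ByteOrder.LITTLE_ENDIAN);

        List<Entry> entries = new ArrayList<>(count);
        int pos = 0;
        for (int i = 0; i < count; i++) {
            if (c.getInt(pos) != CEN_SIG) {
                throw new IOException("bad central directory header at " + pos);
            }
            long compressedSize = c.getInt(pos + 20) & 0xFFFFFFFFL;
            int nameLen = c.getShort(pos + 28) & 0xFFFF;
            int extraLen = c.getShort(pos + 30) & 0xFFFF;
            int commentLen = c.getShort(pos + 32) & 0xFFFF;
            long localHeaderOffset = c.getInt(pos + 42) & 0xFFFFFFFFL;
            String name = new String(cen, pos + CEN_FIXED, nameLen, StandardCharsets.UTF_8);
            entries.add(new Entry(name, compressedSize, localHeaderOffset));
            pos += CEN_FIXED + nameLen + extraLen + commentLen;
        }
        return entries;
    }
}
```

Each Entry carries the offset of its local header, so a reader could later seek 
straight to one archived file without touching the others. A full implementation 
would also need to follow the Zip64 end-of-central-directory locator and the 
Zip64 extra fields rather than rejecting such archives.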

This way we will have the following advantages.

1. A pure Java Zip stream parser supporting Zip64 format (No native code).
2. Support for Random as well as Sequential access.
3. No dependency on any external components.
4. Ease of modification for adding append when HDFS provides this facility.
5. Possibility of donating our parser as a Zip64-compliant Java zip parser to 
open source in the future.

The above of course requires a lot of work, but looking at the advantages I feel 
it's worth it.

Your opinion?


> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Doug Cutting
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack 
> many small files into large, compressed, archives.  But, for efficient 
> map-reduce operation, it is desirable to be able to split inputs into 
> smaller chunks, with one or more small original files per split.  The zip 
> format, unlike tar, permits enumeration of files in the archive without 
> scanning the entire archive.  Thus a zip InputFormat could efficiently permit 
> splitting large archives into splits that contain one or more archived files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
