[ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565167#action_12565167 ]

Ankur commented on HADOOP-1824:
-------------------------------

> ...so it might not be too hard to modify it to read from something else.
Actually, I have already spent a fair amount of time setting things up and 
adding new code (I/O APIs that can be plugged into the unzip code) to make it 
work as follows: a zip file name is passed from Java to a native C call, which 
then uses the unzip APIs to do open/read/seek operations on it.

What is different is that my custom I/O APIs are used to construct the I/O 
function-pointer structure that is passed to the unzip APIs. These custom I/O 
APIs are responsible for making callbacks into Java whenever the unzip APIs 
request an I/O operation through them.
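To make the wiring concrete, here is a minimal sketch of what such a hook-up 
can look like, assuming the classic zlib_filefunc_def / unzOpen2 interface 
from zlib's contrib/minizip; the Java-side object and its read/seek/tell 
method names are hypothetical placeholders, not the actual patch code.

/*
 * Minimal sketch of the hook-up described above (not the actual patch):
 * zlib_filefunc_def, unzOpen2 and the callback typedefs come from zlib's
 * contrib/minizip (ioapi.h / unzip.h); the Java-side object and its
 * read/seek/tell method names are hypothetical placeholders.
 */
#include <stdlib.h>
#include <jni.h>
#include "unzip.h"   /* minizip: unzFile, unzOpen2, zlib_filefunc_def */

/* Per-archive state handed to every callback via the 'opaque' pointer.
 * Caching the JNIEnv is safe only because every callback runs synchronously,
 * on the same thread, inside the native call made from Java. */
typedef struct {
    JNIEnv *env;
    jobject bridge;   /* hypothetical Java object exposing read/seek/tell */
} JavaStream;

static voidpf ZCALLBACK java_open(voidpf opaque, const char *filename, int mode) {
    (void)filename; (void)mode;
    return opaque;    /* the Java side has already opened the stream */
}

static uLong ZCALLBACK java_read(voidpf opaque, voidpf stream, void *buf, uLong size) {
    JavaStream *js = (JavaStream *)stream;
    JNIEnv *env = js->env;
    (void)opaque;
    jclass cls = (*env)->GetObjectClass(env, js->bridge);
    jmethodID mid = (*env)->GetMethodID(env, cls, "read", "([BI)I");  /* hypothetical */
    jbyteArray arr = (*env)->NewByteArray(env, (jsize)size);
    jint n = (*env)->CallIntMethod(env, js->bridge, mid, arr, (jint)size);
    if (n > 0)
        (*env)->GetByteArrayRegion(env, arr, 0, n, (jbyte *)buf);
    (*env)->DeleteLocalRef(env, arr);
    (*env)->DeleteLocalRef(env, cls);
    return n > 0 ? (uLong)n : 0;
}

static long ZCALLBACK java_seek(voidpf opaque, voidpf stream, uLong offset, int origin) {
    JavaStream *js = (JavaStream *)stream;
    JNIEnv *env = js->env;
    (void)opaque;
    jclass cls = (*env)->GetObjectClass(env, js->bridge);
    jmethodID mid = (*env)->GetMethodID(env, cls, "seek", "(JI)I");   /* hypothetical */
    jint rc = (*env)->CallIntMethod(env, js->bridge, mid, (jlong)offset, (jint)origin);
    (*env)->DeleteLocalRef(env, cls);
    return rc;
}

static long ZCALLBACK java_tell(voidpf opaque, voidpf stream) {
    JavaStream *js = (JavaStream *)stream;
    JNIEnv *env = js->env;
    (void)opaque;
    jclass cls = (*env)->GetObjectClass(env, js->bridge);
    jmethodID mid = (*env)->GetMethodID(env, cls, "tell", "()J");     /* hypothetical */
    jlong pos = (*env)->CallLongMethod(env, js->bridge, mid);
    (*env)->DeleteLocalRef(env, cls);
    return (long)pos;
}

static int ZCALLBACK java_close(voidpf opaque, voidpf stream) {
    (void)opaque;
    free(stream);
    return 0;
}

static int ZCALLBACK java_error(voidpf opaque, voidpf stream) {
    (void)opaque; (void)stream;
    return 0;
}

/* Build the I/O function-pointer structure and hand it to the unzip APIs;
 * from here on, every open/read/seek minizip performs goes back into Java. */
unzFile open_zip_via_java(JNIEnv *env, jobject bridge, const char *name) {
    JavaStream *js = malloc(sizeof(JavaStream));
    zlib_filefunc_def io;
    js->env = env;
    js->bridge = bridge;
    io.zopen_file  = java_open;
    io.zread_file  = java_read;
    io.zwrite_file = NULL;        /* unzip never writes */
    io.ztell_file  = java_tell;
    io.zseek_file  = java_seek;
    io.zclose_file = java_close;
    io.zerror_file = java_error;
    io.opaque      = js;
    return unzOpen2(name, &io);
}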

My concern is not that part, but the APIs of unzip.c that sit behind those 
I/O hooks and do all the low-level bit shifting, directory parsing, reading 
and decompression, since it is that part which fails for files larger than 
4 GB. Modifying that part would mean two things.

1. We would be extending the unzip code in minizip to support the ZIP64 
format for our needs, and we would then be required to maintain it.
2. Any modification would require a decent knowledge of the format and would 
need to preserve backward compatibility with the older, non-ZIP64 ZIP format.

So the question here is: do we go ahead and extend the minizip code to 
support the ZIP64 format (which I think would be quite involved), or do we 
stick with the present 4 GB limitation and schedule ZIP64 support for later?
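For reference, here is the part of the on-disk format that makes this a real 
project rather than a tweak: the classic end-of-central-directory record only 
has 16/32-bit fields, and ZIP64 adds a parallel record with 64-bit fields 
that the reader has to detect and fall back to.

/*
 * Field layout per the ZIP application note -- an illustration only, not
 * code from minizip or the patch; on-disk records are packed little-endian,
 * so real code reads them byte by byte rather than via structs like these.
 */
#include <stdint.h>

/* Classic end-of-central-directory record (signature 0x06054b50).
 * Sizes/offsets are 32-bit and entry counts 16-bit, so an archive whose
 * central directory lies beyond 4 GB (or has more than 65535 entries)
 * cannot be described; the fields then hold 0xFFFFFFFF / 0xFFFF sentinels. */
struct eocd {
    uint32_t signature;        /* 0x06054b50 */
    uint16_t disk_number;
    uint16_t cd_start_disk;
    uint16_t entries_this_disk;
    uint16_t entries_total;
    uint32_t cd_size;          /* caps at 4 GB - 1 */
    uint32_t cd_offset;        /* caps at 4 GB - 1 */
    uint16_t comment_length;
};

/* ZIP64 end-of-central-directory record (signature 0x06064b50): the same
 * information with 64-bit counts, sizes and offsets.  A small ZIP64 EOCD
 * locator record (signature 0x07064b50), placed just before the classic
 * record, tells the reader where to find it. */
struct zip64_eocd {
    uint32_t signature;        /* 0x06064b50 */
    uint64_t record_size;      /* size of the rest of this record */
    uint16_t version_made_by;
    uint16_t version_needed;
    uint32_t disk_number;
    uint32_t cd_start_disk;
    uint64_t entries_this_disk;
    uint64_t entries_total;
    uint64_t cd_size;
    uint64_t cd_offset;
};

Extending the reader essentially means: when the classic record holds the 
0xFFFF / 0xFFFFFFFF sentinels, locate and parse these wider records (plus the 
per-entry ZIP64 extra field in the central directory), while still handling 
plain pre-ZIP64 archives unchanged, which is the backward-compatibility 
concern in point 2 above.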

> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Doug Cutting
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack 
> many small files into large, compressed, archives.  But, for efficient 
> map-reduce operation, it is desirable to be able to split inputs into 
> smaller chunks, with one or more small original files per split.  The zip 
> format, unlike tar, permits enumeration of files in the archive without 
> scanning the entire archive.  Thus a zip InputFormat could efficiently permit 
> splitting large archives into splits that contain one or more archived files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
