[ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565661#action_12565661 ]

Ankur commented on HADOOP-1824:
-------------------------------

Thanks for clarifying :-). But even UnZip in its current release, 5.52, does 
not serve our purpose of supporting large files (> 4 GB), since it does not 
handle the extra Zip64 headers that exist specifically to support large 
archives.

This is well documented in the Info-ZIP FAQ, 
http://www.info-zip.org/FAQ.html#limits. Here is an excerpt from that page:

"Also note that in August 2001, PKWARE released PKZIP 4.50 with support for 
large files and archives via a pair of new header types, "PK\x06\x06" and 
"PK\x06\x07". So far these headers are undocumented, but most of their fields 
are fairly obvious. We don't yet know when Zip and UnZip will support this 
extension to the format. In the short term, it is possible to improve Zip and 
UnZip's capabilities slightly on certain Linux systems (and probably other 
Unix-like systems) by recompiling with the -DLARGEFILE_SOURCE 
-D_FILE_OFFSET_BITS=64  options. This will allow the utilities to handle 
uncompressed data files greater than 2 GB in size, as long as the total size of 
the archive containing them is less than 2 GB."

             =======================================================

This leaves us with few options: either we find something else that 
implements the Zip64 extension and is licensed so that we can include it in 
our code, or we implement these extensions in the Minizip code ourselves, 
which we would then have to test extensively and maintain. Sigh.
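For reference, the Zip64 records the excerpt mentions are easy to spot even 
without a full parser. A rough Java sketch (my own illustration, not part of 
any attached patch; the class and method names are made up) that probes the 
tail of an archive for the "PK\x06\x07" end-of-central-directory locator:

import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Heuristic check for the Zip64 extension: scans the tail of an archive for
 * the "PK\x06\x07" end-of-central-directory locator signature. A stray match
 * inside a long archive comment would be a false positive, so this is only
 * a sketch, not a real parser.
 */
public class Zip64Probe {

  public static boolean looksLikeZip64(String path) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(path, "r");
    try {
      // The regular end-of-central-directory record sits in the last
      // ~64 KB of the file (22 bytes plus up to 65535 bytes of comment);
      // the Zip64 locator, if present, immediately precedes it.
      long tailLen = Math.min(raf.length(), 20 + 22 + 65535);
      byte[] tail = new byte[(int) tailLen];
      raf.seek(raf.length() - tailLen);
      raf.readFully(tail);
      for (int i = tail.length - 4; i >= 0; i--) {
        if (tail[i] == 'P' && tail[i + 1] == 'K'
            && tail[i + 2] == 0x06 && tail[i + 3] == 0x07) {
          return true;   // Zip64 end-of-central-directory locator found
        }
      }
      return false;
    } finally {
      raf.close();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(args[0] + ": zip64=" + looksLikeZip64(args[0]));
  }
}

A real reader would of course also have to parse the Zip64 
end-of-central-directory record ("PK\x06\x06") that the locator points to, 
since that is where the 64-bit sizes and offsets actually live.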


> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Doug Cutting
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack 
> many small files into large, compressed, archives.  But, for efficient 
> map-reduce operation, it is desirable to be able to split inputs into 
> smaller chunks, with one or more small original files per split.  The zip 
> format, unlike tar, permits enumeration of files in the archive without 
> scanning the entire archive.  Thus a zip InputFormat could efficiently permit 
> splitting large archives into splits that contain one or more archived files.
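To make the enumeration point above concrete: java.util.zip.ZipFile reads the 
archive's central directory when it is opened, so the member files can be 
listed without scanning or decompressing the archive body. A small 
illustrative sketch (not the attached ZipInputFormat_fixed.patch):

import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Lists archive members from the zip central directory, without scanning or
// decompressing the file data -- the property that lets an InputFormat map
// entries to splits cheaply.
public class ListZipEntries {
  public static void main(String[] args) throws IOException {
    ZipFile zip = new ZipFile(args[0]);
    try {
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        ZipEntry e = entries.nextElement();
        System.out.println(e.getName()
            + "\tcompressed=" + e.getCompressedSize()
            + "\tsize=" + e.getSize());
      }
    } finally {
      zip.close();
    }
  }
}

Note this only demonstrates central-directory enumeration; as far as I know 
the java.util.zip classes of this era do not handle Zip64 archives either, so 
they do not help with the > 4 GB problem discussed above.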

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
