[
https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565661#action_12565661
]
Ankur commented on HADOOP-1824:
-------------------------------
Thanks for clarifying :-). But even UnZip in its current release, 5.52, does not
serve our purpose of supporting large files (> 4 GB), since it does not handle
the extra headers that the Zip64 format uses specifically to support large
archives.
This is stated clearly in the Info-ZIP FAQ:
http://www.info-zip.org/FAQ.html#limits
Here is an excerpt from that page:
"Also note that in August 2001, PKWARE released PKZIP 4.50 with support for
large files and archives via a pair of new header types, "PK\x06\x06" and
"PK\x06\x07". So far these headers are undocumented, but most of their fields
are fairly obvious. We don't yet know when Zip and UnZip will support this
extension to the format. In the short term, it is possible to improve Zip and
UnZip's capabilities slightly on certain Linux systems (and probably other
Unix-like systems) by recompiling with the -DLARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64 options. This will allow the utilities to handle
uncompressed data files greater than 2 GB in size, as long as the total size of
the archive containing them is less than 2 GB."
=======================================================
This leaves us with few options. Either we find another library that
implements the Zip64 extension under a license that lets us include it in
our code, or we implement the extension ourselves in the Minizip code, which we
would then have to test extensively and maintain. Sigh.
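For anyone who wants to check whether a given archive actually relies on the
two headers mentioned in the excerpt, here is a minimal sketch in Python. It
assumes the record layout that PKWARE later published in its APPNOTE: the
"PK\x06\x07" locator is a fixed 20-byte record sitting immediately before the
classic "PK\x05\x06" end-of-central-directory record, and it stores the offset
of the "PK\x06\x06" record. The function name uses_zip64 and the constants are
my own, not part of UnZip or Minizip, and the sketch ignores multi-disk
archives and archive comments that happen to contain a signature.

```python
import io
import struct

EOCD_SIG = b"PK\x05\x06"           # classic end-of-central-directory record
ZIP64_EOCD_SIG = b"PK\x06\x06"     # Zip64 end-of-central-directory record
ZIP64_LOCATOR_SIG = b"PK\x06\x07"  # Zip64 end-of-central-directory locator
ZIP64_LOCATOR_LEN = 20             # signature, disk no., 8-byte offset, disk count
# Enough of the tail to cover the locator, the EOCD and a maximal comment.
MAX_TAIL = ZIP64_LOCATOR_LEN + 22 + 65535


def uses_zip64(f):
    """Return True if the seekable binary file f ends with Zip64 records."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    tail_len = min(size, MAX_TAIL)
    f.seek(size - tail_len)
    tail = f.read(tail_len)

    # Find the last classic EOCD record; the locator, if any, precedes it.
    eocd = tail.rfind(EOCD_SIG)
    if eocd < ZIP64_LOCATOR_LEN:
        return False
    locator = tail[eocd - ZIP64_LOCATOR_LEN:eocd]
    if not locator.startswith(ZIP64_LOCATOR_SIG):
        return False

    # The locator carries the absolute offset of the Zip64 EOCD record.
    _sig, _disk, zip64_eocd_offset, _disks = struct.unpack("<4sIQI", locator)
    f.seek(zip64_eocd_offset)
    return f.read(4) == ZIP64_EOCD_SIG
```

Only the tail of the file is read, so the check stays cheap even for archives
well over 4 GB. A tool without Zip64 support would only see the placeholder
values in the classic record, which is why it cannot read such archives.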
> want InputFormat for zip files
> ------------------------------
>
> Key: HADOOP-1824
> URL: https://issues.apache.org/jira/browse/HADOOP-1824
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Affects Versions: 0.15.2
> Reporter: Doug Cutting
> Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files. Thus one might pack
> many small files into large, compressed archives. But, for efficient
> map-reduce operation, it is desirable to be able to split inputs into
> smaller chunks, with one or more small original files per split. The zip
> format, unlike tar, permits enumeration of files in the archive without
> scanning the entire archive. Thus a zip InputFormat could efficiently permit
> splitting large archives into splits that contain one or more archived files.
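
The splitting idea in the description above can be sketched as follows. This
is an illustration in Python rather than the attached patch or a Hadoop
InputFormat: it reads only the central directory through the standard zipfile
module and groups whole entries into splits of roughly a target compressed
size. The function name plan_splits and the target_bytes parameter are
hypothetical.

```python
import io
import zipfile


def plan_splits(zf, target_bytes):
    """Group archive entries into lists of names, one list per split.

    Only the central directory is consulted (via infolist), so no entry
    data is decompressed or scanned while planning.
    """
    splits, current, current_size = [], [], 0
    for info in zf.infolist():
        if info.is_dir():
            continue
        # Start a new split when adding this entry would exceed the target.
        if current and current_size + info.compress_size > target_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(info.filename)
        current_size += info.compress_size
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    # Build a small in-memory archive of stored (uncompressed) entries.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for name in ("a.txt", "b.txt", "c.txt", "d.txt"):
            zf.writestr(name, b"0123456789")
    with zipfile.ZipFile(buf) as zf:
        print(plan_splits(zf, target_bytes=25))
```

An entry is never divided between splits, so a single entry larger than the
target simply becomes a split of its own. Each map task would then open the
archive and read only the entries named in its split.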