[
https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12562183#action_12562183
]
Doug Cutting commented on HADOOP-1824:
--------------------------------------
Some comments:
- isSplitable throws an exception when an empty zip archive is passed.
Instead, an empty zip file should just provide no keys and values, but not
throw exceptions.
- in getSplits, there's no need to explicitly test that each file exists.
Instead, we can rely on open() throwing an exception if a file does not exist.
- getRecordReader should not loop calling getNextEntry(), but instead just call
getEntry(String).
Oh, wait. On that last point, it looks like getEntry() is only available on
ZipFile, and we cannot create a ZipFile except from a File. With an InputStream
we must use ZipInputStream, which does not support getEntry(), since
InputStream doesn't support random access. Sigh. This considerably reduces
the utility of this InputFormat. GNU Classpath's implementation of
java.util.zip.ZipFile uses a RandomAccessFile, which we could implement, but,
alas, we can't use GNU's code at Apache because it is under the GPL.
Zlib includes a zip file parser (minizip) that's under a BSD-like license and
that permits random access to zip file entries from a user-supplied input
stream. So we could do it in C. Sigh.
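
For illustration only, here is a minimal sketch of what the stream-only
lookup amounts to with the JDK's java.util.zip classes. It is not taken from
the attached patch; the class name ZipStreamLookup and its two helpers are
hypothetical, and a real record reader would wrap an HDFS input stream rather
than an in-memory byte array. The point it shows is that locating one entry
from an InputStream means walking getNextEntry() from the start of the
archive until the name matches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of Hadoop or of the patch.
public class ZipStreamLookup {

    /**
     * Scans the stream sequentially for the named entry.
     * Returns the entry's bytes, or null if no entry has that name.
     */
    public static byte[] readEntry(InputStream in, String name) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            // No getEntry(String) on a stream: every earlier entry is visited first.
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[4096];
                    int n;
                    while ((n = zin.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    return out.toByteArray();
                }
            }
        }
        return null;
    }

    /** Builds a small in-memory archive from name/content pairs (demo only). */
    public static byte[] buildZip(String... namesAndContents) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            for (int i = 0; i + 1 < namesAndContents.length; i += 2) {
                zout.putNextEntry(new ZipEntry(namesAndContents[i]));
                zout.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = buildZip("a.txt", "alpha", "b.txt", "beta");
        byte[] found = readEntry(new ByteArrayInputStream(zip), "b.txt");
        System.out.println(new String(found, StandardCharsets.UTF_8));
    }
}
```

The cost of readEntry grows with the entry's position in the archive, which
is why a split that starts late in a large zip would be expensive to open
this way.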
> want InputFormat for zip files
> ------------------------------
>
> Key: HADOOP-1824
> URL: https://issues.apache.org/jira/browse/HADOOP-1824
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Affects Versions: 0.15.2
> Reporter: Doug Cutting
> Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files. Thus one might pack
> many small files into large, compressed archives. But, for efficient
> map-reduce operation, it is desirable to be able to split inputs into
> smaller chunks, with one or more small original files per split. The zip
> format, unlike tar, permits enumeration of files in the archive without
> scanning the entire archive. Thus a zip InputFormat could efficiently permit
> splitting large archives into splits that contain one or more archived files.