[jira] Commented: (HADOOP-1824) want InputFormat for zip files

Doug Cutting (JIRA) Thu, 24 Jan 2008 11:46:55 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12562183#action_12562183
 ]


Doug Cutting commented on HADOOP-1824:
--------------------------------------

Some comments:
- isSplittable throws an exception when an empty zip archive is passed.  
Instead, an empty zip file should simply yield no keys or values, without 
throwing an exception.
- in getSplits, there's no need to explicitly test that each file exists.  
Instead, we can rely on open() throwing an exception if a file does not exist.
- getRecordReader should not loop calling getNextEntry(), but instead just call 
getEntry(String).

Oh, wait.  On that last point, it looks like getEntry() is only available on 
ZipFile, and we cannot create a ZipFile except from a File.  With an InputStream 
we must use ZipInputStream, which does not support getEntry(), since 
InputStream doesn't support random access.  Sigh.  This considerably reduces 
the utility of this InputFormat.  GNU Classpath's implementation of 
java.util.zip.ZipFile uses a RandomAccessFile, which we could implement, but, 
alas, we can't use GNU's code at Apache because it is under the GPL.
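
To make the limitation concrete, here is a minimal sketch (not the code from the 
attached patch) of what looking up one entry costs when all we have is an 
InputStream: every preceding entry has to be stepped over with getNextEntry().  
The class and method names (ZipStreamLookup, readEntry, sampleArchive) are 
hypothetical and exist only for this illustration; the archive is built in 
memory so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipStreamLookup {

    /**
     * Reads the named entry from a zip stream by scanning entries in order.
     * Returns null if no entry with that name is found.
     */
    public static byte[] readEntry(InputStream in, String name) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // No random access: each call skips the rest of the previous entry.
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[4096];
                    int n;
                    while ((n = zip.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    return out.toByteArray();
                }
            }
        }
        return null;
    }

    /** Builds a small three-entry archive in memory (illustrative data only). */
    public static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"a.txt", "b.txt", "c.txt"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("contents of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = readEntry(new ByteArrayInputStream(sampleArchive()), "c.txt");
        System.out.println(new String(data, StandardCharsets.UTF_8));
    }
}
```

Reaching c.txt here means passing over a.txt and b.txt first, so a record reader 
built this way does work proportional to the entry's position in the archive.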

Zlib includes a zip file parser (minizip) that's under a BSD-like license and 
that permits random access to zip file entries from a user-supplied input 
stream.  So we could do it in C.  Sigh.


> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Doug Cutting
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack 
> many small files into large, compressed archives.  But, for efficient 
> map-reduce operation, it is desirable to be able to split inputs into 
> smaller chunks, with one or more small original files per split.  The zip 
> format, unlike tar, permits enumeration of files in the archive without 
> scanning the entire archive.  Thus a zip InputFormat could efficiently permit 
> splitting large archives into splits that contain one or more archived files.
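
As a rough illustration of the idea in the description above, the sketch below 
enumerates entries from the archive's directory and groups them into splits by 
compressed size, without inflating any entry data.  It is an assumption-laden 
sketch, not the attached patch: ZipSplitPlanner, planSplits, sampleArchive and 
the targetBytes parameter are hypothetical names, and it relies on 
java.util.zip.ZipFile, which requires a local File and so does not apply 
directly to a file stored in HDFS.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipSplitPlanner {

    /**
     * Groups entry names into splits of roughly targetBytes of compressed data.
     * Each split holds at least one entry; an archive with no entries yields no splits.
     */
    public static List<List<String>> planSplits(Path archive, long targetBytes)
            throws IOException {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // getCompressedSize() returns -1 when unknown; treat that as 0.
                long size = Math.max(0, entry.getCompressedSize());
                // Close the current split if adding this entry would exceed the target.
                if (!current.isEmpty() && currentBytes + size > targetBytes) {
                    splits.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(entry.getName());
                currentBytes += size;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    /** Writes a small three-entry archive to a temporary file (illustrative data only). */
    public static Path sampleArchive() throws IOException {
        Path path = Files.createTempFile("sample", ".zip");
        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(path))) {
            for (String name : new String[] {"a.txt", "b.txt", "c.txt"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("contents of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return path;
    }

    public static void main(String[] args) throws IOException {
        Path archive = sampleArchive();
        System.out.println(planSplits(archive, 1L).size() + " splits with a 1-byte target");
        Files.delete(archive);
    }
}
```

With a very small target every entry lands in its own split, and with a very 
large one the whole archive becomes a single split.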

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
