[ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644691#action_12644691 ]
Steve Loughran commented on HADOOP-1824:
----------------------------------------

The most tested/stable Apache-licensed Java unzip code is in Ant's codebase; you can either take or fork that, or try to get your changes merged back into Ant. With suitable tests, I am sure they will be happily accepted.

> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Doug Cutting
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files. Thus one might pack
> many small files into large, compressed archives. But, for efficient
> map-reduce operation, it is desirable to be able to split inputs into
> smaller chunks, with one or more small original files per split. The zip
> format, unlike tar, permits enumeration of files in the archive without
> scanning the entire archive. Thus a zip InputFormat could efficiently permit
> splitting large archives into splits that contain one or more archived files.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
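
As a rough illustration of the enumeration point in the issue description, the sketch below lists the entries of a zip archive from its directory and groups them into batches of roughly a target compressed size, without decompressing any entry data. It is only a sketch under stated assumptions: it uses the JDK's java.util.zip.ZipFile on a local file rather than Ant's unzip code or HDFS, the class name ZipEntryGrouper and the method groupEntries are hypothetical, and it is not taken from the attached ZipInputFormat_fixed.patch.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper for illustration; not part of Hadoop or the attached patch. */
public class ZipEntryGrouper {

    /**
     * Groups the entry names of a local zip archive into batches whose
     * compressed sizes sum to roughly targetBytes. Only the archive's
     * directory is consulted; no entry data is decompressed.
     */
    public static List<List<String>> groupEntries(File zip, long targetBytes)
            throws IOException {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;

        try (ZipFile zf = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directory entries carry no file data
                }
                // getCompressedSize() returns -1 when the size is unknown.
                long size = Math.max(entry.getCompressedSize(), 0);

                // Start a new batch once adding this entry would exceed the target.
                if (!current.isEmpty() && currentBytes + size > targetBytes) {
                    groups.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(entry.getName());
                currentBytes += size;
            }
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Each returned batch always contains at least one whole archived file, matching the "one or more archived files per split" idea above. A real InputFormat would additionally need to record where each entry's data sits in the archive so that a map task can seek to it.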