[ https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201955#comment-17201955 ]
ASF GitHub Bot commented on TIKA-3196:
--------------------------------------

PeterAlfredLee opened a new pull request #364:
URL: https://github.com/apache/tika/pull/364

   When reading a zip archive entry that uses STORED together with a Data Descriptor, an UnsupportedZipFeatureException is thrown. We can save the number of entries we have already read, reset the stream, and open the ZipArchiveInputStream again with Data Descriptor allowed. Then we can finish reading the rest of the entries.

   1. I set a limit of 100MB using the variable `MARK_LIMIT`, which is used for `stream.mark`.
   2. `entryCnt` stores the number of entries we have read.
   3. I modified `parseEntry` slightly: nothing is written to `xhtml` if a zip entry uses `STORED` and `Data Descriptor` at the same time. Instead, an exception is thrown, and the stream is `reset` and read a second time.
   4. I have generated a zip archive for testing. It contains 5 entries; the 2nd and 4th entries use `STORED` with `Data Descriptor`. This zip archive is parsed successfully.

   See also [#356](https://github.com/apache/tika/pull/356) and [Commons Compress #137](https://github.com/apache/commons-compress/pull/137) for more information.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor
> -----------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3196
>                 URL: https://issues.apache.org/jira/browse/TIKA-3196
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Trevor Bentley
>            Priority: Major
>         Attachments: OOO-107047-0.oxt-145.zip
>
>
> We are currently using tika for text extraction. Currently some sites are
> returning zips that have entries with stored data descriptors which fail to
> extract due to the ZipArchiveInputStream (in commons-compress) defaulting to
> false for 'allowStoredEntriesWithDataDescriptor'.
> Since ZipArchiveInputStream has support for reading zips with data
> descriptors we should attempt to read the zip with that feature enabled when
> we get a data descriptor UnsupportedZipFeatureException.
> Pull Request:
> [https://github.com/apache/tika/pull/356|https://github.com/apache/tika/pull/355]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
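
The mark, count, reset, and re-read strategy described in the pull request can be sketched roughly as follows. This is not the code from the pull request. It uses only the JDK, so `ToyArchiveReader`, `UnsupportedEntryException`, and the one-entry-per-line "archive" format (where a leading `!` models a STORED entry with a data descriptor) are hypothetical stand-ins for `ZipArchiveInputStream`, `UnsupportedZipFeatureException`, and a real zip file. Only `MARK_LIMIT` and `entryCnt` are names taken from the description above. In real code the second pass would presumably construct the commons-compress stream with `allowStoredEntriesWithDataDescriptor` set to true.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class RetryParseSketch {
    // 100MB mark limit, as described in the pull request.
    static final int MARK_LIMIT = 100 * 1024 * 1024;

    // Hypothetical stand-in for UnsupportedZipFeatureException.
    static class UnsupportedEntryException extends IOException {
        UnsupportedEntryException(String entry) {
            super("unsupported entry: " + entry);
        }
    }

    // Hypothetical stand-in for ZipArchiveInputStream: one entry per line,
    // and a leading '!' models a STORED entry with a data descriptor.
    static class ToyArchiveReader {
        private final Scanner scanner;
        private final boolean allowStoredWithDataDescriptor;

        ToyArchiveReader(InputStream in, boolean allowStoredWithDataDescriptor) {
            this.scanner = new Scanner(in, StandardCharsets.UTF_8.name());
            this.allowStoredWithDataDescriptor = allowStoredWithDataDescriptor;
        }

        // Returns the next entry name, or null at the end of the archive.
        String nextEntry() throws UnsupportedEntryException {
            if (!scanner.hasNextLine()) {
                return null;
            }
            String line = scanner.nextLine();
            if (line.startsWith("!")) {
                if (!allowStoredWithDataDescriptor) {
                    throw new UnsupportedEntryException(line);
                }
                return line.substring(1);
            }
            return line;
        }
    }

    static List<String> parse(InputStream raw) throws IOException {
        InputStream stream = raw.markSupported() ? raw : new BufferedInputStream(raw);
        stream.mark(MARK_LIMIT);          // remember the start of the archive
        List<String> parsed = new ArrayList<>();
        int entryCnt = 0;                 // entries emitted during the first pass
        try {
            ToyArchiveReader strict = new ToyArchiveReader(stream, false);
            String entry;
            while ((entry = strict.nextEntry()) != null) {
                parsed.add(entry);
                entryCnt++;
            }
        } catch (UnsupportedEntryException e) {
            // Nothing was emitted for the failing entry: rewind and re-read leniently.
            stream.reset();
            ToyArchiveReader lenient = new ToyArchiveReader(stream, true);
            String entry;
            int seen = 0;
            while ((entry = lenient.nextEntry()) != null) {
                if (seen++ < entryCnt) {
                    continue;             // already emitted during the first pass
                }
                parsed.add(entry);
            }
        }
        return parsed;
    }

    public static void main(String[] args) throws IOException {
        // Five entries; the 2nd and 4th are flagged, mirroring the test archive.
        byte[] data = "a\n!b\nc\n!d\ne\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(new ByteArrayInputStream(data)));
    }
}
```

One consequence of this design is that the stream must support `mark`/`reset` for up to `MARK_LIMIT` bytes, so an archive whose first problematic entry lies beyond that limit could not be retried this way.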