[ 
https://issues.apache.org/jira/browse/TIKA-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200071#comment-17200071
 ] 

Tim Allison commented on TIKA-3196:
-----------------------------------

If only we had some way of finding files that trigger this problem... oh, wait:

https://corpora.tika.apache.org/datasette/corpora-metadata?sql=select%0D%0A++file_path%2C%0D%0A++c.length%2C%0D%0A++e.ID%2C%0D%0A++ORIG_STACK_TRACE%2C%0D%0A++SORT_STACK_TRACE%2C%0D%0A++PARSE_EXCEPTION_ID%0D%0Afrom%0D%0A++PARSE_EXCEPTIONS+e%0D%0A++join+profiles+p+on+e.id%3Dp.id%0D%0A++join+containers+c+on+p.container_id%3Dc.container_id%0D%0Awhere%0D%0A++orig_stack_trace+like+%27%25data+descriptor%25%27%0D%0Aorder+by%0D%0A++c.length+asc%0D%0Alimit%0D%0A++101
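
The link above is hard to read because the SQL is form-encoded into the "sql" query parameter. A minimal sketch for pulling the query back out is below; the class and method names are made up for illustration, and it assumes Java 10+ for URLDecoder.decode(String, Charset).

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or datasette.
public class CorporaQueryDecoder {

    /** Returns the decoded value of the "sql" query parameter, or null if there is none. */
    public static String extractSql(String link) {
        String rawQuery = URI.create(link).getRawQuery();
        if (rawQuery == null) {
            return null;
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.startsWith("sql=")) {
                // Form decoding: '+' becomes a space, %XX becomes the encoded byte.
                return URLDecoder.decode(pair.substring(4), StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The link from this comment, split only to keep the lines short.
        String link = "https://corpora.tika.apache.org/datasette/corpora-metadata?sql=select%0D%0A"
                + "++file_path%2C%0D%0A++c.length%2C%0D%0A++e.ID%2C%0D%0A"
                + "++ORIG_STACK_TRACE%2C%0D%0A++SORT_STACK_TRACE%2C%0D%0A"
                + "++PARSE_EXCEPTION_ID%0D%0Afrom%0D%0A++PARSE_EXCEPTIONS+e%0D%0A"
                + "++join+profiles+p+on+e.id%3Dp.id%0D%0A"
                + "++join+containers+c+on+p.container_id%3Dc.container_id%0D%0A"
                + "where%0D%0A++orig_stack_trace+like+%27%25data+descriptor%25%27%0D%0A"
                + "order+by%0D%0A++c.length+asc%0D%0Alimit%0D%0A++101";
        System.out.println(extractSql(link));
    }
}
```

Decoded, the query selects file_path, c.length, e.ID and the stack trace columns from PARSE_EXCEPTIONS joined to profiles and containers, keeps rows whose orig_stack_trace is like '%data descriptor%', orders them by c.length ascending, and limits the result to 101 rows, so the smallest triggering files come first.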

> PackageParser should attempt to parse entries from zip files with STORED 
> entries with data descriptor
> -----------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3196
>                 URL: https://issues.apache.org/jira/browse/TIKA-3196
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Trevor Bentley
>            Priority: Major
>
> We are currently using Tika for text extraction. Some sites return zips 
> containing STORED entries with data descriptors, which fail to extract 
> because ZipArchiveInputStream (in commons-compress) defaults 
> 'allowStoredEntriesWithDataDescriptor' to false.
> Since ZipArchiveInputStream supports reading such entries, we should retry 
> the zip with that feature enabled when we get a data descriptor 
> UnsupportedZipFeatureException.
> Pull Request: 
> [https://github.com/apache/tika/pull/356|https://github.com/apache/tika/pull/355]
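
For readers unfamiliar with the condition in the description: in the ZIP format, an entry "uses a data descriptor" when bit 3 of the general purpose flag in its local file header is set, and it is STORED when its compression method is 0. The sketch below checks the first bytes of a local file header for that combination using only the JDK. It is an illustration of the condition, not the code from the pull request or from commons-compress; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of Tika or commons-compress.
public class StoredDataDescriptorCheck {
    private static final int LOCAL_FILE_HEADER_SIGNATURE = 0x04034b50; // "PK\3\4", little-endian
    private static final int DATA_DESCRIPTOR_FLAG = 0x0008;            // general purpose bit 3
    private static final int METHOD_STORED = 0;

    /**
     * Returns true if the bytes start with a local file header whose entry is
     * STORED and announces a trailing data descriptor.
     */
    public static boolean isStoredWithDataDescriptor(byte[] header) {
        if (header.length < 10) {
            return false; // need signature (4) + version (2) + flags (2) + method (2)
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LOCAL_FILE_HEADER_SIGNATURE) {
            return false;
        }
        int flags = buf.getShort(6) & 0xFFFF;  // general purpose bit flag
        int method = buf.getShort(8) & 0xFFFF; // compression method
        return method == METHOD_STORED && (flags & DATA_DESCRIPTOR_FLAG) != 0;
    }

    public static void main(String[] args) {
        // Hand-built header prefix: signature, version 20, flags 0x0008, method 0 (STORED).
        byte[] sample = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x08, 0x00, 0x00, 0x00};
        System.out.println(isStoredWithDataDescriptor(sample));
    }
}
```

Such entries are awkward for a streaming reader because the sizes in the local header are not filled in, so the end of the uncompressed data has to be found by scanning for the descriptor. That is presumably why the option named in the description is off by default.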



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
