[ 
https://issues.apache.org/jira/browse/TIKA-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18066329#comment-18066329
 ] 

Tim Allison commented on TIKA-4563:
-----------------------------------

We're still missing zip content in S4P6, B7TH, QMXI. These are all truncated 
zips. The issue is that we switched from streaming to random access reading of 
the central directory in zips. This is more robust for non-truncated zips, but 
there's a problem.

When loading a zip as a file, Commons Compress reads backward from the end of 
the file looking for an EOCD (end of central directory) record, which points 
to the central directory. These three truncated files have neither an EOCD 
nor a central directory, but Commons Compress finds bytes matching the EOCD 
signature inside a compressed stream. Even though that false EOCD doesn't 
point to a legit entry, Commons Compress appears to read a single entry 
without throwing an exception. We need to follow up on this issue, but I think 
we should let it be for 3.3.0. I think we gain much more by switching to 
reading zips via the central directory.
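
For the follow-up, the kind of sanity check I have in mind could be sketched 
roughly as below. This is not Commons Compress's actual code. The class and 
method names (EocdSanityCheck, findEocdCandidate, isPlausibleEocd) are 
hypothetical, and the sketch ignores ZIP64 and archives with prepended data. 
It scans backward for the EOCD signature and then checks whether the central 
directory offset recorded there actually lands on a central directory file 
header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not part of Tika or Commons Compress.
final class EocdSanityCheck {
    static final int EOCD_SIG = 0x06054b50;  // bytes "PK\005\006", little-endian
    static final int CFH_SIG = 0x02014b50;   // bytes "PK\001\002", little-endian
    static final int EOCD_MIN_LEN = 22;      // EOCD record with an empty comment

    /** Scans backward and returns the offset of the last EOCD signature, or -1. */
    static int findEocdCandidate(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = data.length - EOCD_MIN_LEN; i >= 0; i--) {
            if (buf.getInt(i) == EOCD_SIG) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Checks whether the candidate's fields are consistent with a real EOCD.
     * Assumes eocdOffset came from findEocdCandidate, so at least 22 bytes
     * are available from that offset.
     */
    static boolean isPlausibleEocd(byte[] data, int eocdOffset) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int entries = buf.getShort(eocdOffset + 10) & 0xFFFF;        // total entries
        long cdSize = buf.getInt(eocdOffset + 12) & 0xFFFFFFFFL;     // central dir size
        long cdOffset = buf.getInt(eocdOffset + 16) & 0xFFFFFFFFL;   // central dir offset

        // The central directory has to end at or before the EOCD record.
        if (cdOffset + cdSize > eocdOffset) {
            return false;
        }
        if (entries == 0) {
            return cdSize == 0;
        }
        // With at least one entry, the offset must land on a central file header.
        return buf.getInt((int) cdOffset) == CFH_SIG;
    }
}
```

For a truncated file where the signature bytes happen to occur inside 
compressed data, findEocdCandidate would still return an offset, but 
isPlausibleEocd would most likely return false, which a caller could treat as 
a sign that there is no real central directory.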

Once I push the recent fixes (move to poi-ooxml-full and add file names even if 
there's a stream exception), should I roll 3.3.0, or do we need another full 
regression run? And are we ok with my recommendation to live, for now, with 
suboptimal handling of truncated zips that appear to have EOCD markers in 
their compressed data?


> Prep for 3.3.0 release
> ----------------------
>
>                 Key: TIKA-4563
>                 URL: https://issues.apache.org/jira/browse/TIKA-4563
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: kio5_perldoc.mo, tika-3.3.0-20260110.tgz, 
> tika-3.3.0-reports.tgz, tika-3.3.0.tgz, tika-3.3.0c.tgz
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
