Tim Allison created TIKA-4415:
---------------------------------

             Summary: Improve zip detection on truncated zips
                 Key: TIKA-4415
                 URL: https://issues.apache.org/jira/browse/TIKA-4415
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


On TIKA-4411, while running the regression tests, we found a file that was 
identified as an xps file with 3.1.0 but is now identified as a zip file on 
the newer 3.x branch.

The file was: BCENRNQMIUX64IPK3K5BBMM6JWU7XNKO

The issue is subtle. The zip has a data descriptor. Our retry technique in the 
detector calls reset() on the InputStream. For some reason, this threw an 
IOException (invalid mark) on a Tika input stream. I couldn't figure out why 
this was happening, but if we spool the zip to a file and then retry on that 
file, detection works as it did in 3.1.0.
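
For illustration only, here is a minimal sketch of the two approaches using 
plain JDK streams. It is not the Tika detector code, and the class and method 
names (SpoolRetrySketch, readUpTo, resetFailsAfterOverrun, retryOnSpooledFile) 
are hypothetical. The first method shows one common way a mark can become 
invalid (reading past the mark limit with a small buffer); that is an 
assumption for the demo, not a diagnosis of what happened in this issue. The 
second method shows the spool-then-retry idea: once the bytes are in a temp 
file, each pass opens a fresh stream and never depends on reset().

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Tika.
class SpoolRetrySketch {

    // Reads up to n bytes, stopping early at end of stream.
    static byte[] readUpTo(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return Arrays.copyOf(buf, total);
    }

    // Marks with a small limit, reads well past it, then tries reset().
    // Returns true if reset() throws an IOException.
    static boolean resetFailsAfterOverrun(byte[] data) {
        // Assumption for the demo: an 8-byte buffer and a 4-byte mark limit.
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 8)) {
            in.mark(4);
            readUpTo(in, 64);
            in.reset();
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    // Spools the bytes to a temp file, then reads the file in two independent
    // passes. Returns true if both passes see the complete original content.
    static boolean retryOnSpooledFile(byte[] data) {
        Path tmp = null;
        try {
            tmp = Files.createTempFile("zip-detect-", ".tmp");
            try (InputStream src = new ByteArrayInputStream(data)) {
                // createTempFile already created the file, so replace it.
                Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            byte[] first;
            byte[] second;
            try (InputStream pass1 = Files.newInputStream(tmp)) {
                first = readUpTo(pass1, data.length);
            }
            try (InputStream pass2 = Files.newInputStream(tmp)) {
                second = readUpTo(pass2, data.length);
            }
            return Arrays.equals(first, data) && Arrays.equals(second, data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp);
                } catch (IOException ignored) {
                    // Best-effort cleanup in a sketch.
                }
            }
        }
    }
}
```

Calling SpoolRetrySketch.resetFailsAfterOverrun(...) on an input of a few 
hundred bytes exercises the failing reset() path, while 
SpoolRetrySketch.retryOnSpooledFile(...) exercises the file-backed retry. The 
cost of spooling is extra disk I/O, which is the price for a retry that does 
not depend on the state of the stream's mark.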



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
