Tim Allison created TIKA-1612:
---------------------------------

             Summary: Exceptions getting image data in PPT files
                 Key: TIKA-1612
                 URL: https://issues.apache.org/jira/browse/TIKA-1612
             Project: Tika
          Issue Type: Bug
            Reporter: Tim Allison
            Priority: Minor


In roughly 500 PPT files from govdocs1, we're getting zip exceptions (unknown 
compression method, bad block, etc.) when Tika's HSLFExtractor calls 
{{getData()}} on an embedded image.

Under normal circumstances (I just learned today...), if an attachment causes a 
RuntimeException, we swallow it in 
{{ParsingEmbeddedDocumentExtractor}}.

However, because we call {{getData()}} before the embedded extractor 
takes over, an exception thrown there makes the parse of the entire file fail.
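
One possible way to contain the failure is to guard the {{getData()}} call itself, so that a bad image is skipped and the rest of the file still parses. The following is only a minimal sketch, not the actual Tika or POI code: {{Picture}}, {{EmbeddedHandler}}, and {{extractAll}} are hypothetical stand-ins, and it assumes the zip failure surfaces as a RuntimeException.

```java
import java.util.ArrayList;
import java.util.List;

final class GuardedImageExtraction {

    /** Hypothetical stand-in for an embedded picture; not the POI API. */
    interface Picture {
        byte[] getData();
    }

    /** Hypothetical stand-in for whatever hands data to the embedded extractor. */
    interface EmbeddedHandler {
        void handle(byte[] data);
    }

    /**
     * Reads each picture's data inside a try/catch so that one corrupt
     * image cannot abort the loop. Returns one warning per skipped picture.
     */
    static List<String> extractAll(List<Picture> pictures, EmbeddedHandler handler) {
        List<String> warnings = new ArrayList<>();
        for (int i = 0; i < pictures.size(); i++) {
            byte[] data;
            try {
                // Assumption: decompression problems are thrown as RuntimeException.
                data = pictures.get(i).getData();
            } catch (RuntimeException e) {
                // Record the failure and move on to the next picture.
                warnings.add("picture " + i + ": " + e.getMessage());
                continue;
            }
            // Only reached when the data was read successfully.
            handler.handle(data);
        }
        return warnings;
    }
}
```

Returning the warnings instead of silently dropping them keeps the failures visible to the caller, which could log them or attach them to the parse metadata.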



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)