Tim Allison created TIKA-1701:
---------------------------------

             Summary: Fix DigestingParser to handle truncated package files more robustly
                 Key: TIKA-1701
                 URL: https://issues.apache.org/jira/browse/TIKA-1701
             Project: Tika
          Issue Type: Bug
            Reporter: Tim Allison
            Priority: Trivial


On a recent run against Common Crawl data, I found that the DigestingParser's 
strategy of mark() -> read the whole stream -> reset() _before_ the parse is 
causing problems with truncated package files: the digester hits an EOF 
exception before the embedded files can be parsed.

We might want to do the digesting after the parse (?) or wrap the InputStream 
to digest each byte as it is read.
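
As a rough illustration of the second option, here is a minimal sketch that 
uses the JDK's java.security.DigestInputStream so the digest is updated as the 
consumer reads, with no up-front pass over the stream. The class name, the 
consume() helper (a stand-in for the wrapped parser) and digestDuringRead() 
are hypothetical; this is not the DigestingParser code. Note that the digest 
only covers the bytes actually read, so a caller that wants a whole-file digest 
would still have to drain whatever the parser leaves unread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestWhileReading {

    /** Hypothetical stand-in for the wrapped parser: reads until EOF. */
    static long consume(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    /** Formats a digest as lowercase hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /**
     * Wraps the raw stream so every byte handed to the consumer also
     * updates the MessageDigest. Nothing is read before the consumer runs.
     */
    static String digestDuringRead(InputStream raw, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (DigestInputStream in = new DigestInputStream(raw, md)) {
            consume(in); // a real parser would be invoked here instead
        }
        // Digest of exactly the bytes that were read through the wrapper.
        return toHex(md.digest());
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(digestDuringRead(new ByteArrayInputStream(data), "MD5"));
    }
}
```

With this arrangement a truncated file fails wherever the parser itself hits 
the truncation, rather than in a digest pass that runs before any embedded 
file has been handled.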

In a very few cases, more attachments could be read with the DigestingParser 
than without, but the opposite was far more common.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
