[
https://issues.apache.org/jira/browse/TIKA-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651742#comment-14651742
]
Tim Allison commented on TIKA-1701:
-----------------------------------
Anyone know of a DigestInputStream that works with mark(), reset() and skip()?
I have a query out to [commons codec|http://mail-archives.apache.org/mod_mbox/commons-user/201508.mbox/%3CDM2PR09MB07135F86C7AC6981F1BB216BC78A0%40DM2PR09MB0713.namprd09.prod.outlook.com%3E],
but if anyone in Tika-land has a recommendation, I'd appreciate it.
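For discussion, here is a rough sketch of what such a stream could look like. It is not existing Tika or commons-codec code. The class name ReplayTolerantDigestStream and its demo() helper are made up for illustration. The idea is to track a high-water mark of bytes already fed to the MessageDigest, so bytes re-read after reset() are not digested twice, and to implement skip() by reading, so skipped bytes are still digested. It assumes the wrapped stream itself supports mark()/reset().

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical class, for illustration only.
class ReplayTolerantDigestStream extends FilterInputStream {
    private final MessageDigest digest;
    private long pos = 0;       // current offset in the underlying stream
    private long digested = 0;  // high-water mark: bytes already fed to the digest
    private long markPos = 0;   // offset recorded by the last mark()

    ReplayTolerantDigestStream(InputStream in, MessageDigest digest) {
        super(in);
        this.digest = digest;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : (one[0] & 0xFF);
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = in.read(b, off, len);
        if (n > 0) {
            long end = pos + n;
            if (end > digested) {
                // Only the tail of this read is new; the rest was seen before a reset().
                int fresh = (int) (end - digested);
                digest.update(b, off + n - fresh, fresh);
                digested = end;
            }
            pos = end;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Read and discard instead of delegating, so skipped bytes are digested too.
        byte[] buf = new byte[4096];
        long remaining = n;
        while (remaining > 0) {
            int r = read(buf, 0, (int) Math.min(buf.length, remaining));
            if (r == -1) {
                break;
            }
            remaining -= r;
        }
        return n <= 0 ? 0 : n - remaining;
    }

    @Override
    public synchronized void mark(int readlimit) {
        in.mark(readlimit);
        markPos = pos;
    }

    @Override
    public synchronized void reset() throws IOException {
        in.reset();
        pos = markPos;
    }

    // Demo: read a little, reset, skip, read to EOF, then compare with a one-shot digest.
    static boolean demo() {
        try {
            byte[] data = "truncated package file".getBytes(StandardCharsets.UTF_8);
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            ReplayTolerantDigestStream s =
                    new ReplayTolerantDigestStream(new ByteArrayInputStream(data), md);
            s.mark(data.length);
            s.read(new byte[4], 0, 4);
            s.reset();
            s.skip(6);
            while (s.read() != -1) {
                // drain the rest of the stream
            }
            byte[] expected = MessageDigest.getInstance("SHA-256").digest(data);
            return MessageDigest.isEqual(expected, md.digest());
        } catch (IOException | NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("digests match: " + demo());
    }
}
```

One caveat with this approach: the digest only covers bytes the consumer actually reached, so if the parser stops early, something would still have to drain the rest of the stream before the digest is final.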
> Fix DigestingParser to handle truncated package files more robustly
> -------------------------------------------------------------------
>
> Key: TIKA-1701
> URL: https://issues.apache.org/jira/browse/TIKA-1701
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Trivial
>
> On a recent run against Common Crawl data, I found that the DigestingParser's
> strategy of mark() --> digest stream --> reset() _before_ the parse causes
> problems with truncated package files: the digester hits the EOF
> exception before the embedded files can be parsed.
> We might want to do the digesting after the parse (?) or wrap the InputStream
> to digest each byte as it is read.
> In a very few cases, more attachments could be read with the
> DigestingParser than without, but the opposite was far more common.