[ 
https://issues.apache.org/jira/browse/TIKA-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651742#comment-14651742
 ] 

Tim Allison commented on TIKA-1701:
-----------------------------------

Does anyone know of a DigestInputStream that works with mark(), reset() and skip()?
I have a query out to [commons codec|http://mail-archives.apache.org/mod_mbox/commons-user/201508.mbox/%3CDM2PR09MB07135F86C7AC6981F1BB216BC78A0%40DM2PR09MB0713.namprd09.prod.outlook.com%3E],
but if anyone in Tika-land has a recommendation, I'd appreciate it.
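
To make the question concrete, here is a rough sketch of the kind of wrapper I have in mind. It is not an existing Tika or commons-codec class. The name RewindSafeDigestStream and its fields are made up for illustration. The idea is to track the current offset and a high-water mark of bytes already digested, so that bytes re-read after reset() are not digested twice, and to implement skip() by reading so that skipped bytes are still digested.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;

/** Hypothetical sketch: feeds each byte of the wrapped stream to the digest at most once. */
class RewindSafeDigestStream extends FilterInputStream {
    private final MessageDigest digest;
    private long pos = 0;       // current offset, counted from where wrapping began
    private long digested = 0;  // high-water mark: number of bytes already digested
    private long markPos = -1;  // offset at the last mark(), or -1 if mark() was never called

    RewindSafeDigestStream(InputStream in, MessageDigest digest) {
        super(in);
        this.digest = digest;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0] & 0xff;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = in.read(b, off, len);
        if (n > 0) {
            long end = pos + n;
            if (end > digested) {
                // Only the tail of this read lies past the high-water mark.
                int fresh = (int) (end - digested);
                digest.update(b, off + n - fresh, fresh);
                digested = end;
            }
            pos = end;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Skip by reading, so that skipped bytes still reach the digest.
        byte[] buf = new byte[4096];
        long skipped = 0;
        while (skipped < n) {
            int r = read(buf, 0, (int) Math.min(buf.length, n - skipped));
            if (r == -1) {
                break;
            }
            skipped += r;
        }
        return skipped;
    }

    @Override
    public synchronized void mark(int readlimit) {
        in.mark(readlimit);
        markPos = pos;
    }

    @Override
    public synchronized void reset() throws IOException {
        if (markPos < 0) {
            throw new IOException("mark() was not called");
        }
        in.reset();     // throws if the wrapped stream cannot rewind
        pos = markPos;  // rewind our offset; the high-water mark stays put
    }
}
```

The caller keeps its own reference to the MessageDigest and calls digest() once reading is done. This assumes the wrapped stream itself supports mark()/reset() (for example a BufferedInputStream), and the digest covers only the bytes that were actually read or skipped, so a caller wanting a whole-file digest would still need to drain the stream after the parse.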

> Fix DigestingParser to handle truncated package files more robustly
> -------------------------------------------------------------------
>
>                 Key: TIKA-1701
>                 URL: https://issues.apache.org/jira/browse/TIKA-1701
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Trivial
>
> On a recent run against Common Crawl data, I found that the DigestingParser's 
> strategy of mark() --> digest stream --> reset() _before_ the parse causes 
> problems with truncated package files: the digester hits an EOF exception 
> before the embedded files can be parsed.
> We might want to do the digesting after the parse (?) or wrap the InputStream 
> to digest each byte as it is read.
> In a very few cases, more attachments were read with the DigestingParser 
> than without, but the opposite was far more common.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
