[jira] [Commented] (TIKA-4106) Digesting and content length on embedded ole/zip/pdf files are not calculated

Tim Allison (Jira) Thu, 24 Aug 2023 11:20:05 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758692#comment-17758692
 ]


Tim Allison commented on TIKA-4106:
-----------------------------------

I looked into this a bit more.  I _think_ we're ok on the items that I listed 
above.  I did find that if we don't set {{spoolToDisk}} to {{0}}, i.e. if we're 
processing embedded file streams in memory, we often don't get the length of 
the stream. So, I added the length calculation to the digester.  The notion is 
that if people care about lengths of embedded files, they'll probably also want 
digests of embedded files and vice versa.

There are still some files where we miss digests when there's an exception in 
the container file, but that's to be expected.
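
To illustrate the idea of getting the length "via the digester", here is a minimal sketch using only the JDK. It is not Tika's actual digester code: the class {{LengthAndDigest}}, its {{Result}} holder and the {{compute}} method are hypothetical names, and SHA-256 is an arbitrary choice. The point is only that a single pass over the stream can yield both values, so the length no longer depends on spooling to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika: computes digest and length in one pass.
final class LengthAndDigest {

    /** Hypothetical holder for the two values gathered while reading. */
    static final class Result {
        final long length;
        final String sha256Hex;

        Result(long length, String sha256Hex) {
            this.length = length;
            this.sha256Hex = sha256Hex;
        }
    }

    /** Reads the stream to the end, counting bytes while the digest is updated. */
    static Result compute(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            long length = 0;
            byte[] buffer = new byte[8192];
            try (DigestInputStream dis = new DigestInputStream(in, md)) {
                int n;
                while ((n = dis.read(buffer)) != -1) {
                    length += n; // every byte that updates the digest is also counted
                }
            }
            return new Result(length, toHex(md.digest()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        Result r = compute(new ByteArrayInputStream(data));
        System.out.println(r.length + " " + r.sha256Hex);
    }
}
```

Because both values come from the same read loop, they are either both present or both missing, which matches the expectation that digests can still be absent when the container file throws.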

> Digesting and content length on embedded ole/zip/pdf files are not calculated
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-4106
>                 URL: https://issues.apache.org/jira/browse/TIKA-4106
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> We've currently put the digester on the parser.  The problem is that some of 
> the detectors for some file formats open the full file and then put that 
> object in the openContainer of the TikaInputStream, which means that the 
> InputStream for those parsers that reuse the openContainer (created by the 
> detector) is never read.
>  
>  
> The outcome of this is that embedded OLE2, Zip (in some circumstances) and 
> PDF(?) files are never digested nor are their stream lengths extracted.
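
The failure mode in the quoted description can be reproduced in miniature with plain JDK classes. This is a simulation under stated assumptions, not Tika code: {{OpenedContainer}} is a hypothetical stand-in for whatever object a detector leaves in the openContainer, and {{digestSeenByWrapper}} is an invented helper. It shows that a digesting wrapper around a stream that nobody reads ends up with the digest of zero bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical simulation, not part of Tika.
final class UnreadStreamDemo {

    /** Stand-in for an object a detector has already built from the full file. */
    static final class OpenedContainer {
        final byte[] bytes;

        OpenedContainer(byte[] bytes) {
            this.bytes = bytes;
        }
    }

    /** Simulated parser that reuses the container and never touches the stream. */
    static int parseReusingContainer(InputStream ignored, OpenedContainer container) {
        return container.bytes.length;
    }

    /** Simulated parser that reads the stream itself, so the wrapper sees every byte. */
    static int parseFromStream(InputStream in) throws IOException {
        int total = 0;
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    /** Returns the hex SHA-256 the digesting wrapper has accumulated after "parsing". */
    static String digestSeenByWrapper(byte[] data, boolean reuseContainer) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (DigestInputStream dis =
                     new DigestInputStream(new ByteArrayInputStream(data), md)) {
                if (reuseContainer) {
                    parseReusingContainer(dis, new OpenedContainer(data));
                } else {
                    parseFromStream(dis);
                }
            }
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With {{reuseContainer}} set to true the result is the SHA-256 of empty input regardless of {{data}}, which is why a digester that only observes the parser's stream reports nothing useful for such files.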



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

