[jira] [Commented] (TIKA-4017) Add optional detection and parsing of incremental updates in PDF

Tim Allison (Jira) Fri, 14 Apr 2023 06:35:07 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712391#comment-17712391
 ]


Tim Allison commented on TIKA-4017:
-----------------------------------

For anyone who needs to find hidden data in PDFs, it would be helpful to identify 
PDFs with incremental updates.  As a next step, we should expose the %%EOF 
offsets so that users can truncate the files and parse the updates as needed.  
After that, we can allow users to configure this within Tika.
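
To make the idea concrete, here is a rough sketch of scanning for %%EOF markers and truncating at each one. This is not Tika's or PDFBox's API: the class EofOffsetScanner and its methods are hypothetical names made up for illustration, and the sketch assumes the whole file fits in a byte array. A marker count above one is only a heuristic, since the same byte sequence can also appear elsewhere in a file, for example inside stream data.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Tika or PDFBox. */
final class EofOffsetScanner {

    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.US_ASCII);

    private EofOffsetScanner() {
    }

    /**
     * Returns, for every %%EOF marker found, the offset of the first byte
     * after the marker, i.e. the length of the file if it were cut there.
     */
    static List<Integer> findEofEndOffsets(byte[] pdf) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + EOF_MARKER.length <= pdf.length; i++) {
            if (matchesAt(pdf, i)) {
                offsets.add(i + EOF_MARKER.length);
                // Skip past the marker we just matched.
                i += EOF_MARKER.length - 1;
            }
        }
        return offsets;
    }

    /** Heuristic only: more than one marker suggests incremental updates. */
    static boolean mayHaveIncrementalUpdates(byte[] pdf) {
        return findEofEndOffsets(pdf).size() > 1;
    }

    /** Returns a copy of the file cut off right after the given marker. */
    static byte[] truncateAt(byte[] pdf, int endOffset) {
        if (endOffset < 0 || endOffset > pdf.length) {
            throw new IllegalArgumentException("offset out of range: " + endOffset);
        }
        return Arrays.copyOf(pdf, endOffset);
    }

    private static boolean matchesAt(byte[] data, int pos) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[pos + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Each truncated copy could then be handed to a parser as if it were a standalone file.  A real implementation would probably need to stream large files rather than buffer them, and to verify that each marker really closes a trailer.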

I'm not sure how to handle this cleanly with embedded documents.  If the 
container file is a PDF with incremental updates, could we parse the other 
versions as if they were embedded documents with a depth of 0 and an internal 
path with a base of /update0/ or something similar?

How do we handle incremental updates on PDFs that are themselves embedded?  It 
gets messy quickly.

> Add optional detection and parsing of incremental updates in PDF
> ----------------------------------------------------------------
>
>                 Key: TIKA-4017
>                 URL: https://issues.apache.org/jira/browse/TIKA-4017
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
