[ https://issues.apache.org/jira/browse/TIKA-4017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712391#comment-17712391 ]
Tim Allison commented on TIKA-4017:
-----------------------------------
For anyone needing to find hidden data in PDFs, it would be helpful to identify
PDFs with incremental updates. As a next step, we should expose the %%EOF
offsets so that users can truncate the files and parse the updates as needed.
As a further step, we can allow users to configure this within Tika.
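
To make the truncation idea concrete, here is a minimal sketch in Python of scanning a file for %%EOF markers and cutting it at each one. This is not Tika code: the function names find_eof_offsets and split_revisions are hypothetical, and the scan is deliberately naive. It treats every literal %%EOF byte sequence as the end of a revision, so a marker-like sequence inside a stream would be counted too, and a real implementation would need to validate each candidate.

```python
from typing import List

EOF_MARKER = b"%%EOF"


def find_eof_offsets(data: bytes) -> List[int]:
    """Return the offset just past each literal %%EOF marker in data.

    Naive byte scan (an assumption of this sketch): it does not check
    that a marker really terminates a revision of the PDF.
    """
    offsets = []
    pos = data.find(EOF_MARKER)
    while pos != -1:
        end = pos + len(EOF_MARKER)
        offsets.append(end)
        pos = data.find(EOF_MARKER, end)
    return offsets


def split_revisions(data: bytes) -> List[bytes]:
    """Truncate data at each %%EOF offset, one candidate file per marker."""
    return [data[:end] for end in find_eof_offsets(data)]
```

More than one offset suggests, but does not prove, that the file carries incremental updates. Each truncated byte string could then be handed to a PDF parser as if it were a standalone file.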
I'm not sure how to handle this cleanly with embedded documents... If the
container file is a PDF with incremental updates, then we can parse the updates
as if they were embedded documents with a depth of 0 and an internal path with
a base of /update0/ or something?
How do we handle incremental updates on PDFs that are themselves embedded? It
gets messy quickly.
> Add optional detection and parsing of incremental updates in PDF
> ----------------------------------------------------------------
>
> Key: TIKA-4017
> URL: https://issues.apache.org/jira/browse/TIKA-4017
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>