[
https://issues.apache.org/jira/browse/TIKA-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711101#comment-17711101
]
Tim Allison edited comment on TIKA-4012 at 4/12/23 2:05 PM:
------------------------------------------------------------
According to the spec, these are some of the places where Associated Files
(/AF) may appear in PDF 2.x:
{noformat}
• the PDF document catalog dictionary (14.13.3, "Associated files linked to the PDF document’s catalog")
• a page dictionary (14.13.4, "Associated files linked to a page dictionary")
• a graphics object (using a marked-content property list dictionary, 14.13.5, "Associated files linked to graphics objects")
• a structure element dictionary (14.13.6, "Associated files linked to structure elements")
• an XObject dictionary (14.13.7, "Associated files linked to XObjects")
• a DParts dictionary (14.13.8, "Associated files linked to DParts")
• an annotation dictionary (14.13.9, "Associated files linked to an annotation dictionary")
• a metadata stream dictionary (14.3.2, "Metadata streams")
{noformat}
Oh, I had forgotten about this:
https://www.pdfa.org/wp-content/uploads/2018/10/PDF20_AN002-AF.pdf
This suggests that, in PDF 2.0, /AF can appear anywhere, not just in the places listed above.
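If /AF really can show up on any dictionary, one option is to stop checking a fixed list of locations and walk the whole object graph instead. Below is a rough sketch of that idea, not Tika or PDFBox code: PDF dictionaries are modelled as plain Python dicts and arrays as lists, keys are written without the leading slash, and the name find_af_entries is made up for illustration. The walk tracks visited containers because PDF object graphs contain cycles (a page's /Parent points back into the page tree).

```python
from typing import Any, Iterator, List, Tuple

AF_KEY = "AF"  # the /AF key, without the leading slash in this toy model


def find_af_entries(root: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every /AF entry reachable from root.

    Hypothetical helper: dicts stand in for PDF dictionaries, lists for
    PDF arrays, and anything else is treated as a leaf value.
    """
    seen = set()  # id() of containers already visited, to survive cycles
    stack: List[Tuple[str, Any]] = [("", root)]
    while stack:
        path, node = stack.pop()
        if isinstance(node, dict):
            if id(node) in seen:
                continue
            seen.add(id(node))
            for key, value in node.items():
                child_path = f"{path}/{key}"
                if key == AF_KEY:
                    yield child_path, value
                stack.append((child_path, value))
        elif isinstance(node, list):
            if id(node) in seen:
                continue
            seen.add(id(node))
            for i, value in enumerate(node):
                stack.append((f"{path}[{i}]", value))


# Toy document: /AF on the catalog and on a page, plus a /Parent cycle.
catalog_spec = {"Type": "Filespec", "F": "data.xml"}
page_spec = {"Type": "Filespec", "F": "table.csv"}
page = {"Type": "Page", "AF": [page_spec]}
pages = {"Type": "Pages", "Kids": [page]}
page["Parent"] = pages  # back-reference, as in a real page tree
catalog = {"Type": "Catalog", "AF": [catalog_spec], "Pages": pages}

found = sorted(path for path, _ in find_af_entries(catalog))
print(found)  # ['/AF', '/Pages/Kids[0]/AF']
```

A real implementation would also have to resolve indirect references and decide whether to descend into content streams for marked-content property lists. This sketch ignores both.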
> Improve extraction of embedded documents in PDFs
> ------------------------------------------------
>
> Key: TIKA-4012
> URL: https://issues.apache.org/jira/browse/TIKA-4012
> Project: Tika
> Issue Type: New Feature
> Reporter: Tim Allison
> Priority: Major
> Attachments: pdfbox-new-attachments-reports.tgz
>
>
> We're currently processing the EmbeddedFiles entry in the name tree and
> annotations to look for file spec dictionaries. Unfortunately, PDFs may embed
> files in lots of other places. The newly free 2.0 spec makes this abundantly
> and painfully clear.