Ross Johnson created TIKA-3332:
----------------------------------
Summary: Embedded files not extracted from PDF files with
multilevel EmbeddedFiles tree
Key: TIKA-3332
URL: https://issues.apache.org/jira/browse/TIKA-3332
Project: Tika
Issue Type: Bug
Affects Versions: 1.25
Reporter: Ross Johnson
Attachments: image-2021-03-20-13-36-48-525.png
I have come across some portfolio PDFs that contain many attachments (embedded
files), but Tika neither detects nor extracts them, although it does so for
other portfolio PDFs. The cause may be that these files have a multilevel
EmbeddedFiles name tree that PDFBox does not handle properly.
Here is the EmbeddedFiles structure of one of the PDF portfolios in question.
Notice that the root EmbeddedFiles dictionary has a Kids array that consists
only of intermediate dictionaries, with the actual Names array located one
more level down.
!image-2021-03-20-13-36-48-525.png!
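To illustrate the kind of traversal I would expect, here is a minimal sketch in plain Java. The Node class, its fields, and the sample file names are hypothetical stand-ins for the Kids and Names entries; they are not PDFBox or Tika API, and this is not a claim about how either library is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeSketch {

    /** Hypothetical stand-in for one name tree node; not a PDFBox class. */
    public static final class Node {
        // Leaf nodes carry name -> file entries (the Names array).
        public final Map<String, String> names = new LinkedHashMap<>();
        // Root and intermediate nodes carry child nodes (the Kids array).
        public final List<Node> kids = new ArrayList<>();
    }

    /** Depth-first walk that gathers entries from every level of the tree. */
    public static void collect(Node node, Map<String, String> out) {
        out.putAll(node.names);
        for (Node kid : node.kids) {
            collect(kid, out);
        }
    }

    /** Builds root -> intermediate -> leaf, mirroring the layout described above. */
    public static Node sampleTree() {
        Node leaf = new Node();
        leaf.names.put("a.docx", "file spec for a.docx"); // made-up sample entry
        leaf.names.put("b.xlsx", "file spec for b.xlsx"); // made-up sample entry

        Node intermediate = new Node();
        intermediate.kids.add(leaf);

        Node root = new Node();
        root.kids.add(intermediate);
        return root;
    }

    public static void main(String[] args) {
        Map<String, String> found = new LinkedHashMap<>();
        collect(sampleTree(), found);
        System.out.println(found.keySet());
    }
}
```

In this sample tree, a lookup that reads only the root's Names entries, or only the root's direct Kids, would find nothing, which would be consistent with the behaviour I am seeing.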
--
This message was sent by Atlassian Jira
(v8.3.4#803005)