Ross Johnson created TIKA-3332:
----------------------------------

             Summary: Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree
                 Key: TIKA-3332
                 URL: https://issues.apache.org/jira/browse/TIKA-3332
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.25
            Reporter: Ross Johnson
         Attachments: image-2021-03-20-13-36-48-525.png

I have come across some portfolio PDFs that contain many attachments (embedded 
files) that Tika neither detects nor extracts, although it does so for other 
portfolio PDFs. The issue may be that these files have a multilevel 
EmbeddedFiles name tree that is not being handled properly by PDFBox.


Here is the EmbeddedFiles structure of one of the PDF portfolios in question. 
Notice that the root EmbeddedFiles dictionary has a Kids array that consists 
only of intermediate dictionaries, each of which has its own Kids array; the 
actual Names arrays are one more level down.

!image-2021-03-20-13-36-48-525.png!
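
To illustrate the suspected failure mode, here is a minimal, self-contained 
sketch of a name tree walk. It is not Tika or PDFBox code: the Node class and 
both collect methods are hypothetical stand-ins that only model the 
Names / Kids layout described above. As far as I know, PDFBox's 
PDNameTreeNode exposes getNames() and getKids() in a similar way, but that 
mapping is an assumption and is not exercised here. The sketch contrasts a 
traversal that stops at the root's direct kids with one that recurses through 
every level.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a PDF name tree; none of these types come from PDFBox.
class NameTreeWalk {

    // A node may carry its own name -> file spec entries, child nodes, or both.
    static final class Node {
        final Map<String, String> names = new LinkedHashMap<>();
        final List<Node> kids = new ArrayList<>();

        Node name(String key, String fileSpec) {
            names.put(key, fileSpec);
            return this;
        }

        Node kid(Node child) {
            kids.add(child);
            return this;
        }
    }

    // Reads the root's names and the names of its direct kids only.
    // This is a guess at the kind of traversal that would miss deeper entries.
    static Map<String, String> collectOneLevel(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        if (root == null) {
            return out;
        }
        out.putAll(root.names);
        for (Node kid : root.kids) {
            out.putAll(kid.names);
        }
        return out;
    }

    // Recursively gathers names from the node and all of its descendants.
    static Map<String, String> collect(Node node) {
        Map<String, String> out = new LinkedHashMap<>();
        collectInto(node, out);
        return out;
    }

    private static void collectInto(Node node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        out.putAll(node.names);
        for (Node kid : node.kids) {
            collectInto(kid, out);
        }
    }

    public static void main(String[] args) {
        // root -> intermediate -> leaf (the only node that has names)
        Node leaf = new Node().name("a.txt", "spec-a").name("b.txt", "spec-b");
        Node intermediate = new Node().kid(leaf);
        Node root = new Node().kid(intermediate);

        System.out.println("one level: " + collectOneLevel(root).keySet());
        System.out.println("recursive: " + collect(root).keySet());
    }
}
```

With a tree shaped like the one in the screenshot (root, then intermediate 
nodes, then the nodes holding Names), the one-level walk finds nothing while 
the recursive walk reaches every entry. Whether this is what actually happens 
inside Tika or PDFBox still needs to be confirmed.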



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
