[ 
https://issues.apache.org/jira/browse/TIKA-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306514#comment-17306514
 ] 

Hudson commented on TIKA-3332:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #109 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/109/])
TIKA-3332 -- recursively process the embedded file tree in PDFs. (tallison: 
[https://github.com/apache/tika/commit/8bf65c01e174e1ba872e813089b076f21ddb4410])
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) CHANGES.txt
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (add) tika-parsers/src/test/resources/test-documents/testPDF_deeplyEmbeddedAttachments.pdf


> Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-3332
>                 URL: https://issues.apache.org/jira/browse/TIKA-3332
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.25
>            Reporter: Ross Johnson
>            Priority: Major
>             Fix For: 1.26
>
>         Attachments: Screen Shot 2021-03-22 at 10.29.51 AM.png, Screenshot 
> (5).png, image-2021-03-20-13-36-48-525.png
>
>
> I have come across some portfolio PDFs that have many attachments / embedded 
> files, but Tika is not detecting or extracting them as it does with some 
> other portfolio PDFs. The issue may be that these files have a multilevel 
> EmbeddedFiles name tree that is not being handled properly by PDFBox.
> Here is the EmbeddedFiles structure of one of the PDF portfolios in question. 
> Notice that the root EmbeddedFiles dictionary has a Kids array that only 
> consists of intermediate dictionaries, with the actual Names array being one 
> more level down.
> !image-2021-03-20-13-36-48-525.png!
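
For readers unfamiliar with the structure described in the report, here is a minimal sketch of walking a multilevel name tree recursively. It illustrates the general idea only and is not the code from the commit above. The Node class, its fields, and the file names are hypothetical stand-ins, not PDFBox or Tika types. The sketch assumes each node may carry its own Names entries, a Kids list, or both.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeWalk {

    /** Hypothetical stand-in for a name tree node; not a PDFBox or Tika class. */
    static final class Node {
        // Leaf entries, standing in for a Names array: file name -> file description.
        final Map<String, String> names = new LinkedHashMap<>();
        // Child nodes, standing in for a Kids array.
        final List<Node> kids = new ArrayList<>();
    }

    /** Collects the entries of this node and of every descendant, depth first. */
    static void collect(Node node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        out.putAll(node.names);
        for (Node kid : node.kids) {
            collect(kid, out);
        }
    }

    /** Builds the shape from the report: root -> intermediate nodes -> leaves. */
    static Node sampleTree() {
        Node leafA = new Node();
        leafA.names.put("a.txt", "first attachment");
        Node leafB = new Node();
        leafB.names.put("b.txt", "second attachment");

        Node intermediateA = new Node();
        intermediateA.kids.add(leafA);
        Node intermediateB = new Node();
        intermediateB.kids.add(leafB);

        // The root holds only Kids; the Names entries sit two levels down.
        Node root = new Node();
        root.kids.add(intermediateA);
        root.kids.add(intermediateB);
        return root;
    }

    public static void main(String[] args) {
        Map<String, String> found = new LinkedHashMap<>();
        collect(sampleTree(), found);
        System.out.println(found.keySet());
    }
}
```

A walk that reads only the root's Names entries and its direct Kids would find nothing in this sample tree, which matches the symptom in the report. A production version would probably also want to guard against cyclic Kids references in malformed files.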



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
