[
https://issues.apache.org/jira/browse/TIKA-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306605#comment-17306605
]
Hudson commented on TIKA-3332:
------------------------------
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #179 (See
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/179/])
TIKA-3332 -- checkstyle fix (tallison:
[https://github.com/apache/tika/commit/5da9984cf226ac7ab517fb6fbd2f2fb7ca504079])
* (edit)
tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
> Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree
> ------------------------------------------------------------------------------
>
> Key: TIKA-3332
> URL: https://issues.apache.org/jira/browse/TIKA-3332
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.25
> Reporter: Ross Johnson
> Priority: Major
> Fix For: 1.26
>
> Attachments: Screen Shot 2021-03-22 at 10.29.51 AM.png, Screenshot
> (5).png, image-2021-03-20-13-36-48-525.png
>
>
> I have come across some portfolio PDFs that have many attachments / embedded
> files, but Tika is not detecting or extracting them as it does with some
> other portfolio PDFs. The issue may be that these files have a multilevel
> EmbeddedFiles name tree that is not being handled properly by PDFBox.
> Here is the EmbeddedFiles structure of one of the PDF portfolios in question.
> Notice that the root EmbeddedFiles dictionary has a Kids array that consists
> only of intermediate dictionaries, with the actual Names array one more
> level down.
> !image-2021-03-20-13-36-48-525.png!
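
For illustration only, here is a minimal sketch of the tree shape described in the report and of a recursive walk that reaches Names arrays at any depth. NameTreeNode, NameTreeWalk, and MAX_DEPTH are hypothetical stand-ins invented for this example; they are not the PDFBox or Tika API, and this is not necessarily how the fix in the linked commit works.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PDF name tree node (not a PDFBox class).
final class NameTreeNode {
    // Leaf entries, modelling the Names array: file name -> file description.
    final Map<String, String> names = new LinkedHashMap<>();
    // Child nodes, modelling the Kids array; a kid may itself hold only kids.
    final List<NameTreeNode> kids = new ArrayList<>();

    NameTreeNode name(String key, String value) {
        names.put(key, value);
        return this;
    }

    NameTreeNode kid(NameTreeNode child) {
        kids.add(child);
        return this;
    }
}

final class NameTreeWalk {
    // Assumed guard against malformed, cyclic trees; the value is arbitrary.
    private static final int MAX_DEPTH = 64;

    private NameTreeWalk() {
    }

    // Collects every name entry reachable from root, at any level.
    static Map<String, String> collect(NameTreeNode root) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(root, out, 0);
        return out;
    }

    private static void collect(NameTreeNode node, Map<String, String> out, int depth) {
        if (node == null || depth > MAX_DEPTH) {
            return;
        }
        // Entries stored directly on this node.
        out.putAll(node.names);
        // Descend into every kid, including intermediate nodes with no names.
        for (NameTreeNode kid : node.kids) {
            collect(kid, out, depth + 1);
        }
    }
}
```

With a root whose only kid is an intermediate node, and that node's only kid carries the names, a reader that inspects just the root and its direct kids finds nothing, while collect(root) returns all entries.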
--
This message was sent by Atlassian Jira
(v8.3.4#803005)