[ https://issues.apache.org/jira/browse/TIKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074904#comment-14074904 ]
Hudson commented on TIKA-1374:
------------------------------
SUCCESS: Integrated in tika-trunk-jdk1.7 #120 (See
[https://builds.apache.org/job/tika-trunk-jdk1.7/120/])
TIKA-1374: Try to extract OS-specific embedded files within PDFs (tallison:
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1613501)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/TikaTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
> Need to add code to look for OS-specific keys for embedded files within PDFs
> ----------------------------------------------------------------------------
>
> Key: TIKA-1374
> URL: https://issues.apache.org/jira/browse/TIKA-1374
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.6
>
>
> Embedded files in PDFs can be found under the general, all-purpose key we
> currently use via PDFBox: "F". However, embedded documents can also be
> stored under OS-specific keys: "DOS", "Mac" and "Unix".
> [~lehmi] confirmed on the PDFBox users
> [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%[email protected]%3e]
> that we might be missing embedded documents if we're not trying the
> OS-specific keys as well. As Andreas points out, according to the spec the
> OS-specific keys shouldn't be used any more, but I think we should support
> extraction for them.
> My proposal is to pull all documents that are available under any of the
> four keys (well, via getEmbeddedFile<OS>() in PDFBox). This has the downside
> of potentially extracting essentially duplicate documents, but I'd prefer to
> err on the side of extracting everything.
> The code fix is trivial, and I'll try to commit it today. However, it will
> take me a bit of time to generate a test file that stores files under the
> OS-specific keys. So, if anyone has an ASF-friendly file available or wants
> to take on the task of generating one, please do.
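
The proposal quoted above (try the general key and every OS-specific key, and keep whatever is found) can be sketched roughly as follows. This is a minimal illustration, not the committed change: `EmbeddedFileKeys`, `FileSpec` and `extractAll` are hypothetical names, and `FileSpec` is only a stand-in for a PDF file specification so the example runs without PDFBox on the classpath. In the real parser the lookups would presumably go through the getEmbeddedFile<OS>() accessors mentioned in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedFileKeys {

    /** Hypothetical stand-in for a PDF file specification: key -> embedded bytes. */
    public static final class FileSpec {
        private final Map<String, byte[]> streams = new LinkedHashMap<>();

        public FileSpec put(String key, byte[] data) {
            streams.put(key, data);
            return this;
        }

        /** Returns the stream stored under the key, or null if none is present. */
        public byte[] get(String key) {
            return streams.get(key);
        }
    }

    // The general key first, then the OS-specific keys named in the issue.
    static final String[] KEYS = {"F", "DOS", "Mac", "Unix"};

    /** Collects every embedded stream found under any of the four keys. */
    public static List<byte[]> extractAll(FileSpec spec) {
        List<byte[]> found = new ArrayList<>();
        for (String key : KEYS) {
            byte[] data = spec.get(key);
            if (data != null) {
                // Possible duplicates are kept on purpose: extract everything.
                found.add(data);
            }
        }
        return found;
    }
}
```

Keeping duplicates mirrors the trade-off stated in the description; a caller that wants to drop identical streams could compare content hashes afterwards, which this sketch deliberately leaves out.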
--
This message was sent by Atlassian JIRA
(v6.2#6252)