[
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128921#comment-14128921
]
Jeremy Anderson commented on TIKA-1268:
---------------------------------------
Take a look at my last comment in TIKA-1285 to see some of the common
exceptions I saw that prevented DomXmpParser from extracting all of the
information in a file's XMP data. Some of them had no effect on the results of
Tika's JUnit tests. (The inner workings of XMP are a bit outside my area of
knowledge.)
My patch removed Keyword and Comment metadata from a JUnit case or two in
JempboxExtractorTest and JpegParserTest, and some of the extended ones in
PDFParserTest. Take a look at my patch and search for "//TODO: Fix once
DomXmpParser error fixed:", which I placed by every test case that I commented
out.
I believe the root cause of the xmpbox DomXmpParser failures is that it requires
stricter adherence to the standards than the XMP content in real files
necessarily provides, plus a few missed cases in handling Bag and Seq data.
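
For anyone unfamiliar with the Bag and Seq structures mentioned above, here is a
minimal sketch of what they look like in an XMP packet and how a lenient reader
could pull the values out. This is not how xmpbox or Tika does it; it uses only
the JDK's DOM parser, and the class name XmpArraySketch, the readArray helper,
and the sample packet are all made up for illustration. It deliberately accepts
any rdf:li items under the property regardless of the container type, which is
the kind of tolerance a strict parser does not give you.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of Tika, PDFBox, or xmpbox.
public class XmpArraySketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Hand-written sample packet: dc:subject (keywords) is an unordered Bag,
    // dc:creator is an ordered Seq.
    static final String SAMPLE =
        "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
      + "<rdf:RDF xmlns:rdf='" + RDF_NS + "'>"
      + "<rdf:Description rdf:about='' xmlns:dc='" + DC_NS + "'>"
      + "<dc:subject><rdf:Bag><rdf:li>tika</rdf:li><rdf:li>pdf</rdf:li></rdf:Bag></dc:subject>"
      + "<dc:creator><rdf:Seq><rdf:li>Some Author</rdf:li></rdf:Seq></dc:creator>"
      + "</rdf:Description></rdf:RDF></x:xmpmeta>";

    // Collects the text of every rdf:li found under the named property,
    // without checking whether the container is a Bag, Seq, or Alt.
    static List<String> readArray(String xmp, String ns, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the *NS lookups below
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

        List<String> values = new ArrayList<>();
        NodeList props = doc.getElementsByTagNameNS(ns, localName);
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList items = prop.getElementsByTagNameNS(RDF_NS, "li");
            for (int j = 0; j < items.getLength(); j++) {
                values.add(items.item(j).getTextContent().trim());
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readArray(SAMPLE, DC_NS, "subject"));
        System.out.println(readArray(SAMPLE, DC_NS, "creator"));
    }
}
```

A property that is missing simply yields an empty list here instead of an
exception, so one malformed or unexpected field does not cost you the rest of
the metadata.
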
If you apply the TIKA-1285 patch, you can uncomment the System.err.println
calls to see the messages DomXmpParser fails with, but be sure to also apply
the PDFBOX-2318 patch, which fixes a few of the easier issues in that parser.
> Extract images from PDF documents
> ---------------------------------
>
> Key: TIKA-1268
> URL: https://issues.apache.org/jira/browse/TIKA-1268
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 1.6
>
>
> It would be nice if images within PDF documents could be extracted much like
> embedded attachments are now being handled.