[jira] [Commented] (TIKA-1268) Extract images from PDF documents

Jeremy Anderson (JIRA) Wed, 10 Sep 2014 11:48:13 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128921#comment-14128921
 ]


Jeremy Anderson commented on TIKA-1268:
---------------------------------------

Take a look at my last comment in TIKA-1285 to see some of the common 
exceptions that prevented DomXmpParser from reading all of the information in 
a file's XMP data. Some of them had no effect on Tika's JUnit tests. (The 
inner workings of XMP files are a bit outside my area of knowledge.) 

My patch removed Keyword and Comment metadata from a JUnit test case or two in 
JempboxExtractorTest and JpegParserTest, and from some of the extended ones in 
PDFParserTest. Take a look at my patch and search for 
"//TODO: Fix once DomXmpParser error fixed:", which I placed by every test 
case that I commented out.

I believe the root cause of the xmpbox DomXmpParser failures is that it 
requires stricter adherence to the standards than the XMP content in real 
files necessarily has, plus a few missed cases in its handling of Bag and Seq 
data.
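
To illustrate the Bag/Seq point, here is a minimal sketch of a lenient reader 
that accepts whichever RDF container a file happens to use for a property. It 
uses only the JDK's DOM parser. The class LenientXmpArrayReader and its 
readItems method are hypothetical names for this example; they are not part of 
Tika, PDFBox, or xmpbox, and this is not how DomXmpParser is implemented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Tika, PDFBox, or xmpbox.
class LenientXmpArrayReader {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the values of the first property named {nsUri}localName,
    // accepting rdf:Bag, rdf:Seq, or rdf:Alt, or a plain text value.
    static List<String> readItems(String xmp, String nsUri, String localName)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xmp)));

        List<String> items = new ArrayList<>();
        NodeList props = doc.getElementsByTagNameNS(nsUri, localName);
        if (props.getLength() == 0) {
            return items; // property absent: empty result, no exception
        }
        Element prop = (Element) props.item(0);

        boolean sawContainer = false;
        for (Node c = prop.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() != Node.ELEMENT_NODE
                    || !RDF_NS.equals(c.getNamespaceURI())) {
                continue;
            }
            String name = c.getLocalName();
            // Do not insist on one container type; take whichever is present.
            if ("Bag".equals(name) || "Seq".equals(name) || "Alt".equals(name)) {
                sawContainer = true;
                for (Node li = c.getFirstChild(); li != null; li = li.getNextSibling()) {
                    if (li.getNodeType() == Node.ELEMENT_NODE
                            && RDF_NS.equals(li.getNamespaceURI())
                            && "li".equals(li.getLocalName())) {
                        items.add(li.getTextContent().trim());
                    }
                }
            }
        }
        if (!sawContainer) {
            // Fall back to a simple text value instead of failing.
            String text = prop.getTextContent().trim();
            if (!text.isEmpty()) {
                items.add(text);
            }
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        String xmp = "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:dc='" + DC_NS + "'>"
                + "<rdf:Description rdf:about=''><dc:subject><rdf:Seq>"
                + "<rdf:li>pdf</rdf:li><rdf:li>image</rdf:li>"
                + "</rdf:Seq></dc:subject></rdf:Description></rdf:RDF>";
        System.out.println(readItems(xmp, DC_NS, "subject"));
    }
}
```

The example in main stores dc:subject in an rdf:Seq, where a strict reader 
would expect an rdf:Bag. The lenient reader returns the keywords either way 
rather than throwing.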

If you apply the TIKA-1285 patch, you can uncomment the System.err.println 
calls to see which messages DomXmpParser fails with. Be sure to also apply the 
PDFBOX-2318 patch, which fixes a few of the easier issues with that parser.

> Extract images from PDF documents
> ---------------------------------
>
>                 Key: TIKA-1268
>                 URL: https://issues.apache.org/jira/browse/TIKA-1268
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 1.6
>
>
> It would be nice if images within PDF documents could be extracted much like 
> embedded attachments are now being handled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
