[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

Tim Allison (JIRA) Wed, 24 Sep 2014 10:44:48 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146588#comment-14146588 ]


Tim Allison commented on TIKA-1396:
-----------------------------------

Yes, I can think of a few options.  We still need to add tags in the PDFParser 
and RTFParser, and I'll do that on TIKA-1427. Thank you for opening that.

You could use a ParserContainerExtractor to extract each file, or you could use 
an EmbeddedDocumentExtractor (see TikaCLI in tika-app or UnpackerResource in 
tika-server for examples).

You might also try the RecursiveParserWrapper that I just added to trunk if you 
know that your docs will be small enough to hold in memory.  With that, you 
parse a document and then call getMetadata() on the parser.  It returns a list 
of Metadata objects: the first is the parent document, followed by one 
Metadata object for each attachment.  The text can be stored in a metadata 
field, depending on which ContentHandlerFactory you pass in, so you would just 
iterate through the list to get the metadata and content for each embedded doc.
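
As a rough illustration of that last step, here is a minimal sketch of walking 
such a list.  It does not call Tika: plain Map<String, String> objects stand in 
for Tika's Metadata, and the key names CONTENT_KEY and NAME_KEY, the helper 
methods, and the sample file names are all hypothetical.  The only thing it is 
meant to show is the list layout described above (index 0 is the parent, the 
rest are attachments, and the text sits in a metadata field).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in model only: Map<String, String> plays the role of Tika's Metadata.
class MetadataListSketch {
    // Hypothetical field names; the real field that holds the text depends on
    // the wrapper and on the ContentHandlerFactory that was passed in.
    static final String CONTENT_KEY = "content";
    static final String NAME_KEY = "resourceName";

    // Builds one stand-in metadata object for a document or attachment.
    static Map<String, String> entry(String name, String text) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put(NAME_KEY, name);
        m.put(CONTENT_KEY, text);
        return m;
    }

    // Returns one line per document. The first list element is treated as the
    // parent document and every later element as an embedded attachment.
    static List<String> summarize(List<Map<String, String>> metadataList) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < metadataList.size(); i++) {
            Map<String, String> m = metadataList.get(i);
            String role = (i == 0) ? "parent" : "embedded";
            lines.add(role + " " + m.get(NAME_KEY) + ": " + m.get(CONTENT_KEY));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample data, made up for the sketch.
        List<Map<String, String>> list = new ArrayList<>();
        list.add(entry("tika_images.pdf", "body text"));
        list.add(entry("image0.jpg", ""));
        for (String line : summarize(list)) {
            System.out.println(line);
        }
    }
}
```

With the real wrapper, the list would come from getMetadata() after the parse, 
and the loop body would read the fields from each Metadata object instead.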



> Embedded images in PDF documents
> --------------------------------
>
>                 Key: TIKA-1396
>                 URL: https://issues.apache.org/jira/browse/TIKA-1396
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: *OS:* 
> Ubuntu 14.04.1 LTS
> *KERNEL:*
> 3.13.0-33-generic 
> gcc version 4.8.2
> *JAVA:*
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
>            Reporter: Damiano
>            Priority: Critical
>             Fix For: 1.6
>
>         Attachments: tika_images.pdf
>
>
> Hello!
> I just found a problem with PDF documents that have embedded images.
> Doing:
> java -jar tika-app-1.5.jar --extract tika.pdf
> Tika cannot find the image.
> Is this a PDF-related problem? If I do the same operation with a DOC 
> document, Tika finds the image correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
