[ 
https://issues.apache.org/jira/browse/TIKA-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709848#comment-14709848
 ] 

Tim Allison commented on TIKA-1715:
-----------------------------------

If this is a usage question, it is probably better to ask on [email protected].

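The RecursiveParserWrapper is intended only for extracting text content 
and metadata, not the actual bytes of embedded files.  It does not write 
anything to disk, so the names it reports in the metadata are labels, not 
paths you can open.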

To see an example of how to extract the bytes of embedded files, see this 
[example|https://svn.apache.org/viewvc/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java?revision=1696751&view=markup].

Note that the caveat on TIKA-1674 still applies: this approach only extracts 
the bytes of the immediate children of the main document.  It will not pull out 
grandchildren or deeper descendants.  This is the current behavior of 
tika-app.jar's -z option and tika-server's /unpack endpoint.
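
For the "save to another location" part of the question, here is a minimal sketch of the file-writing step only, using just the JDK. The class EmbeddedFileSaver and its save method are hypothetical names invented for illustration; they are not part of Tika. The sketch assumes the caller already holds the embedded document's InputStream and whatever name the parser reported for it, and it treats that name as a label rather than a path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not part of Tika): writes embedded streams into one directory. */
final class EmbeddedFileSaver {
    private final Path outputDir;
    private int count = 0;

    EmbeddedFileSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    /**
     * Copies the stream into outputDir and returns the file that was written.
     * Any directory part of the reported name is dropped, and a running counter
     * is prefixed so unnamed or identically named attachments do not collide.
     */
    Path save(InputStream embedded, String reportedName) throws IOException {
        String safeName = "file"; // fallback when the parser reported no usable name
        if (reportedName != null) {
            // Keep only the last path segment, whichever separator was used.
            String normalized = reportedName.replace('\\', '/');
            String last = normalized.substring(normalized.lastIndexOf('/') + 1);
            if (!last.isEmpty()) {
                safeName = last;
            }
        }
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(count++ + "-" + safeName);
        Files.copy(embedded, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

In a Tika setup, a call such as saver.save(stream, metadata.get(Metadata.RESOURCE_NAME_KEY)) would presumably sit inside the parseEmbedded method of a custom EmbeddedDocumentExtractor registered on the ParseContext; that wiring is an assumption here and is not shown. The sketch does not filter characters that are illegal in file names on some platforms.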

As for the speed, yes, it can be slow on some files.  That's why we chose not 
to extract inline images by default.  If you are finding better performance 
with PDFBox's ExtractImages, let us know!



> Save embedded images into another location
> ------------------------------------------
>
>                 Key: TIKA-1715
>                 URL: https://issues.apache.org/jira/browse/TIKA-1715
>             Project: Tika
>          Issue Type: Test
>          Components: metadata
>    Affects Versions: 1.10
>            Reporter: Damiano
>              Labels: newbie
>
> Hello,
> I am having a strange problem dealing with embedded images.
> This is my code:
> {code:java}
>     public void getImages() throws IOException, TikaException, SAXException {
>         
>         try (InputStream stream = new FileInputStream(this.fileName)) {
>             RecursiveParserWrapper p = new RecursiveParserWrapper(
>                 new AutoDetectParser(),
>                 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1)
>             );            
>             
>             ParseContext context = new ParseContext();
>             PDFParserConfig config = new PDFParserConfig();
>             config.setExtractInlineImages(true);
>             config.setExtractUniqueInlineImagesOnly(true);
>             context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
>             context.set(org.apache.tika.parser.Parser.class, p);            
>             
>             p.parse(stream, new BodyContentHandler(-1), new Metadata(), context);
>             
>             List<Metadata> metadatas = p.getMetadata();
>                         
>             FileInputStream f = new FileInputStream("/tmp/" + metadatas.get(1).get("File Name"));
>             //FileInputStream f = new FileInputStream(metadatas.get(1).get("File Name"));
>             
>             System.out.println(f.available());
>         }
>     }
> {code}
> I can get the name of the embedded images with get("File Name") but the path 
> seems invalid.
> I need to save all the embedded images (inline images) to another location.
> Thank you in advance!



