Simon Lucy  created TIKA-3348:
---------------------------------

             Summary: Improve the workflow for extracting and returning images 
from PDFs and other containers using Tika Server
                 Key: TIKA-3348
                 URL: https://issues.apache.org/jira/browse/TIKA-3348
             Project: Tika
          Issue Type: Improvement
          Components: server
    Affects Versions: 1.25
            Reporter: Simon Lucy 
             Fix For: 2.0


There's a set of bumps in the road to navigate when extracting images from 
PDFs, retrieving them and managing the metadata using Tika Server.

The first is knowing that /unpack will do the basic job and return the embedded 
objects in a zip file (presuming setExtractInlineImages is True). Documenting 
this clearly in the Tika Server wiki page would help people enormously.
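
As a rough illustration of that first step, here is a minimal Python sketch of a client calling /unpack and listing the returned zip. It assumes a server on the default port 9998 and that inline image extraction can be switched on per request with the X-Tika-PDFextractInlineImages header; the function names are mine, not part of Tika, and this is not tested against every server version.

```python
import io
import urllib.request
import zipfile


def build_unpack_request(server_url: str, pdf_bytes: bytes) -> urllib.request.Request:
    """Build a PUT request for Tika Server's /unpack endpoint."""
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/unpack",
        data=pdf_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "application/zip",
            # Assumption: this header enables inline image extraction
            # for this request only.
            "X-Tika-PDFextractInlineImages": "true",
        },
    )


def list_zip_entries(zip_bytes: bytes) -> list:
    """Return the entry names of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()


def unpack_embedded(server_url: str, pdf_bytes: bytes) -> list:
    """Send a PDF to /unpack and list the returned entries (needs a running server)."""
    request = build_unpack_request(server_url, pdf_bytes)
    with urllib.request.urlopen(request) as response:
        return list_zip_entries(response.read())
```

Something like unpack_embedded("http://localhost:9998", open("report.pdf", "rb").read()) would then give the names of the extracted objects, but nothing about their types.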

But processing those images after they've been extracted requires either 
inspecting them with another tool or calling /rmeta to return the mime types 
and the rest of the metadata.

This means that multiple passes need to be made over the same file, and the 
same steps of extraction, identification and temporary storage are repeated on 
each pass.
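
To make the second pass concrete, here is a sketch of what a client has to do today with the /rmeta JSON to recover the mime types. It assumes the response is a JSON list with one metadata object per document, that embedded objects carry an "X-TIKA:embedded_resource_path" key, and that the last path segment matches the entry name in the /unpack zip. That last point is an assumption on my part and may not hold for every container type.

```python
import json
import posixpath


def mime_types_by_name(rmeta_json: str) -> dict:
    """Map embedded resource names to their Content-Type from /rmeta output."""
    result = {}
    for entry in json.loads(rmeta_json):
        path = entry.get("X-TIKA:embedded_resource_path")
        if not path:
            # No embedded path: assumed to be the container document itself.
            continue
        # Assumption: the basename is the name used in the /unpack zip.
        result[posixpath.basename(path)] = entry.get("Content-Type")
    return result
```

The client then still has to join this mapping against the zip entries from /unpack by hand, which is the duplication described above.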

The /rmeta and /unpack endpoints need to be merged. The simplest approach may 
be to write the /rmeta metadata into the __META__ file object added to the 
returned zip file. A more complicated but perhaps more hypermedia-friendly 
approach would be to use content negotiation and send an Accept: 
application/zip header in the /rmeta request.
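
For the simpler of the two options, the client side could collapse to a single pass over one archive. The sketch below is purely hypothetical: it assumes the returned zip contains a __META__ entry holding the same JSON list that /rmeta produces, which is what I am proposing rather than anything the server does now.

```python
import io
import json
import zipfile

META_ENTRY = "__META__"  # entry name as used in this proposal


def split_unpack_archive(zip_bytes: bytes):
    """Split a returned zip into (metadata list, {entry name: bytes}).

    Hypothetical: assumes META_ENTRY holds /rmeta-style JSON.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        if META_ENTRY in names:
            metadata = json.loads(archive.read(META_ENTRY))
        else:
            metadata = []  # archive without the proposed metadata entry
        files = {name: archive.read(name) for name in names if name != META_ENTRY}
    return metadata, files
```

With that in place, one request would give both the extracted images and their metadata, with no second call to /rmeta.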

I've set a Fix Version of 2.0 because this is, if not a breaking change, a 
considerable one.

I'm available for Help Wanted, if that helps.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
