[jira] [Updated] (TIKA-3348) Improve the workflow for extracting and returning images from PDFs and other containers using Tika Server..

Tim Allison (Jira) Wed, 21 Jul 2021 15:14:28 -0700


     [ https://issues.apache.org/jira/browse/TIKA-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tim Allison updated TIKA-3348:
------------------------------
    Fix Version/s:     (was: 2.0.0)
                   2.0.0-BETA

> Improve the workflow for extracting and returning images from PDFs and other 
> containers using Tika Server..
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3348
>                 URL: https://issues.apache.org/jira/browse/TIKA-3348
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.25
>            Reporter: Simon Lucy 
>            Priority: Major
>             Fix For: 2.0.0-BETA
>
>
> There's a set of bumps in the road to navigate when extracting images from 
> PDFs, retrieving them and managing the metadata using Tika Server.
> The first is knowing that /unpack will do the basic job and return the 
> embedded objects in a zip file (presuming setExtractInlineImages is true). 
> Documenting this clearly on the Tika Server wiki page would help people 
> enormously.
> But processing those images after they've been extracted requires either 
> inspecting them with another tool or calling /rmeta to return the MIME 
> types and the rest of the metadata.
> This means making multiple passes over the same file, repeating the same 
> extraction, identification and temporary storage steps on each pass.
> The server processes of /rmeta and /unpack need to be merged. The simplest 
> approach may be to write the /rmeta metadata into the __META__ file object 
> added to the returned zip file. A more complicated, but perhaps more 
> hypermedia-friendly, way would be to use content negotiation and send 
> Accept: application/zip in the /rmeta request.
> I've indicated a Fix Version of 2.0 because, if not a breaking change, it 
> is a considerable one.
> I'm available for Help Wanted, if that helps.
>  
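
For illustration, the two-pass workaround the description complains about might look roughly like the Python sketch below. It is not taken from the issue or from Tika itself. The server address, the helper names (put, pair_entries, extract_with_metadata) and, in particular, the idea that zip entry names from /unpack match the "resourceName" field in the /rmeta output are all assumptions that may not hold for a given Tika version.

```python
import io
import json
import urllib.request
import zipfile

TIKA_URL = "http://localhost:9998"  # assumption: a locally running Tika Server


def put(path, data, accept):
    """PUT raw document bytes to a Tika Server endpoint and return the body."""
    req = urllib.request.Request(
        TIKA_URL + path, data=data, method="PUT", headers={"Accept": accept}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def pair_entries(zip_bytes, rmeta_json, name_key="resourceName"):
    """Pair each zip entry with the metadata record whose name_key equals
    the entry name. Matching by name is an assumption; unmatched entries
    map to None."""
    records = json.loads(rmeta_json)
    by_name = {
        r[name_key]: r for r in records if isinstance(r.get(name_key), str)
    }
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {name: by_name.get(name) for name in zf.namelist()}


def extract_with_metadata(doc_bytes):
    """Hypothetical helper: two full passes over the same document."""
    zip_bytes = put("/unpack", doc_bytes, "application/zip")    # pass 1: files
    rmeta_json = put("/rmeta", doc_bytes, "application/json")   # pass 2: metadata
    return pair_entries(zip_bytes, rmeta_json)
```

Each call to put() uploads and parses the whole document again, which is the duplicated extraction the description wants to avoid. Either proposal in the issue (metadata shipped inside the zip, or a zip returned from /rmeta through the Accept header) would replace the two requests with one.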



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
