[jira] [Updated] (TIKA-3416) Extract logical images from PDFs

Tim Allison (Jira) Mon, 24 May 2021 09:49:06 -0700


     [ https://issues.apache.org/jira/browse/TIKA-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tim Allison updated TIKA-3416:
------------------------------
    Description: 
PDFs, bless their hearts, can store a logical image as hundreds or thousands of 
subimages that, when rendered, look like one image.  

We currently have the option to let the user render the page and run OCR on 
that rendered image, or the user can extract inline images.  There has to be a 
happier medium, and the user should get back the rendering in, e.g., the 
/unpack endpoint (see TIKA-3348).

It would be handy for some use cases to do the geometry to find bounding boxes 
for image components and then render those bounding boxes so that a human gets 
a "logical image" <hand_waving>most of the time</hand_waving>.
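
To make the geometry step concrete, here is a rough sketch (Python for brevity, not Tika or PDFBox code) of one way to group subimage bounding boxes into candidate "logical image" regions. It assumes the boxes have already been collected as (x0, y0, x1, y1) tuples in page units; the names merge_boxes, touches and the gap tolerance are made up for illustration.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in page units; assumed already extracted from the page.
Box = Tuple[float, float, float, float]


def touches(a: Box, b: Box, gap: float) -> bool:
    """True if a and b overlap or lie within `gap` units of each other."""
    return (a[0] <= b[2] + gap and b[0] <= a[2] + gap and
            a[1] <= b[3] + gap and b[1] <= a[3] + gap)


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def merge_boxes(boxes: List[Box], gap: float = 1.0) -> List[Box]:
    """Greedily merge boxes that overlap or nearly touch.

    Each returned box is a candidate region to render as one image.
    """
    merged: List[Box] = []
    for box in boxes:
        current = box
        changed = True
        # Keep absorbing neighbours: a grown box may reach boxes
        # that the original box did not touch.
        while changed:
            changed = False
            remaining: List[Box] = []
            for other in merged:
                if touches(current, other, gap):
                    current = union(current, other)
                    changed = True
                else:
                    remaining.append(other)
            merged = remaining
        merged.append(current)
    return merged


# Three stacked strips plus one unrelated image elsewhere on the page.
strips = [(0, 0, 100, 10), (0, 10, 100, 20), (0, 20, 100, 30),
          (200, 200, 250, 250)]
regions = merge_boxes(strips)
```

This greedy pass is quadratic in the worst case, which may or may not matter with thousands of subimages; a sweep line or a spatial index would be the obvious things to try if it does.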

There would have to be some heuristics for when to give up and just render the 
whole page, but I think we could do something that performed well enough.  More 
importantly, I'm sure this is a solved problem...any recs for efficient 
algorithms for this?
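
As a strawman for the "when to give up" heuristic, something like the sketch below might be a starting point (Python for brevity, not Tika code). The function name and both thresholds (max_regions, max_coverage) are placeholders I picked for illustration, not tuned values, and it assumes the regions are already merged so they do not overlap.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in page units.
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def should_render_whole_page(regions: List[Box], page: Box,
                             max_regions: int = 20,
                             max_coverage: float = 0.8) -> bool:
    """Decide whether to skip per-region rendering and render the page.

    Gives up when there are too many regions, or when the regions
    (assumed non-overlapping) cover most of the page anyway.
    """
    if len(regions) > max_regions:
        return True
    page_area = area(page)
    if page_area <= 0:
        return True  # nothing sensible to compare against
    covered = sum(area(r) for r in regions)
    return covered / page_area >= max_coverage


letter = (0, 0, 612, 792)  # US Letter in points
small = should_render_whole_page([(0, 0, 100, 30)], letter)
large = should_render_whole_page([(0, 0, 612, 700)], letter)
```

Here small comes out False (one small region, render just that box) and large comes out True (the region covers most of the page, so rendering the whole page costs about the same).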

What do you think?



  was:
PDFs, bless their hearts, can store a logical image as hundreds or thousands of 
subimages that when rendered, look like one image.

It would be handy for some use cases to do the geometry to find bounding boxes 
for image components and then render those bounding boxes so that a human gets 
a "logical image" <hand_waving>most of the time</hand_waving>.

There would have to be some heuristics for when to give up and just render the 
whole page, but I think we could do something that performed well enough.  More 
importantly, I'm sure this is a solved problem...any recs for efficient 
algorithms for this?

What do you think?




> Extract logical images from PDFs
> --------------------------------
>
>                 Key: TIKA-3416
>                 URL: https://issues.apache.org/jira/browse/TIKA-3416
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Major
>
> PDFs, bless their hearts, can store a logical image as hundreds or thousands 
> of subimages that when rendered, look like one image.  
> We currently have the option to let the user render the page and run OCR on 
> that rendered image, or the user can extract inline images.  There has to be 
> a happier medium, and the user should get back the rendering in, e.g., the 
> /unpack endpoint (see TIKA-3348).
> It would be handy for some use cases to do the geometry to find bounding 
> boxes for image components and then render those bounding boxes so that a 
> human gets a "logical image" <hand_waving>most of the time</hand_waving>.
> There would have to be some heuristics for when to give up and just render 
> the whole page, but I think we could do something that performed well enough. 
>  More importantly, I'm sure this is a solved problem...any recs for efficient 
> algorithms for this?
> What do you think?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
