[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

Tim Allison (JIRA) Thu, 04 Oct 2018 05:54:49 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638183#comment-16638183
 ]


Tim Allison edited comment on TIKA-2749 at 10/4/18 12:46 PM:
-------------------------------------------------------------

There are two basic options (see our [wiki on OCR and 
PDFs|https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR]):

1) run OCR on each inline image
2) render the page and then run OCR on that single image

My strawman, heuristic, 100% hackery proposal is this:

0) trigger OCR if fewer than 10 words are extracted from a page
1) if the page contains <= 5 inline images, run OCR on each of them (strategy 1)
2) if the page contains > 5 inline images, render the full page and run OCR on 
that single rendered image (strategy 2)
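
To make the decision order concrete, here is a minimal sketch of the heuristic as a pure function. It is only an illustration: the class name, enum, and constant names are hypothetical and are not part of Tika's API. The thresholds (10 words, 5 inline images) are the ones from the proposal and are assumed to be tunable.

```java
// Hypothetical sketch of the strawman heuristic; none of these names exist in Tika.
final class OcrHeuristic {

    enum Strategy { NO_OCR, OCR_INLINE_IMAGES, OCR_RENDERED_PAGE }

    // Thresholds taken from the proposal; assumed to be configurable in practice.
    static final int MIN_WORDS_WITHOUT_OCR = 10;
    static final int MAX_INLINE_IMAGES_FOR_STRATEGY_1 = 5;

    private OcrHeuristic() {}

    /**
     * Picks an OCR strategy for a single page.
     *
     * @param extractedWordCount words extracted from the page's text layer
     * @param inlineImageCount   inline images found on the page
     */
    static Strategy choose(int extractedWordCount, int inlineImageCount) {
        // Step 0: only trigger OCR when the text layer yields fewer than 10 words.
        if (extractedWordCount >= MIN_WORDS_WITHOUT_OCR) {
            return Strategy.NO_OCR;
        }
        // Step 1: few inline images, so OCR each of them (strategy 1).
        if (inlineImageCount <= MAX_INLINE_IMAGES_FOR_STRATEGY_1) {
            return Strategy.OCR_INLINE_IMAGES;
        }
        // Step 2: many inline images, so render the page and OCR that (strategy 2).
        return Strategy.OCR_RENDERED_PAGE;
    }
}
```

For example, `OcrHeuristic.choose(3, 12)` selects strategy 2, while `OcrHeuristic.choose(3, 2)` selects strategy 1. Read literally, a page with few words and zero inline images also falls under strategy 1, which would OCR nothing; whether that case should fall back to rendering the page is left open here.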

[~lfcnassif], I _think_ (0) above derives from one of your recommendations?  
Please chime in on this ticket. :D

This issue will take some time.  I don't plan to start on it any time 
soon.


> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
