[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640724#comment-16640724
 ] 

Luis Filipe Nassif edited comment on TIKA-2749 at 10/6/18 1:39 PM:
-------------------------------------------------------------------

Hi [~talli...@apache.org],

Yes, currently we run OCR (rendering the page) if there are fewer than 100 
characters in the page. Our main goal is to OCR scanned docs. If that is 
Tika's goal too, I think your proposal is very good.

If we want Tika to OCR more content, maybe:

1) if < 10 words in the page, OCR the rendered page (blank pages are fast to 
OCR, so the impact is minimal)

2) if >= 10 words in the page, OCR each extracted image
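
To make the two rules concrete, here is a rough sketch of the per-page decision. It is only an illustration: the class, enum, and method names are hypothetical and are not part of Tika or PDFBox, and counting whitespace-separated tokens is just one assumed way to define "words".

```java
/** Hypothetical sketch of the per-page decision; not an existing Tika API. */
final class PageOcrDecision {

    /** What to OCR for a given page (illustrative names). */
    enum Action { OCR_RENDERED_PAGE, OCR_EXTRACTED_IMAGES }

    /** Threshold from the proposal: fewer than 10 words suggests a scanned page. */
    static final int MIN_WORDS_FOR_TEXT_PAGE = 10;

    private PageOcrDecision() {
    }

    /** Counts whitespace-separated tokens in the text extracted from one page. */
    static int countWords(String pageText) {
        if (pageText == null) {
            return 0;
        }
        String trimmed = pageText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Applies rule 1 (render and OCR the page) or rule 2 (OCR each image). */
    static Action decide(String pageText) {
        return countWords(pageText) < MIN_WORDS_FOR_TEXT_PAGE
                ? Action.OCR_RENDERED_PAGE
                : Action.OCR_EXTRACTED_IMAGES;
    }
}
```

With this, a blank page (zero words) falls under rule 1, and a page with ten or more extracted words falls under rule 2. The threshold would presumably stay configurable.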

Maybe our PDFBox colleagues should be invited to join the discussion?



> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 


