[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

Luis Filipe Nassif (JIRA) Wed, 21 Nov 2018 15:19:28 -0800


    [ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695351#comment-16695351 ]


Luis Filipe Nassif commented on TIKA-2749:
------------------------------------------

Hi [~rleir]. Sorry, I meant that our main goal when OCRing PDFs is to get the
text from PDFs containing full-page scanned images. We don't worry much about
logos, graphs, and other small images between text.
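
To make that distinction concrete, here is a rough sketch of the kind of check I have in mind. It is not Tika code: the class name, the method and both thresholds are hypothetical and chosen only for illustration. It assumes the caller already knows the page size, the size of the largest image on the page, and how many characters normal text extraction returned.

```java
public final class ScanPageHeuristic {

    // Hypothetical thresholds, for illustration only; real values would need tuning.
    static final double MIN_IMAGE_COVERAGE = 0.8; // fraction of the page area
    static final int MAX_TEXT_CHARS = 10;         // extractable characters on the page

    /**
     * Returns true when a page looks like a full-page scan: its largest image
     * covers most of the page and almost no text could be extracted.
     */
    static boolean looksLikeScannedPage(double pageWidth, double pageHeight,
                                        double largestImageWidth, double largestImageHeight,
                                        int extractedTextChars) {
        double pageArea = pageWidth * pageHeight;
        if (pageArea <= 0) {
            return false; // degenerate page, nothing to decide
        }
        double coverage = (largestImageWidth * largestImageHeight) / pageArea;
        return coverage >= MIN_IMAGE_COVERAGE && extractedTextChars <= MAX_TEXT_CHARS;
    }

    public static void main(String[] args) {
        // A4 page (in points) filled by one image, no extractable text.
        System.out.println(looksLikeScannedPage(595, 842, 595, 842, 0));    // prints "true"
        // Same page with a small logo and plenty of extractable text.
        System.out.println(looksLikeScannedPage(595, 842, 100, 50, 2000));  // prints "false"
    }
}
```

Pages that pass such a check would be sent to OCR; pages with only logos or small graphs between text would be skipped.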

As for converting TIFF to PDF before OCR, I don't think it will improve OCR
quality, because the conversion should not change the contrast. You will still
have to adjust it manually.

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
