[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

Rick Leir (JIRA) Wed, 21 Nov 2018 08:36:07 -0800


    [ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694911#comment-16694911 ]


Rick Leir commented on TIKA-2749:
---------------------------------

Hi Luis [~lfcnassif]

Your main goal is "to OCR scanned docs". Can I assume you are happy using 
Tesseract and know how to use it directly?

I OCR'd 25M images as described in [https://github.com/rleir/c7atess]. My 
biggest challenge was pre-processing the TIFF images to improve contrast. I 
never converted them to PDF, and I did not use Tika. Is Tika helpful in 
achieving your goal? I don't mean to detract from Tika; I find it excellent 
for other purposes.
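
To illustrate what "improving contrast" can mean before OCR, here is a minimal 
sketch of a percentile-based linear contrast stretch on 8-bit grayscale pixel 
values. It is an assumption for illustration only and not necessarily what the 
c7atess scripts do; the function name `stretch_contrast` and the 2%/98% 
defaults are hypothetical, and loading the TIFF into a flat list of pixel 
values is left to whatever image library is in use.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly rescale 8-bit grayscale values.

    The value at the low percentile maps to 0 and the value at the high
    percentile maps to 255; anything outside that range is clipped.
    The percentile defaults are illustrative, not tuned values.
    """
    if not pixels:
        return []
    ordered = sorted(pixels)
    last = len(ordered) - 1
    lo = ordered[int(last * low_pct / 100.0)]
    hi = ordered[int(last * high_pct / 100.0)]
    if hi <= lo:
        # Flat (or nearly flat) image: there is no range to stretch.
        return list(pixels)
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((p - lo) * scale))) for p in pixels]


if __name__ == "__main__":
    # A washed-out scan: all values crowded into a narrow grey band.
    faded = [100, 110, 120, 130, 140]
    print(stretch_contrast(faded, low_pct=0, high_pct=100))
```

Clipping at percentiles rather than at the absolute minimum and maximum keeps 
a few stray specks or dark borders from dominating the stretch, which is 
usually what a scanned page needs.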

 

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
