[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

Rick Leir (JIRA) Wed, 21 Nov 2018 08:58:57 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694939#comment-16694939
 ]


Rick Leir commented on TIKA-2749:
---------------------------------

Hi Tim,

Yes, the "just work" goal is great. However, as the wiki you linked to says, 
there would be a large performance hit whenever Tika elects to OCR an image. 
Perhaps we can assume that when Tesseract has been installed and configured, 
images in PDFs should be automatically extracted and OCR'd. Note: I have had 
no need for OCR recently, so this is just talk.   Cheers – Rick

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
