[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695351#comment-16695351 ]
Luis Filipe Nassif commented on TIKA-2749:
------------------------------------------

Hi [~rleir]. Sorry, I meant that our main goal when OCRing PDFs is to get the text from PDFs that consist of full-page scanned images. We are not too concerned about logos, graphs, and other small images between text.

As for converting TIF to PDF before OCR, I don't think it will improve OCR quality, because the conversion should not change the contrast. You will still have to adjust it manually.

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.