[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694911#comment-16694911 ]
Rick Leir commented on TIKA-2749:
---------------------------------

Hi Luis [~lfcnassif]

Your main goal is "to ocr scanned docs". Can I assume you are happy using Tesseract, and know how to use it directly?

I OCR'd 25M images as described in [https://github.com/rleir/c7atess]. My biggest challenge was to pre-process the TIFF images to improve contrast. I never converted them to PDF, and I did not use Tika.

Is Tika helpful in achieving your goal? Not meaning to detract from Tika, I find it excellent for other purposes.

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)