[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638183#comment-16638183 ]
Tim Allison edited comment on TIKA-2749 at 10/4/18 12:46 PM:
-------------------------------------------------------------

The two basic options (see our [wiki on OCR and PDFs|https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR]):
1) run OCR on each inline image
2) render the page and then run OCR on that single image

My strawman, heuristic, 100% hackery proposal is this:
0) trigger OCR if fewer than 10 words are extracted from a page
1) if a page contains <= 5 inline images, run OCR on each of the inline images (strategy 1)
2) if a page contains > 5 inline images, render the full page and run OCR on that (strategy 2)

[~lfcnassif], I _think_ (0) above derives from one of your recommendations? Please chime in on this ticket. :D

This issue will take some time. I don't plan to move on it any time soon.

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
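
For discussion, the strawman heuristic in the comment could be sketched roughly as below. This is only an illustration of the decision logic, not Tika code: the class, enum, and method names are hypothetical, the thresholds (10 words, 5 inline images) are the ones proposed in the comment, and counting words by splitting on whitespace is an assumption.

```java
/**
 * Hypothetical sketch of the per-page OCR heuristic proposed in TIKA-2749.
 * None of these names exist in Tika; they are for illustration only.
 */
public class OcrStrategyChooser {

    /** What to do with a single PDF page. */
    public enum OcrStrategy { NONE, PER_INLINE_IMAGE, RENDER_FULL_PAGE }

    // Thresholds taken from the proposal in the comment.
    public static final int MIN_WORDS_TO_SKIP_OCR = 10;
    public static final int MAX_INLINE_IMAGES_FOR_PER_IMAGE_OCR = 5;

    /** Assumption: a "word" is a whitespace-delimited token of the extracted text. */
    public static int countWords(String pageText) {
        if (pageText == null) {
            return 0;
        }
        String trimmed = pageText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Step 0: only trigger OCR when fewer than 10 words were extracted.
     * Step 1: with 5 or fewer inline images, OCR each image (strategy 1).
     * Step 2: with more than 5 inline images, render the page and OCR that (strategy 2).
     */
    public static OcrStrategy choose(int extractedWordCount, int inlineImageCount) {
        if (extractedWordCount >= MIN_WORDS_TO_SKIP_OCR) {
            return OcrStrategy.NONE;
        }
        if (inlineImageCount <= MAX_INLINE_IMAGES_FOR_PER_IMAGE_OCR) {
            return OcrStrategy.PER_INLINE_IMAGE;
        }
        return OcrStrategy.RENDER_FULL_PAGE;
    }

    public static void main(String[] args) {
        int words = countWords("Scanned page");
        System.out.println(choose(words, 3));   // few words, few images
        System.out.println(choose(words, 12));  // few words, many images
        System.out.println(choose(250, 12));    // plenty of extracted text
    }
}
```

Note that, taken literally, a page with fewer than 10 words and no inline images falls into strategy 1 and so nothing is OCR'ed; whether such pages should instead be rendered is left open here, as it is in the proposal.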