[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733501#comment-16733501 ]
Tim Allison commented on TIKA-2749:
-----------------------------------

On caching, that would be neat, but I worry that it will be application-specific...or "handler"-specific.

On configurability of turning off OCR per file, that seems doable outside of the hack I recommended.

On adding a metadata value for "has images that might be OCR-able"...if you use the /rmeta endpoint on documents generally, you should be able to determine fairly easily whether there are image files...but I think we could do better for PDFs (the point of this ticket!)...So, yes, that's a great recommendation.

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
> Key: TIKA-2749
> URL: https://issues.apache.org/jira/browse/TIKA-2749
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)