From the DjVu point of view (a particular case), a possible flow could be as follows:

1. Extract images.
2. For each image, extract text using OCR:
   2.1. Detect the language.
   2.2. Detect the font type.
   .....
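The flow above could be wired into Tika by shelling out to the Tesseract CLI that the issue mentions. A minimal sketch, assuming `tesseract` is available on the PATH; the class and method names here (`OcrSketch`, `buildCommand`, `extractText`) are hypothetical, not part of Tika or Tesseract:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class OcrSketch {

    // Build the command line: tesseract <input-image> <output-base>.
    // Tesseract writes its result to <output-base>.txt.
    static List<String> buildCommand(Path image, Path outBase) {
        return Arrays.asList("tesseract", image.toString(), outBase.toString());
    }

    // Run Tesseract on one extracted image and return the recognized text.
    static String extractText(Path image) throws IOException, InterruptedException {
        Path outBase = Files.createTempFile("ocr", "");
        Process p = new ProcessBuilder(buildCommand(image, outBase))
                .redirectErrorStream(true)
                .start();
        p.waitFor();
        return new String(Files.readAllBytes(Paths.get(outBase + ".txt")));
    }

    public static void main(String[] args) {
        // Just show the command that would be run for one image.
        List<String> cmd = buildCommand(Paths.get("page.png"), Paths.get("out"));
        System.out.println(String.join(" ", cmd));  // tesseract page.png out
    }
}
```

Detected language and font type could then be attached to the parse result as metadata alongside the extracted text.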
So, language and font type may be used to provide metadata. I think it should be as seamless as possible. It's also interesting to hear what you think/see/hope ...

Best regards,
Oleg

On Mon, Jan 14, 2013 at 10:58 PM, Pei Chen (JIRA) <j...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553107#comment-13553107 ]
>
> Pei Chen commented on TIKA-93:
> ------------------------------
>
> I tried their javaocr-20100605 release with just ASCII scanned digits and
> it seems to work as advertised. It was fairly easy to use and set up.
> However, I noticed that their latest release has a lot of work geared
> towards Android development. I haven't had a chance to try integrating it
> with Tika yet, though.
> Are there any preferences on how it should flow in the context of Tika?
>
> > OCR support
> > -----------
> >
> > Key: TIKA-93
> > URL: https://issues.apache.org/jira/browse/TIKA-93
> > Project: Tika
> > Issue Type: New Feature
> > Components: parser
> > Reporter: Jukka Zitting
> > Priority: Minor
> >
> > I don't know of any decent open-source, pure-Java OCR libraries, but
> > there are command-line OCR tools like Tesseract
> > (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika
> > to extract text content (where available) from image files.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators.
> For more information on JIRA, see: http://www.atlassian.com/software/jira