From the DejaVu (a particular case) point of view, a possible flow could be as follows:
1. Extract images
2. For each image extract text using OCR
2.1 Detect language
2.2 Detect font type
...
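
To make the flow above a bit more concrete, here is a rough Java sketch. It is only an illustration under my own assumptions: ImageExtractor, OcrEngine, OcrResult and the "image.N.language" / "image.N.fontType" metadata keys are hypothetical names, not existing Tika or JavaOCR API, and the OCR step is stubbed out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OcrFlowSketch {

    /** Hypothetical result of OCR on one image: text plus detected properties. */
    static final class OcrResult {
        final String text;
        final String language;
        final String fontType;

        OcrResult(String text, String language, String fontType) {
            this.text = text;
            this.language = language;
            this.fontType = fontType;
        }
    }

    /** Hypothetical step 1: split a document into images. */
    interface ImageExtractor {
        List<byte[]> extractImages(byte[] document);
    }

    /** Hypothetical steps 2, 2.1, 2.2: OCR with language and font detection. */
    interface OcrEngine {
        OcrResult recognize(byte[] image);
    }

    /**
     * Runs the flow: extract images, OCR each one, append the text to textOut,
     * and record the detected language and font type as metadata entries.
     */
    static Map<String, String> process(byte[] document, ImageExtractor extractor,
                                       OcrEngine ocr, StringBuilder textOut) {
        Map<String, String> metadata = new LinkedHashMap<>();
        int index = 0;
        for (byte[] image : extractor.extractImages(document)) {
            OcrResult result = ocr.recognize(image);
            textOut.append(result.text).append('\n');
            // Assumed key names, chosen only for this sketch.
            metadata.put("image." + index + ".language", result.language);
            metadata.put("image." + index + ".fontType", result.fontType);
            index++;
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Stub extractor: treats the whole document as a single image.
        ImageExtractor extractor = doc -> {
            List<byte[]> images = new ArrayList<>();
            images.add(doc);
            return images;
        };
        // Stub OCR engine: returns fixed values instead of real recognition.
        OcrEngine ocr = image -> new OcrResult("hello", "en", "serif");

        StringBuilder text = new StringBuilder();
        Map<String, String> metadata = process(new byte[] {1, 2, 3}, extractor, ocr, text);
        System.out.println(text.toString().trim());
        System.out.println(metadata);
    }
}
```

The caller only sees extracted text plus metadata, which is what I mean by keeping it seamless; a real OCR backend would go behind the OcrEngine interface.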

So the detected language and font type could be exposed as metadata.
I think it should be as seamless as possible.

I'd also be interested to hear what you think/see/hope ...

Best regards,

Oleg



On Mon, Jan 14, 2013 at 10:58 PM, Pei Chen (JIRA) <j...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553107#comment-13553107]
>
> Pei Chen commented on TIKA-93:
> ------------------------------
>
> I tried their javaocr-20100605 release with just ASCII scanned digits and
> it seems to work as advertised.  It was fairly easy to use and set up.
> However, I noticed that their latest release has a lot of work geared
> towards Android development.  I haven't had a chance to try integrating it
> with Tika yet.
> Are there any preferences on how it should flow in the context of Tika?
>
> > OCR support
> > -----------
> >
> >                 Key: TIKA-93
> >                 URL: https://issues.apache.org/jira/browse/TIKA-93
> >             Project: Tika
> >          Issue Type: New Feature
> >          Components: parser
> >            Reporter: Jukka Zitting
> >            Priority: Minor
> >
> > I don't know of any decent open source pure Java OCR libraries, but
> > there are command line OCR tools like Tesseract (
> > http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> > extract text content (where available) from image files.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
