[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895897#comment-13895897
 ] 

Nick Burch commented on TIKA-93:
--------------------------------

Generally speaking, when a parser finds embedded resources, it calls out to the 
Parser set on the parse context to have them processed. You could therefore set 
your OCR Parser there, and it'd be called for all kinds of embedded resources. 
It can then OCR any suitable images it finds, and pass everything else on to 
another parser (e.g. DefaultParser) to have the non-OCR-able embedded parts 
handled (if required).
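
As a rough illustration of that delegation, here is a minimal stdlib-only 
sketch. It deliberately does not use Tika's classes: SimpleParser, 
FallbackParser, OcrDelegatingParser and runOcr are hypothetical stand-ins, the 
media type is passed in as a plain string, and the OCR step is a placeholder 
rather than a real engine call. In Tika itself the same shape would sit behind 
the Parser interface, with the instance placed on the parse context.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;

class EmbeddedOcrSketch {
    // Hypothetical stand-in for a parser interface; NOT Tika's actual API.
    interface SimpleParser {
        String parse(String mediaType, byte[] data);
    }

    // Plays the role of the "everything else" parser: treats bytes as UTF-8 text.
    static final class FallbackParser implements SimpleParser {
        @Override
        public String parse(String mediaType, byte[] data) {
            return new String(data, StandardCharsets.UTF_8);
        }
    }

    // OCRs the image types it knows about and delegates the rest to the fallback.
    static final class OcrDelegatingParser implements SimpleParser {
        private static final Set<String> OCR_TYPES =
                Set.of("image/png", "image/jpeg", "image/tiff");
        private final SimpleParser fallback;

        OcrDelegatingParser(SimpleParser fallback) {
            this.fallback = fallback;
        }

        @Override
        public String parse(String mediaType, byte[] data) {
            if (OCR_TYPES.contains(mediaType)) {
                return runOcr(data);
            }
            return fallback.parse(mediaType, data);
        }

        // Placeholder only: a real implementation would invoke an OCR tool here.
        private String runOcr(byte[] data) {
            return "[OCR text for " + data.length + " bytes]";
        }
    }

    public static void main(String[] args) {
        SimpleParser embedded = new OcrDelegatingParser(new FallbackParser());
        // An embedded image goes to the OCR branch.
        System.out.println(embedded.parse("image/png", new byte[] {1, 2, 3}));
        // Anything else is passed on untouched.
        System.out.println(
                embedded.parse("text/plain", "hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The fallback is injected rather than hard-coded, so the non-OCR-able parts can 
be handled by whatever parser is appropriate, or skipped if not required.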

To handle OCRing of top-level content, e.g. images, you'd need to register your 
OCR parser as the parser for those types, in place of (or possibly even 
wrapping) the default parser.
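
The two options for top-level content ("in place of" versus "wrapping") can be 
sketched as follows. Again this is only a stdlib model under stated 
assumptions: Registry, register, wrapWithOcr and ocr are hypothetical names, 
parsers are modelled as plain functions from bytes to text, and none of this 
is Tika's real configuration mechanism.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class TopLevelOcrSketch {
    // Hypothetical type-to-parser registry; NOT Tika's actual configuration API.
    static final class Registry {
        private final Map<String, Function<byte[], String>> byType = new HashMap<>();
        private final Function<byte[], String> defaultParser;

        Registry(Function<byte[], String> defaultParser) {
            this.defaultParser = defaultParser;
        }

        // A later registration replaces an earlier one for the same type.
        void register(String mediaType, Function<byte[], String> parser) {
            byType.put(mediaType, parser);
        }

        // Types without an explicit registration go to the default parser.
        String parse(String mediaType, byte[] data) {
            return byType.getOrDefault(mediaType, defaultParser).apply(data);
        }
    }

    // Wrapping variant: keep the inner parser's output and append the OCR text.
    static Function<byte[], String> wrapWithOcr(Function<byte[], String> inner) {
        return data -> inner.apply(data) + " " + ocr(data);
    }

    // Placeholder only: a real implementation would invoke an OCR tool here.
    static String ocr(byte[] data) {
        return "[OCR text for " + data.length + " bytes]";
    }

    public static void main(String[] args) {
        Function<byte[], String> defaultParser = data -> "[metadata only]";
        Registry registry = new Registry(defaultParser);

        // "In place of": PNGs are handled by the OCR parser alone.
        registry.register("image/png", TopLevelOcrSketch::ocr);
        // "Wrapping": TIFFs get the default output plus the OCR text.
        registry.register("image/tiff", wrapWithOcr(defaultParser));

        System.out.println(registry.parse("image/png", new byte[2]));
        System.out.println(registry.parse("image/tiff", new byte[2]));
        System.out.println(registry.parse("application/pdf", new byte[2]));
    }
}
```

Wrapping keeps whatever the default parser already extracts for the type, at 
the cost of running both parsers over the same bytes.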

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
