[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133962#comment-14133962 ]
Luis Filipe Nassif commented on TIKA-93:
----------------------------------------
Another, unrelated idea is to call the ImageParser that supports the image
from inside TesseractOCRParser, so that it extracts image metadata too. We
could then list TesseractOCRParser in the default service provider parser
list, and the image tests would pass. OCR could be disabled by default and
enabled through ocrParserConfig.
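
To make the proposal concrete, here is a minimal, self-contained sketch of the delegation idea. It does not use the Tika API: SimpleParser, OcrConfig, MetadataOnlyImageParser, OcrParser and runOcr are all hypothetical stand-ins invented for illustration, and the "OCR" step is a placeholder rather than a call to Tesseract. It only shows the control flow: always delegate to the image parser for metadata, and run OCR only when the config enables it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OcrDelegationSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface SimpleParser {
        void parse(byte[] image, Map<String, String> metadata, StringBuilder text);
    }

    /** Hypothetical stand-in for ocrParserConfig; OCR is off by default. */
    static class OcrConfig {
        boolean ocrEnabled = false;
    }

    /** Stand-in for the image parser: records metadata only, emits no text. */
    static class MetadataOnlyImageParser implements SimpleParser {
        @Override
        public void parse(byte[] image, Map<String, String> metadata, StringBuilder text) {
            metadata.put("byteLength", String.valueOf(image.length));
        }
    }

    /** Stand-in for the OCR parser: delegates for metadata, then optionally runs OCR. */
    static class OcrParser implements SimpleParser {
        private final SimpleParser imageParser;
        private final OcrConfig config;

        OcrParser(SimpleParser imageParser, OcrConfig config) {
            this.imageParser = imageParser;
            this.config = config;
        }

        @Override
        public void parse(byte[] image, Map<String, String> metadata, StringBuilder text) {
            // Always collect image metadata, whether or not OCR is enabled.
            imageParser.parse(image, metadata, text);
            // Only pay the OCR cost when it was explicitly enabled.
            if (config.ocrEnabled) {
                text.append(runOcr(image));
            }
        }

        /** Placeholder; a real implementation would invoke the OCR tool here. */
        String runOcr(byte[] image) {
            return "[recognized text placeholder]";
        }
    }

    public static void main(String[] args) {
        OcrConfig config = new OcrConfig(); // OCR disabled by default
        SimpleParser parser = new OcrParser(new MetadataOnlyImageParser(), config);

        Map<String, String> metadata = new LinkedHashMap<>();
        StringBuilder text = new StringBuilder();
        parser.parse(new byte[] {1, 2, 3}, metadata, text);

        System.out.println("metadata = " + metadata);
        System.out.println("text length = " + text.length());
    }
}
```

With this shape, a parser registered by default behaves like a plain image parser until OCR is switched on, which is why the existing image tests would be expected to keep passing.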
> OCR support
> -----------
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.7
>
> Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch,
> TesseractOCRParser.patch, TesseractOCR_Tyler.patch,
> TesseractOCR_Tyler_v2.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.