[
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895080#comment-13895080
]
Grant Ingersoll commented on TIKA-93:
-------------------------------------
I thought about the Parser approach, but it doesn't really feel like a Parser.
That is, many different things may be images or have embedded images (PDFs,
actual images like JPG, etc., embedded images in Word/PPT docs), so I want to
take the MIME type and feed it, optionally, to the OCR engine, which extracts the
images and produces one or more items of text. That gives me back something I
can then pass along to the Parser.
So, for instance, in the case of a PPT with embedded images, you would:
# Detect PPT
# Extract/OCR Images
# Feed to PPT/POI Parser
# Obtain glory
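To make the steps above concrete, here is a rough sketch of that flow. Every name in it (OcrPipelineSketch, OcrEngine, DocumentParser, process) is made up for illustration and is not an existing Tika API; it also assumes MIME detection has already happened and that the parser can accept the OCR'd text as an extra input.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration only; none of these are Tika APIs.
final class OcrPipelineSketch {

    /** Pulls images out of a document of the given MIME type and returns their text. */
    interface OcrEngine {
        List<String> extractText(String mimeType, byte[] document);
    }

    /** Stand-in for a format-specific parser, e.g. a POI-backed PPT parser. */
    interface DocumentParser {
        String parse(byte[] document, List<String> ocrText);
    }

    /**
     * Step 1 (detection) is assumed to have produced mimeType already.
     * Step 2 runs OCR only when an engine was supplied.
     * Step 3 hands the document plus any OCR text to the matching parser.
     */
    static String process(String mimeType, byte[] document,
                          Map<String, DocumentParser> parsers, OcrEngine ocr) {
        DocumentParser parser = parsers.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + mimeType);
        }
        List<String> ocrText = (ocr == null)
                ? Collections.<String>emptyList()
                : ocr.extractText(mimeType, document);
        return parser.parse(document, ocrText);
    }
}
```

Passing null for the engine skips the OCR step entirely, which is one way to keep it optional without the parser having to know about it.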
More generally, what is needed is something like a pipeline approach. That
said, I've already got one of those; I just want the library abstraction that
Tika gives me so I can plug in my OCR tool and get text out of it.
An alternative would be for Parsers of MIME types whose content can include
images to optionally take in an OCR Engine and, as they do their parsing, look
for images to hand to it.
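A rough sketch of this alternative, with the OCR engine as an optional dependency of the parser itself. The names (OcrAwareParserSketch, OcrEngine, Part, ImageAwareParser) are hypothetical and not Tika classes, and the document is simplified to a list of text and image parts so that the example stays self-contained.

```java
import java.util.List;

// Hypothetical types for illustration only; none of these are Tika APIs.
final class OcrAwareParserSketch {

    /** Turns one image into text. */
    interface OcrEngine {
        String recognize(byte[] image);
    }

    /** One piece of a document: either a run of text or an embedded image. */
    static final class Part {
        final String text;   // non-null for text parts
        final byte[] image;  // non-null for image parts

        private Part(String text, byte[] image) {
            this.text = text;
            this.image = image;
        }

        static Part text(String text) { return new Part(text, null); }
        static Part image(byte[] image) { return new Part(null, image); }
    }

    /** A parser that looks for images while parsing, if it was given an engine. */
    static final class ImageAwareParser {
        private final OcrEngine ocr; // null means "skip images"

        ImageAwareParser() { this(null); }
        ImageAwareParser(OcrEngine ocr) { this.ocr = ocr; }

        String parse(List<Part> parts) {
            StringBuilder out = new StringBuilder();
            for (Part part : parts) {
                if (part.text != null) {
                    out.append(part.text).append('\n');
                } else if (ocr != null) {
                    // Image part and an engine is available: OCR it in place.
                    out.append(ocr.recognize(part.image)).append('\n');
                }
            }
            return out.toString();
        }
    }
}
```

The trade-off compared with a separate OCR step is that every image-capable parser has to know about the engine, but the OCR'd text ends up in document order.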
BTW, for JavaOCR, the main issue seems to be getting training data for the
image parsing. Tesseract, on the other hand, has a rich set of models out of
the box, but is written in C++ (although it has Java wrappers).
> OCR support
> -----------
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.