[jira] [Commented] (TIKA-93) OCR support

Grant Ingersoll (JIRA) Fri, 07 Feb 2014 13:39:34 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895080#comment-13895080
 ]


Grant Ingersoll commented on TIKA-93:
-------------------------------------

I thought about the Parser approach, but it doesn't really feel like a Parser.
That is, many different things may be images or contain embedded images (PDFs,
actual images like JPG, embedded images in Word/PPT docs, etc.), so I want to
take the MIME type and optionally feed the document to the OCR engine, which
extracts the images and produces one or more items of text. That gives me back
something I can then pass along to the Parser.

So, for instance, in the case of a PPT with embedded images, you would:
1. Detect PPT
2. Extract/OCR Images
3. Feed to PPT/POI Parser
4. Obtain glory
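
The steps above could be sketched roughly as follows. This is only an
illustration: TypeDetector, OcrEngine, DocumentParser and process() are
hypothetical names invented for the sketch, not existing Tika APIs, and
collecting the OCR text alongside the parser output is an assumption about
how the results would be combined.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OcrPipelineSketch {

    /** Hypothetical: guesses a MIME type for the raw document bytes. */
    public interface TypeDetector {
        String detect(byte[] document);
    }

    /** Hypothetical: finds images in the document and returns one text item per image. */
    public interface OcrEngine {
        List<String> extractText(String mimeType, byte[] document);
    }

    /** Hypothetical stand-in for a format-specific parser such as the PPT/POI one. */
    public interface DocumentParser {
        String parse(String mimeType, byte[] document);
    }

    /** Runs detect -> optional OCR -> parse and returns the text items in that order. */
    public static List<String> process(byte[] document, TypeDetector detector,
                                       OcrEngine ocr, DocumentParser parser) {
        String mimeType = detector.detect(document);            // step 1: detect
        List<String> text = new ArrayList<>();
        if (ocr != null) {                                      // OCR is optional
            text.addAll(ocr.extractText(mimeType, document));   // step 2: extract/OCR images
        }
        text.add(parser.parse(mimeType, document));             // step 3: feed to the parser
        return text;
    }

    public static void main(String[] args) {
        byte[] doc = "slides".getBytes(StandardCharsets.UTF_8);
        // All three collaborators are fakes so the sketch runs without any OCR tool.
        List<String> text = process(
                doc,
                d -> "application/vnd.ms-powerpoint",
                (mime, d) -> Arrays.asList("text from image 1", "text from image 2"),
                (mime, d) -> "text from slide bodies");
        text.forEach(System.out::println);
    }
}
```

Passing null for the OcrEngine skips step 2 entirely, which is one way to
keep the OCR stage optional.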

In a generic sense, what is needed is a pipeline approach.  That being said,
I've already got one of those; I just want the library abstraction that Tika
gives me so I can plug in my OCR tool and get text out of it.

An alternative would be that Parsers for MIME types whose content may contain
images can optionally take in an OCR Engine and, as they do their parsing,
look for images to hand to it.
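
A minimal sketch of that alternative, assuming a parser that walks a list of
document parts. OcrEngine, Part and EmbeddedImageParser are hypothetical names
made up for illustration; none of them are existing Tika classes.

```java
import java.util.Arrays;
import java.util.List;

public class OcrAwareParserSketch {

    /** Hypothetical OCR abstraction: one image in, recognized text out. */
    public interface OcrEngine {
        String recognize(byte[] image);
    }

    /** Hypothetical document part: either plain text or an embedded image, never both. */
    public static final class Part {
        final String text;   // null for image parts
        final byte[] image;  // null for text parts

        private Part(String text, byte[] image) {
            this.text = text;
            this.image = image;
        }

        public static Part text(String text) { return new Part(text, null); }
        public static Part image(byte[] image) { return new Part(null, image); }
    }

    /** Hypothetical parser for a format that may embed images. */
    public static final class EmbeddedImageParser {
        private final OcrEngine ocr; // null means "skip images"

        public EmbeddedImageParser(OcrEngine ocr) { this.ocr = ocr; }

        /** Emits text parts as-is and OCRs image parts only when an engine was supplied. */
        public String parse(List<Part> parts) {
            StringBuilder out = new StringBuilder();
            for (Part part : parts) {
                if (part.image == null) {
                    out.append(part.text).append('\n');
                } else if (ocr != null) {
                    out.append(ocr.recognize(part.image)).append('\n');
                }
            }
            return out.toString();
        }
    }

    public static void main(String[] args) {
        List<Part> slide = Arrays.asList(
                Part.text("Quarterly results"),
                Part.image(new byte[] {1, 2, 3}));
        // Fake engine so the sketch runs without a real OCR tool installed.
        OcrEngine fake = image -> "[" + image.length + " image bytes recognized]";
        System.out.print(new EmbeddedImageParser(fake).parse(slide));
        System.out.print(new EmbeddedImageParser(null).parse(slide));
    }
}
```

The trade-off compared with the pipeline is that every image-capable Parser
has to know about the OCR Engine, but the OCR text ends up in document order
next to the surrounding content.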

BTW, for JavaOCR, the main issue seems to be getting training data for the 
image parsing.   Tesseract, on the other hand, has a rich set of models out of 
the box, but is written in C++ (although it has Java wrappers).

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
