[jira] [Commented] (TIKA-93) OCR support

Petr Vas (JIRA) Tue, 19 Aug 2014 04:18:08 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102124#comment-14102124
 ]


Petr Vas commented on TIKA-93:
------------------------------

OK, I have managed to get TesseractOCRParser working through tika-server with 
a custom tika-config.xml.

The only thing I had to change in the code was the initialization of 
TesseractOCRConfig in the parse method (line 114 in TesseractOCRParser.java). 
Instead of returning when ParseContext yields a null config, it now falls back 
to a default new TesseractOCRConfig(). Line 114 in TesseractOCRParser.java now 
looks like this:
{code:java}config = new TesseractOCRConfig();{code}
This made one of the tests fail (testPPTXThumbnail in OOXMLParserTest), so this 
change must not be merged into main as it stands, but it fits perfectly for my 
personal aims.
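
To make the fallback described above concrete, here is a minimal, 
self-contained sketch. It does not use Tika's classes: SketchOcrConfig and 
SketchParseContext are hypothetical stand-ins for TesseractOCRConfig and 
ParseContext, and the "language" field with its "eng" default is an 
assumption made only for illustration. It shows the general pattern (use the 
config from the context if one was supplied, otherwise create a default one), 
not the actual parser code.

```java
import java.util.HashMap;
import java.util.Map;

public class OcrConfigFallbackSketch {

    /** Hypothetical stand-in for TesseractOCRConfig; the field and its default are assumptions. */
    static final class SketchOcrConfig {
        String language = "eng";
    }

    /** Hypothetical stand-in for ParseContext: a type-keyed map of per-parse objects. */
    static final class SketchParseContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast(null) returns null, so a missing entry yields null.
            return key.cast(entries.get(key));
        }
    }

    /** Returns the caller-supplied config, or a default one when the context has none. */
    static SketchOcrConfig resolveConfig(SketchParseContext context) {
        SketchOcrConfig config = context.get(SketchOcrConfig.class);
        if (config == null) {
            // Fall back to defaults instead of returning early without OCR.
            config = new SketchOcrConfig();
        }
        return config;
    }

    public static void main(String[] args) {
        // No config registered: the default is used.
        SketchParseContext empty = new SketchParseContext();
        System.out.println(resolveConfig(empty).language);

        // Config registered by the caller: it takes precedence over the default.
        SketchParseContext custom = new SketchParseContext();
        SketchOcrConfig german = new SketchOcrConfig();
        german.language = "deu";
        custom.set(SketchOcrConfig.class, german);
        System.out.println(resolveConfig(custom).language);
    }
}
```

The trade-off is the one noted above: with the fallback in place, OCR runs 
even for callers that never asked for it, which is presumably why an existing 
test changes behaviour.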

I have also managed to use both the PDFBox and TesseractOCRParser parsers 
for PDFs by disabling magic detection and binding the PDFs that are to be 
OCRed to a specific MIME type (application/pdf-ocr). I can share my 
tika-config.xml in case this is of interest. I can see that work is being 
done on a seamless integration between PDFBox and Tesseract as part of 
GSoC 2014 (PDFBOX-1912), but it is not finished, and it is not clear whether 
it ever will be.

In general, I am wondering how I can define a ParseContext in tika-server, 
so that I can skip hacking the code and make Tesseract configurable outside 
of the source code. Any ideas/pointers here?

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx, 
> testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
