[
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102124#comment-14102124
]
Petr Vas commented on TIKA-93:
------------------------------
OK. I have managed to get TesseractOCRParser working through tika-server with a
custom tika-config.xml.
The only thing I had to change in the code was the initialization of
TesseractOCRConfig in the parse method (line 114 in TesseractOCRParser.java).
Instead of returning when ParseContext yields a null config, it now falls back
to a new TesseractOCRConfig(). Line 114 in TesseractOCRParser.java now looks
like this:
{code:java} config = new TesseractOCRConfig();{code}
This made one of the tests fail (testPPTXThumbnail in OOXMLParserTest), so this
change must not be merged into main, but it fits perfectly for my personal
needs.
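A minimal sketch of the fallback described above. It uses hypothetical
stand-in classes (OcrConfig for TesseractOCRConfig, Context for ParseContext)
so that it runs without Tika on the classpath; the default language "eng" and
the class-keyed lookup are assumptions made for illustration, not the actual
Tika source:

```java
import java.util.HashMap;
import java.util.Map;

public class OcrConfigFallback {

    // Hypothetical stand-in for TesseractOCRConfig.
    static class OcrConfig {
        final String language;
        OcrConfig() { this("eng"); }  // assumed default, for illustration only
        OcrConfig(String language) { this.language = language; }
    }

    // Hypothetical stand-in for ParseContext: a map keyed by class.
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    // Instead of giving up when no config was supplied, use a default one.
    static OcrConfig resolve(Context context) {
        OcrConfig config = context.get(OcrConfig.class);
        if (config == null) {
            config = new OcrConfig();
        }
        return config;
    }

    public static void main(String[] args) {
        Context empty = new Context();
        System.out.println(resolve(empty).language);   // prints "eng"

        Context custom = new Context();
        custom.set(OcrConfig.class, new OcrConfig("deu"));
        System.out.println(resolve(custom).language);  // prints "deu"
    }
}
```

The trade-off is the one noted above: with a default config the OCR path is
always taken, which is presumably why a test that expected no OCR started to
fail.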
I have also managed to use both the PDFBox and TesseractOCRParser parsers for
PDFs by disabling magic detection and binding the PDFs that are to be OCRed to
a specific MIME type (application/pdf-ocr). I can share my tika-config.xml if
that is of interest. I can see that work is being done on seamless integration
between PDFBox and Tesseract as part of GSoC 2014 (PDFBOX-1912), but it is not
finished yet, and it is not clear whether it ever will be.
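The routing idea can be sketched as follows: with magic detection disabled,
the parser is chosen purely from the MIME type the client declares. The class,
the map, and the labels "pdfbox-text", "tesseract-ocr", and "unrouted" are
hypothetical names for illustration; this is not the tika-config.xml mentioned
above, nor Tika's own dispatch code:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class MimeRouting {

    // Declared MIME type -> label of the parser that should handle it.
    static final Map<String, String> PARSER_BY_MIME = new LinkedHashMap<>();
    static {
        PARSER_BY_MIME.put("application/pdf", "pdfbox-text");
        PARSER_BY_MIME.put("application/pdf-ocr", "tesseract-ocr");
    }

    // Pick a parser from the declared type alone, without content sniffing.
    static String choose(String declaredType) {
        // Drop parameters such as "; charset=..." and normalize the case.
        String base = declaredType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return PARSER_BY_MIME.getOrDefault(base, "unrouted");
    }

    public static void main(String[] args) {
        System.out.println(choose("application/pdf"));      // prints "pdfbox-text"
        System.out.println(choose("application/pdf-ocr"));  // prints "tesseract-ocr"
    }
}
```

Because the made-up type application/pdf-ocr would never be produced by
content detection, this only works when the declared type is trusted as is.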
More generally, how can I define a ParseContext in tika-server, so that I can
stop hacking the code and make Tesseract configurable from outside the source
code? Any ideas/pointers here?
> OCR support
> -----------
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch,
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch,
> TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx,
> testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.
--
This message was sent by Atlassian JIRA
(v6.2#6252)