[ https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844488#comment-16844488 ]
ASF GitHub Bot commented on TIKA-2293:
--------------------------------------
changetoblow commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler
Java version of TesseractOCRParser
URL: https://github.com/apache/tika/pull/158#issuecomment-494233047
I found the main cause of this problem: when Tika parses an image embedded in
a Word document, the image ends up written to a temporary file with a .tmp
suffix. That file is then passed to Tess4J for recognition, and Tess4J does
not recognize it. How can I change the type of file that Tika generates so
that it works with Tess4J? Can you give me some ideas?
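
One possible direction, sketched below using only the JDK: sniff the real image format from the bytes with ImageIO and spool the data to a temporary file whose suffix matches that format, instead of .tmp. The class and method names (`OcrTempFileSketch`, `detectFormat`, `spoolWithImageSuffix`) are hypothetical and are not part of Tika or of the pull request. The sketch also assumes the file suffix is what Tess4J trips over, as described above, which has not been verified here.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.Locale;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper, not part of Tika or Tess4J.
final class OcrTempFileSketch {

    /** Returns a lowercase format name such as "png", or null if ImageIO has no reader for the bytes. */
    static String detectFormat(byte[] imageBytes) throws IOException {
        try (ImageInputStream in =
                 ImageIO.createImageInputStream(new ByteArrayInputStream(imageBytes))) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return null;
            }
            ImageReader reader = readers.next();
            try {
                return reader.getFormatName().toLowerCase(Locale.ROOT);
            } finally {
                reader.dispose();
            }
        }
    }

    /** Writes the bytes to a temp file named after the detected format; falls back to ".tmp". */
    static Path spoolWithImageSuffix(byte[] imageBytes) throws IOException {
        String format = detectFormat(imageBytes);
        String suffix = (format == null) ? ".tmp" : "." + format;
        Path file = Files.createTempFile("ocr-", suffix);
        Files.write(file, imageBytes);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny in-memory PNG to stand in for an embedded image.
        BufferedImage img = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "png", out);

        Path file = spoolWithImageSuffix(out.toByteArray());
        System.out.println(file.getFileName());
        Files.delete(file);
    }
}
```

The resulting file could then be handed to Tess4J's `doOCR(File)`. Another option that avoids the temporary file altogether would be to decode the stream with `ImageIO.read` and call `doOCR(BufferedImage)`, assuming ImageIO can read the embedded format.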
> Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: Thejan Wijesinghe
> Priority: Major
>
> Right now, TesseractOCRParser calls Tesseract and ImageMagick from the
> command line. The intention of this new parser, "Tess4jOCRParser", is to use
> the Tess4J API instead of executing Tesseract out of process via Runtime.exec.