[
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940818#comment-15940818
]
Thamme Gowda commented on TIKA-2293:
------------------------------------
Thanks, [~gagravarr] and [[email protected]], for the timely feedback.
I agree with all of it.
[~Thejan], great work. Your efforts are appreciated: they helped us evaluate
and understand the pros and cons of JNI-based OCR libraries.
I feel Tess4J could have been more modular, so that its native libs and
models could be selectively included or excluded, but it is not.
As suggested, please make this an independent parser in your own GitHub repo.
For an example, you may refer to
https://github.com/thammegowda/tika-ner-corenlp.
We had a similar situation there: a GPL license and huge model files for NER.
We built it as an extension to Tika and documented it on the wiki:
https://wiki.apache.org/tika/TikaAndNER#Using_Stanford_CoreNLP_NER
> Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: Thejan Wijesinghe
> Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J
> API instead of executing tesseract out of process via Runtime.exec.