[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940818#comment-15940818
 ] 

Thamme Gowda commented on TIKA-2293:
------------------------------------

Thanks, [~gagravarr] and [[email protected]], for the timely feedback.
I agree with all of it.

[~Thejan], great work! Your efforts are appreciated, as they helped us evaluate 
and understand the pros and cons of JNI-based OCR libs.
I feel Tess4J could have been more modular, allowing its native libs and models 
to be selectively included or excluded, but it is not!

As suggested, please make this an independent parser in your own GitHub repo.
For an example, you may refer to 
https://github.com/thammegowda/tika-ner-corenlp.
We had a similar situation there: a GPL license and huge model files for NER. 
We made it an extension to Tika and documented it on the wiki: 
https://wiki.apache.org/tika/TikaAndNER#Using_Stanford_CoreNLP_NER

>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
>                 Key: TIKA-2293
>                 URL: https://issues.apache.org/jira/browse/TIKA-2293
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: Thejan Wijesinghe
>             Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command 
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J 
> API instead of executing tesseract out of process via Runtime.exec.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
