[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940850#comment-15940850
 ] 

Thamme Gowda commented on TIKA-2293:
------------------------------------

I totally agree with all the points about the model files being many times 
the size of the Tika core/parser source code.

I think we have only just started bringing machine learning capabilities to 
Tika. I foresee more, and bigger, model files (new deep learning OCR models, 
image/video/audio recognition, captioning, ... the list goes on). IMHO, these 
model files are equally important.
In the long run, we will end up with either too many REST services (making 
the system too fragmented), native dependencies (tying it to specific 
platforms), or bundled model files (making it too fat). We will hit this same 
discussion again, so I am wondering whether we can also consider alternative, 
future-proof solutions for dealing with large model files. Perhaps we could 
make these models optional extensions and not include them in the core 
distribution?

>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
>                 Key: TIKA-2293
>                 URL: https://issues.apache.org/jira/browse/TIKA-2293
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: Thejan Wijesinghe
>             Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the 
> command line. The intention of this new parser, "Tess4jOCRParser", is to use 
> the Tess4J API instead of the runtime.exec way of executing tesseract out of 
> process.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
