[
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940850#comment-15940850
]
Thamme Gowda commented on TIKA-2293:
------------------------------------
[[email protected]]
Totally agree with all the points about model files growing to many times
the size of the Tika core/parser source code.
I think we have only just started bringing machine learning capabilities to
Tika. I foresee more, and bigger, model files (new deep learning OCR models,
image/video/audio recognition, captioning, and so on). IMHO, these model
files are equally important.
In the long run, we will end up with either too many REST services (making
the system too fragmented), native dependencies (tying it to specific
platforms), or bundled model files (making it too fat). We will hit this same
discussion again, so I am wondering if we can also consider alternative,
future-proof solutions for dealing with large model files. Perhaps we could
make these models optional extensions and not include them in the core
distribution?
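A minimal sketch of the optional-extension idea, assuming a model ships in a
separate jar as a classpath resource and the parser simply stays disabled when
that jar is absent. The class name and resource path below are hypothetical
and are not part of Tika:

```java
import java.net.URL;
import java.util.Optional;

// Hypothetical helper: checks whether an optional model jar is on the classpath.
final class OptionalModelLocator {

    private OptionalModelLocator() {
    }

    /** Returns the model's URL if some jar on the classpath provides it. */
    static Optional<URL> findModel(String resourcePath) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            // Fall back to the loader that loaded this class.
            cl = OptionalModelLocator.class.getClassLoader();
        }
        return Optional.ofNullable(cl.getResource(resourcePath));
    }

    /** True when the optional model is present and the feature can be enabled. */
    static boolean isAvailable(String resourcePath) {
        return findModel(resourcePath).isPresent();
    }

    public static void main(String[] args) {
        // Assumed resource path, for illustration only.
        String path = "org/example/models/ocr-model.bin";
        if (isAvailable(path)) {
            System.out.println("Model found, enabling the parser");
        } else {
            System.out.println("Model not found, parser stays disabled");
        }
    }
}
```

With something along these lines, the core distribution would carry only the
lookup code, and users who need a given model would add its jar themselves.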
> Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: Thejan Wijesinghe
> Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J
> API instead of executing tesseract out of process via Runtime.exec.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)