[
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939037#comment-15939037
]
Thejan Wijesinghe commented on TIKA-2293:
-----------------------------------------
About automating the download process of trained data for languages:
Why?
1. .traineddata files are huge, so it is impractical to include even a few
language packages in a bundle.
2. We don't want to burden everyone with language packages that they don't
need.
How?
1. How about we use the command line for this?
a) If a Tika user calls the setLanguage("Some language") method, we can
first search the tessdata folder (by parsing a command-line argument) for the
necessary traineddata file. If we find it, we can proceed with the OCR
process.
b) If it is not found in the tessdata folder, we can parse another
command-line argument to download the necessary traineddata file and move it
into the tessdata folder. Then we can proceed with the OCR process.
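To make steps a) and b) concrete, here is a rough sketch of the lookup-then-download logic. It is only an illustration under my own assumptions, not existing Tika or Tess4J code: the class name TrainedDataLocator, the Fetcher interface, and the method ensureTrainedData are all hypothetical, and where the file is actually downloaded from is left to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tika or Tess4J.
public class TrainedDataLocator {

    /** Hypothetical hook that opens a stream for a language's .traineddata file. */
    public interface Fetcher {
        InputStream open(String language) throws IOException;
    }

    /**
     * Returns the path of <language>.traineddata inside tessdataDir,
     * downloading it through the fetcher only if it is not already there.
     */
    public static Path ensureTrainedData(Path tessdataDir, String language, Fetcher fetcher)
            throws IOException {
        Path target = tessdataDir.resolve(language + ".traineddata");

        // Step a): the file is already in the tessdata folder, nothing to do.
        if (Files.isRegularFile(target)) {
            return target;
        }

        // Step b): download into a temporary file first, then move it into
        // place, so an interrupted download never leaves a truncated
        // .traineddata file behind.
        Files.createDirectories(tessdataDir);
        Path tmp = Files.createTempFile(tessdataDir, language, ".part");
        try (InputStream in = fetcher.open(language)) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up after a failure.
            Files.deleteIfExists(tmp);
        }
        return target;
    }
}
```

A real Fetcher could open a URL stream built from a base location supplied as a command-line argument. The folder could then be handed to Tess4J through setDatapath(...) before calling setLanguage(...). Keeping the download behind an interface also lets the logic be tested without network access.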
Benefits:
1. This way, we can ensure that only the users who need other language
packages download them.
2. Since we are automating the procedure, users don't have to worry about
downloading trained data and moving it into the tessdata folder.
Please give me your feedback on my idea. If you see a better solution than
mine, please let me know as well.
> Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: Thejan Wijesinghe
> Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the
> command line. The intention of this new parser, "Tess4jOCRParser", is to use
> the Tess4J API instead of the runtime.exec approach of executing tesseract
> out of process.