[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939037#comment-15939037
 ] 

Thejan Wijesinghe commented on TIKA-2293:
-----------------------------------------

About automating the download process of trained data for languages:

Why?
1. .traineddata files are huge. It's impractical to include even a few language 
packages in a bundle. 

2. We don't want to burden everyone with the language packages that they don't 
need.
 
How? 
1. How about we use the command line for this?
a) If a Tika user calls the setLanguage("Some language") method, we can 
first check the tessdata folder (located via a command-line argument) for the 
necessary traineddata file. If we find it, we can proceed with the OCR process.
b) If it is not found in the tessdata folder, we can use another command-line 
argument to download the necessary traineddata file and move it into the 
tessdata folder. Then we can proceed with the OCR process.
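
Steps (a) and (b) could be sketched roughly as follows. This is only an 
illustration of the idea, not existing Tika or Tess4J code: the class name 
TrainedDataResolver, the method ensureTrainedData, and the baseUri parameter 
are hypothetical, and it assumes the traineddata files are served as 
<language>.traineddata directly under some base URI that the caller supplies 
(for example from a command-line argument).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not part of Tika or Tess4J. */
final class TrainedDataResolver {

    private TrainedDataResolver() {}

    /**
     * Returns the path of <language>.traineddata inside tessdataDir,
     * downloading it from baseUri first if it is not already there.
     * Assumption: baseUri ends with '/' and serves <language>.traineddata.
     */
    static Path ensureTrainedData(Path tessdataDir, String language, URI baseUri)
            throws IOException {
        // Reject names that could escape the tessdata folder (e.g. "../x").
        if (!language.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("Unexpected language name: " + language);
        }
        String fileName = language + ".traineddata";
        Path target = tessdataDir.resolve(fileName);

        // Step (a): the file is already in the tessdata folder.
        if (Files.isRegularFile(target)) {
            return target;
        }

        // Step (b): download to a temporary file, then move it into place
        // so that an interrupted download never leaves a partial file behind.
        Files.createDirectories(tessdataDir);
        Path partial = Files.createTempFile(tessdataDir, fileName, ".part");
        try (InputStream in = baseUri.resolve(fileName).toURL().openStream()) {
            Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
            Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(partial);
        }
        return target;
    }
}
```

The parser would call something like this before starting OCR, so the download 
only happens for languages a user actually requests. Where the files are 
hosted, and whether a download should require explicit opt-in, are left open 
here.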

Benefits:
1. This way, we can ensure that only the users who need other language 
packages download them. 

2. Since we are automating the procedure, users don't have to worry about 
downloading trained data and moving it into the tessdata folder.

Please give me your feedback on my idea. If you see a better solution than 
mine, please let me know as well.


>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
>                 Key: TIKA-2293
>                 URL: https://issues.apache.org/jira/browse/TIKA-2293
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: Thejan Wijesinghe
>             Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command 
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J 
> API instead of executing tesseract out of process via Runtime.exec.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
