[jira] [Commented] (TIKA-2293) Tess4jOCRParser - A simpler Java version of TesseractOCRParser

Thejan Wijesinghe (JIRA) Wed, 22 Mar 2017 22:47:31 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937769#comment-15937769
 ]


Thejan Wijesinghe commented on TIKA-2293:
-----------------------------------------

Thank you, Tim and Nick, for your responses. It is a pleasure to have you guys 
there to help me. :)

Tess4J's low-level API supports obtaining more information about the scanned 
words, such as the recognition accuracy and whether a word is underlined, bold 
or italic.

1a. Ghost4J is used only to convert PDF files to TIFF or PNG. We can safely 
exclude it, and yes, we would then be on our own when it comes to converting 
PDFs to OCR-able formats.

1b. Rococoa-core is a generic Java wrapper for Cocoa, which is Mac OS X's 
native API. I could exclude Rococoa-core and still run the tests without a 
problem on my Linux machine, but I'm not sure about the effect on Mac OS X. 
(How well does Tika support Mac OS? Sorry for asking this question; I have 
never tried Tika on a Mac.)

3. Yes, for Windows the necessary DLLs come bundled with Tess4J. These DLLs 
are built with VS2015 and therefore depend on the Visual C++ 2015 
Redistributable Packages, so Windows users need to have the Visual C++ 2015 
Redistributable Packages installed (which I presume most Windows users have). 
According to [1], Linux users need to install libtesseract.so. I didn't have 
to, because I had already run "sudo apt-get install tesseract-ocr". To my 
amazement, though, even after purging tesseract-ocr I could still run the 
Tess4jOCRParser tests successfully. Perhaps purging didn't delete 
libtesseract.so from the system.
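
One way to check that guess is to look for the shared library directly. The 
following is only a minimal sketch using plain Java; the class name 
LibTesseractFinder and the directories it searches are assumptions chosen for 
illustration, not part of Tess4J or Tika, and the library may live elsewhere 
on a given distribution.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports libtesseract shared objects found in a few directories.
public class LibTesseractFinder {

    // True for names such as libtesseract.so or libtesseract.so.3.0.4.
    public static boolean isTesseractLibrary(String fileName) {
        return fileName.equals("libtesseract.so")
                || fileName.startsWith("libtesseract.so.");
    }

    // Lists matching files directly inside the given directories (no recursion).
    // Directories that do not exist are skipped.
    public static List<Path> find(List<Path> directories) {
        List<Path> hits = new ArrayList<>();
        for (Path dir : directories) {
            if (!Files.isDirectory(dir)) {
                continue;
            }
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (isTesseractLibrary(entry.getFileName().toString())) {
                        hits.add(entry);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed search locations; adjust for the distribution in use.
        List<Path> dirs = new ArrayList<>();
        dirs.add(Paths.get("/usr/lib"));
        dirs.add(Paths.get("/usr/lib/x86_64-linux-gnu"));
        dirs.add(Paths.get("/usr/local/lib"));
        for (Path hit : find(dirs)) {
            System.out.println(hit);
        }
    }
}
```

If this prints a path after tesseract-ocr has been purged, the library was 
most likely installed by a separate package that the purge left in place.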

4. Trained data for English, i.e. eng.traineddata, comes bundled with the 
tess4j jar in a folder named tessdata. If the user needs to OCR an image in 
another language, or in a combination of languages other than English, he or 
she will have to download the specific trained data from [2] and put it in the 
tessdata folder.

4b. I am not sure whether we can give users the luxury of having these 
language packages downloaded automatically when the user sets the language to 
something other than English. Can we create a mechanism to download those 
language packages automatically (similar to Maven downloading dependencies)? 
Is that practical?
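
To make the idea in 4 and 4b concrete, here is a rough sketch in plain Java of 
what such a mechanism could look like: split the language setting (Tesseract 
accepts combinations such as "eng+fra"), work out which .traineddata files are 
missing from the tessdata folder, and fetch them. The class 
LanguagePackPlanner, its method names and the base URL passed to it are all 
assumptions made for illustration; none of this is existing Tess4J or Tika 
API, and the raw-file URL layout of [2] would need to be confirmed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: decides which language packs are missing and fetches them.
public class LanguagePackPlanner {

    // Splits a language setting such as "eng+fra" into its language codes.
    public static Set<String> parseLanguages(String setting) {
        Set<String> codes = new LinkedHashSet<>();
        for (String part : setting.split("\\+")) {
            String code = part.trim();
            if (!code.isEmpty()) {
                codes.add(code);
            }
        }
        return codes;
    }

    // Returns the requested codes that have no <code>.traineddata file
    // among the file names already present in the tessdata folder.
    public static List<String> missing(String setting, Set<String> installedFileNames) {
        List<String> result = new ArrayList<>();
        for (String code : parseLanguages(setting)) {
            if (!installedFileNames.contains(code + ".traineddata")) {
                result.add(code);
            }
        }
        return result;
    }

    // Builds the download URL for one language. The base URL is an assumption
    // supplied by the caller, e.g. a raw-file location derived from [2].
    public static String downloadUrl(String baseUrl, String code) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return base + code + ".traineddata";
    }

    // Downloads one file into the tessdata folder, replacing any existing copy.
    public static void download(String url, Path tessdataDir, String code) throws IOException {
        Path target = tessdataDir.resolve(code + ".traineddata");
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A parser could call missing(...) before running OCR and then either download 
the files or fail with a message naming them. Whether an automatic download is 
acceptable (network access, file sizes, licensing) is the open question above.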

[1] http://tess4j.sourceforge.net/usage.html
[2] https://github.com/tesseract-ocr/tessdata


>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
>                 Key: TIKA-2293
>                 URL: https://issues.apache.org/jira/browse/TIKA-2293
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: Thejan Wijesinghe
>             Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the 
> command line. The intention of this new parser, "Tess4jOCRParser", is to use 
> the Tess4J API instead of the Runtime.exec way of executing tesseract out of 
> process.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
