[
https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396332#comment-17396332
]
Abha edited comment on TIKA-3518 at 8/10/21, 1:00 AM:
------------------------------------------------------
Update:
I tried Tika 1.27 and see the following error:
Exception in thread "main" org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1 err msg: Error opening data file C:\Program Files\Tesseract-OCR\eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
This happens with every Tesseract version from 4.0 onward. I have set
TESSDATA_PREFIX and can confirm it has a value at runtime. I also tried
replacing eng.traineddata with the latest copy from Git, but the error is the
same for all of those versions. It only works with 3.x versions.
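A quick way to check the point above is a small standalone Java program that prints where the language file would be looked up for the TESSDATA_PREFIX value the JVM actually sees. This is only a sketch: the class name `TessdataCheck` and its helper methods are hypothetical, and it assumes that the data file is resolved as `<TESSDATA_PREFIX>/<lang>.traineddata`, which is what the path in the error message suggests. That path has no `tessdata` component, so the prefix in effect for the child process may be the install directory rather than the "tessdata" directory the message asks for. That is a guess from the message, not a confirmed diagnosis.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TessdataCheck {

    /** Path where the language file is assumed to be looked up: prefix/lang.traineddata. */
    static Path expectedDataFile(String tessdataPrefix, String lang) {
        return Paths.get(tessdataPrefix).resolve(lang + ".traineddata");
    }

    /** Human-readable summary of whether the language file is present under the prefix. */
    static String describe(String tessdataPrefix, String lang) {
        if (tessdataPrefix == null || tessdataPrefix.isEmpty()) {
            return "TESSDATA_PREFIX is not set";
        }
        Path file = expectedDataFile(tessdataPrefix, lang);
        return file + (Files.isRegularFile(file) ? " exists" : " is missing");
    }

    public static void main(String[] args) {
        // Reads the value as seen by this JVM; a child process inherits the same
        // environment unless the launching code changes it.
        System.out.println(describe(System.getenv("TESSDATA_PREFIX"), "eng"));
    }
}
```

If this reports the file as missing, pointing TESSDATA_PREFIX at the directory that actually contains eng.traineddata would be the first thing to try.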
was (Author: abha.1012):
Please find my responses inline:
{color:#FFAB00}When you say the processbuilder isn't creating the tmp file,
does that mean that tesseract is failing to run at all?{color}
- It is not failing to run. It extracts the metadata correctly but cannot
extract the image content, because the tmp output (txt) file is not created and
the check linked below fails, so no content is extracted:
https://github.com/apache/tika/blob/6f4365b9ef03ac99de21f10a6e3f2a98452c5007/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L289
{color:#FFAB00}have you tried 1.27?{color}
- Yes, the issue is the same with 1.27.
The issue occurs starting with Tesseract 4.0.0. It works fine with Tesseract
3.x and 4.0.0alpha.
Also, I am able to run Tesseract OCR from the command line, and it extracts
the content correctly.
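To compare the command-line behaviour with what happens under a ProcessBuilder, Tesseract can be launched from a small Java program outside Tika. The following is a minimal sketch, not Tika's implementation: the class `TesseractProbe` and its methods are hypothetical, it assumes a `tesseract` executable on the PATH, and it uses the standard `tesseract <image> <outputbase> -l <lang>` invocation, which writes `<outputbase>.txt`. It avoids APIs newer than JDK 1.8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TesseractProbe {

    /** Builds: tesseract <image> <outputBase> -l <lang> */
    static List<String> buildCommand(String exe, Path image, Path outputBase, String lang) {
        return Arrays.asList(exe, image.toString(), outputBase.toString(), "-l", lang);
    }

    /** Tesseract appends ".txt" to the output base for plain-text output. */
    static Path expectedOutput(Path outputBase) {
        return Paths.get(outputBase.toString() + ".txt");
    }

    /** Reads a stream fully (JDK 8 compatible, no InputStream.readAllBytes). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TesseractProbe <image> [tessdataDir]");
            return;
        }
        Path image = Paths.get(args[0]);
        Path outputBase = Files.createTempDirectory("tess-probe").resolve("out");

        ProcessBuilder pb = new ProcessBuilder(buildCommand("tesseract", image, outputBase, "eng"));
        if (args.length > 1) {
            // Optionally override the prefix for the child process only.
            pb.environment().put("TESSDATA_PREFIX", args[1]);
        }
        pb.redirectErrorStream(true); // merge stderr so error text is not lost

        Process p = pb.start();
        String console = readAll(p.getInputStream());
        int exit = p.waitFor();

        System.out.println("exit value: " + exit);
        System.out.println("console output: " + console);
        System.out.println("output file created: " + Files.exists(expectedOutput(outputBase)));
    }
}
```

If this probe fails in the same way as Tika while the interactive command line succeeds, the difference is likely in the environment the Java process passes to Tesseract rather than in Tika's parsing.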
> Tika 1.26 not Working with Tesseract 4.0 and Higher Version
> -----------------------------------------------------------
>
> Key: TIKA-3518
> URL: https://issues.apache.org/jira/browse/TIKA-3518
> Project: Tika
> Issue Type: Bug
> Components: ocr, tika-batch, tika-dl, tika-server
> Affects Versions: 1.26
> Reporter: Abha
> Priority: Major
>
> ProcessBuilder not creating tmp file for Tesseract 4.1 and higher versions
> with Tika 1.26 and JDK 1.8.
> I am working on a project that integrates Tika and Tesseract OCR. The Tika
> version is 1.26, on JDK 1.8. Any Tesseract version earlier than 4.0 works
> fine and extracts the image/PDF data correctly, but upgrading Tesseract OCR
> to 4.1.1 or higher results in no data extraction. I debugged the issue and
> found that the ProcessBuilder is not creating the temporary txt output file
> from which TesseractOCR extracts the result. Is this a version
> compatibility issue, and how can it be resolved?