[
https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396332#comment-17396332
]
Abha edited comment on TIKA-3518 at 8/10/21, 1:09 AM:
------------------------------------------------------
Update:
I tried Tika 1.27 and now get the following error:
Exception in thread "main" org.apache.tika.exception.TikaException:
TesseractOCRParser bad exit value 1 err msg: Error opening data file C:\Program
Files\Tesseract-OCR\eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
This happens with every Tesseract version from 4.0 onward. I have set
TESSDATA_PREFIX to C:\Program Files\Tesseract-OCR\tessdata and can see that it
has a value at runtime. I also tried replacing eng.traineddata with the copy
from the latest git download, but I still get the same error with every
Tesseract version. It only works with 3.x versions.
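The path in the error message (C:\Program Files\Tesseract-OCR\eng.traineddata) is one level above the tessdata directory, so it may be worth confirming where the file actually sits relative to the prefix the JVM sees. Below is a minimal diagnostic sketch in plain Java, with no Tika calls. The class name `TessdataCheck` and the method `findTrainedData` are hypothetical names chosen for illustration, and the sketch only checks two candidate locations: directly under the prefix, and under a `tessdata` subdirectory of it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or Tesseract.
class TessdataCheck {

    // Returns every candidate location under `prefix` where
    // `<lang>.traineddata` exists as a regular file.
    static List<Path> findTrainedData(Path prefix, String lang) {
        String fileName = lang + ".traineddata";
        List<Path> found = new ArrayList<>();
        Path direct = prefix.resolve(fileName);                      // <prefix>/eng.traineddata
        Path nested = prefix.resolve("tessdata").resolve(fileName);  // <prefix>/tessdata/eng.traineddata
        if (Files.isRegularFile(direct)) {
            found.add(direct);
        }
        if (Files.isRegularFile(nested)) {
            found.add(nested);
        }
        return found;
    }

    public static void main(String[] args) {
        // Reads the variable as seen by this JVM, which may differ from
        // what an interactive shell shows.
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null) {
            System.out.println("TESSDATA_PREFIX is not set in this JVM's environment");
            return;
        }
        List<Path> found = findTrainedData(Paths.get(prefix), "eng");
        if (found.isEmpty()) {
            System.out.println("eng.traineddata not found under " + prefix);
        } else {
            for (Path p : found) {
                System.out.println("found: " + p);
            }
        }
    }
}
```

Running this from the same launcher as the failing application (IDE, service, or command line) shows whether the variable reaches the JVM and which of the two layouts matches the installation.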
> Tika 1.26 not Working with Tesseract 4.0 and Higher Version
> -----------------------------------------------------------
>
> Key: TIKA-3518
> URL: https://issues.apache.org/jira/browse/TIKA-3518
> Project: Tika
> Issue Type: Bug
> Components: ocr, tika-batch, tika-dl, tika-server
> Affects Versions: 1.26
> Reporter: Abha
> Priority: Major
>
> ProcessBuilder is not creating the tmp file for Tesseract 4.1 and higher
> versions with Tika 1.26 and JDK 1.8.
> I am working on a project that integrates Tika and Tesseract OCR. The Tika
> version is 1.26 and the JDK is 1.8. Any Tesseract version earlier than 4.0
> works fine and extracts the image/PDF data correctly, but upgrading
> TesseractOCR to 4.1.1 or higher results in no data extraction. I debugged the
> issue and found that the ProcessBuilder is not creating the temporary txt
> output file from which TesseractOCR extracts the result, which causes the
> problem. Is this a version compatibility issue, and if so, how can it be
> resolved?
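One way to narrow down the missing output file is to launch Tesseract directly through ProcessBuilder, outside Tika, and look at the exit code and the merged stdout/stderr. The following is a rough sketch under stated assumptions, not Tika's actual implementation: the class `TesseractProbe` and its methods are hypothetical, `tesseract` is assumed to be on the PATH, and the command uses the standard `tesseract <image> <outputbase> -l <lang>` form, where Tesseract appends `.txt` to the output base.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical probe, independent of Tika.
class TesseractProbe {

    // Builds: <tesseractExe> <image> <outputBase> -l <lang>
    static List<String> buildCommand(String tesseractExe, Path image,
                                     Path outputBase, String lang) {
        return Arrays.asList(tesseractExe, image.toString(),
                outputBase.toString(), "-l", lang);
    }

    // Runs the command, prints merged stdout/stderr, and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // fold stderr into stdout so errors are not lost
        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println("[tesseract] " + line);
            }
        }
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TesseractProbe <image>");
            return;
        }
        Path image = Paths.get(args[0]);
        Path outputBase = Files.createTempDirectory("ocr-probe").resolve("out");
        int exit = run(buildCommand("tesseract", image, outputBase, "eng"));
        Path txt = Paths.get(outputBase.toString() + ".txt");
        System.out.println("exit code: " + exit);
        System.out.println("output file exists: " + Files.exists(txt));
    }
}
```

If the exit code is non-zero and no `.txt` file appears, the printed Tesseract output should show whether the process itself failed (for example, on language data) rather than ProcessBuilder failing to create the file. If the child process needs a different TESSDATA_PREFIX, it can be set via `pb.environment().put(...)` before `start()`.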