[jira] [Comment Edited] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

Abha (Jira) Mon, 09 Aug 2021 18:10:08 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396332#comment-17396332
 ]


Abha edited comment on TIKA-3518 at 8/10/21, 1:09 AM:
------------------------------------------------------

Update:

I tried Tika 1.27 and now get the following error:

Exception in thread "main" org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1 err msg: Error opening data file C:\Program Files\Tesseract-OCR\eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

 

This happens with every Tesseract version from 4.0 onward. I have set 
TESSDATA_PREFIX to C:\Program Files\Tesseract-OCR\tessdata and can confirm it 
has that value at runtime. I also tried replacing eng.traineddata with the one 
from the latest git download, but the same error occurs for all of these 
Tesseract versions. Only the 3.x versions work.
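
The error names C:\Program Files\Tesseract-OCR\eng.traineddata, which is one
level above the tessdata directory that TESSDATA_PREFIX was set to. A minimal
diagnostic sketch in Java can report where eng.traineddata actually sits
relative to the value the JVM sees. The class name TessdataCheck and the method
findTrainedData are hypothetical and not part of Tika or Tesseract, and the two
candidate locations checked (directly under the prefix, or in a "tessdata"
subdirectory of it) are an assumption about where the file might be.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika or Tesseract.
final class TessdataCheck {

    /**
     * Returns each candidate location under the given prefix where
     * <lang>.traineddata exists as a regular file: directly in the prefix
     * directory, or in a "tessdata" subdirectory of it.
     */
    static List<Path> findTrainedData(Path prefix, String lang) {
        String fileName = lang + ".traineddata";
        List<Path> found = new ArrayList<>();
        Path direct = prefix.resolve(fileName);
        Path nested = prefix.resolve("tessdata").resolve(fileName);
        if (Files.isRegularFile(direct)) {
            found.add(direct);
        }
        if (Files.isRegularFile(nested)) {
            found.add(nested);
        }
        return found;
    }

    public static void main(String[] args) {
        // This is the value seen by the JVM itself, not necessarily the one
        // handed to a child process.
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null || prefix.isEmpty()) {
            System.out.println("TESSDATA_PREFIX is not set in this JVM's environment");
            return;
        }
        List<Path> hits = findTrainedData(Paths.get(prefix), "eng");
        System.out.println("TESSDATA_PREFIX = " + prefix);
        if (hits.isEmpty()) {
            System.out.println("eng.traineddata not found under the prefix");
        } else {
            System.out.println("found: " + hits);
        }
    }
}
```

Note that this only inspects the JVM's own environment. A process started
through ProcessBuilder can be given a modified environment, so the value that
the tesseract executable receives may differ from what System.getenv reports.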

 

 


was (Author: abha.1012):
Update:

I tried Tika 1.27 and now get the following error:

Exception in thread "main" org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1 err msg: Error opening data file C:\Program Files\Tesseract-OCR\eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

 

This happens with every Tesseract version from 4.0 onward. I have set 
TESSDATA_PREFIX and can confirm it has a value at runtime. I also tried 
replacing eng.traineddata with the one from the latest git download, but the 
same error occurs for all of these Tesseract versions. Only the 3.x versions 
work.

 

 

> Tika 1.26 not Working with Tesseract 4.0 and Higher Version
> -----------------------------------------------------------
>
>                 Key: TIKA-3518
>                 URL: https://issues.apache.org/jira/browse/TIKA-3518
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr, tika-batch, tika-dl, tika-server
>    Affects Versions: 1.26
>            Reporter: Abha
>            Priority: Major
>
> ProcessBuilder not creating tmp file for Tesseract 4.1 and higher versions 
> with Tika 1.26 and JDK 1.8.
> I am working on a project that integrates Tika and Tesseract OCR, using Tika 
> 1.26 and JDK 1.8. Any Tesseract version earlier than 4.0 works fine and 
> extracts the image/PDF data correctly, but upgrading Tesseract OCR to 4.1.1 
> or higher results in no data being extracted. I debugged the issue and found 
> that the ProcessBuilder is not creating the temporary txt output file from 
> which TesseractOCR extracts the result, which causes the problem. Is this a 
> version compatibility issue, and how can it be resolved?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
