[jira] [Comment Edited] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

Abha (Jira) Mon, 09 Aug 2021 18:01:24 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396332#comment-17396332
 ]


Abha edited comment on TIKA-3518 at 8/10/21, 1:00 AM:
------------------------------------------------------

Update:

I tried Tika 1.27 and now see the following error:

Exception in thread "main" org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1 err msg: Error opening data file C:\Program Files\Tesseract-OCR\eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

 

This happens with every Tesseract version from 4.0 onward. I have set
TESSDATA_PREFIX and can see that it has a value at runtime. I also tried
replacing eng.traineddata with the one from the latest git download, but the
error is the same for all of these versions. It only works with 3.x versions.
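
The error message above names the exact file Tesseract tried to open, so one
way to narrow this down is to check from Java whether that file exists relative
to TESSDATA_PREFIX. The following is a minimal sketch, not Tika code: the class
TessdataCheck and its methods are hypothetical names, and it only assumes the
language file is called <lang>.traineddata, as in the error message.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; not part of Tika or Tesseract.
class TessdataCheck {

    // The file that should exist if `dir` is a usable tessdata directory for `lang`.
    static Path languageFile(Path dir, String lang) {
        return dir.resolve(lang + ".traineddata");
    }

    // True only if the language file exists and is a regular file.
    static boolean hasLanguageData(Path dir, String lang) {
        return Files.isRegularFile(languageFile(dir, lang));
    }

    public static void main(String[] args) {
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null || prefix.isEmpty()) {
            System.out.println("TESSDATA_PREFIX is not set");
            return;
        }
        Path dir = Paths.get(prefix);
        System.out.println("TESSDATA_PREFIX = " + dir);
        // Check both layouts: the file directly in the directory, and in a tessdata subdirectory.
        System.out.println("eng.traineddata directly inside it: "
                + hasLanguageData(dir, "eng"));
        System.out.println("eng.traineddata inside its tessdata subdirectory: "
                + hasLanguageData(dir.resolve("tessdata"), "eng"));
    }
}
```

Running this in the same environment as the failing Tika process shows which of
the two locations actually holds eng.traineddata, which can then be compared
with the path printed in the error.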

 

 


was (Author: abha.1012):
Please find my responses inline:

{color:#FFAB00}When you say the processbuilder isn't creating the tmp file, 
does that mean that tesseract is failing to run at all?{color}
- It is not failing to run. It extracts the metadata correctly but cannot
extract the image content, because it does not create the tmp output (txt)
file and so fails this check (see the link below), hence no content
extraction:
https://github.com/apache/tika/blob/6f4365b9ef03ac99de21f10a6e3f2a98452c5007/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L289
 

{color:#FFAB00}have you tried 1.27?{color}
- Yes, the issue is the same with 1.27.
 It occurs starting from Tesseract 4.0.0. It works fine with Tesseract 3.x
 and 4.0.0alpha.

Also, I am able to run Tesseract OCR from the command line, and it extracts the
content correctly.
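
For comparison with the command-line run, here is a rough sketch of launching
Tesseract from Java with ProcessBuilder and working out where the text output
should appear. It is not Tika's implementation: TesseractLauncher and its
methods are hypothetical names. It assumes the usual
"tesseract <image> <outputbase> -l <lang>" invocation, which writes
<outputbase>.txt, and it passes the tessdata location through the
TESSDATA_PREFIX environment variable of the child process.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher for illustration; not Tika's TesseractOCRParser.
class TesseractLauncher {

    // Command line in the form: tesseract <image> <outputbase> -l <lang>
    static List<String> command(String tesseractExe, Path image, Path outputBase, String lang) {
        return Arrays.asList(tesseractExe, image.toString(), outputBase.toString(), "-l", lang);
    }

    // Prepares (but does not start) the process, with TESSDATA_PREFIX set for the child only.
    static ProcessBuilder prepare(String tesseractExe, Path image, Path outputBase,
                                  String lang, Path tessdataDir) {
        ProcessBuilder pb = new ProcessBuilder(command(tesseractExe, image, outputBase, lang));
        pb.environment().put("TESSDATA_PREFIX", tessdataDir.toString());
        // Merge stderr into stdout so Tesseract's error text is not lost.
        pb.redirectErrorStream(true);
        return pb;
    }

    // Tesseract appends ".txt" to the output base for plain-text output.
    static Path expectedOutput(Path outputBase) {
        return Paths.get(outputBase.toString() + ".txt");
    }
}
```

After calling start() and waitFor() on the returned ProcessBuilder, a non-zero
exit value or a missing expectedOutput(...) file means the captured process
output is the place to look for the reason.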

 

 

 

> Tika 1.26 not Working with Tesseract 4.0 and Higher Version
> -----------------------------------------------------------
>
>                 Key: TIKA-3518
>                 URL: https://issues.apache.org/jira/browse/TIKA-3518
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr, tika-batch, tika-dl, tika-server
>    Affects Versions: 1.26
>            Reporter: Abha
>            Priority: Major
>
> ProcessBuilder not creating tmp file for Tesseract 4.1 and Higher Versions 
> With Tika 1.26 and JDK 1.8
> I am working on a project that integrates Tika and Tesseract OCR. The Tika
> version is 1.26 and the JDK is 1.8. Any Tesseract version earlier than 4.0
> works fine and extracts the image/PDF data correctly, but upgrading Tesseract
> OCR to 4.1.1 or higher results in no data extraction. I debugged the issue and
> found that the ProcessBuilder is not creating the temporary txt output file
> from which the OCR result is extracted, which causes the issue. Is this a
> version compatibility issue, and how can it be resolved?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
