[jira] [Commented] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

Tim Allison (Jira) Mon, 09 Aug 2021 13:46:06 -0700


    [ https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396277#comment-17396277 ]


Tim Allison commented on TIKA-3518:
-----------------------------------

There shouldn't be any new config changes. Hmmm...

When you say the ProcessBuilder isn't creating the tmp file, does that mean 
that tesseract is failing to run at all?

In 2.x, we added:
{noformat}
            throw new TikaException(
                    "TesseractOCRParser bad exit value " + exitValue + " err msg: " +
                            errBuilder.toString());
{noformat}

so that you could see what tesseract was not happy with.  It looks like we 
backported that to 1.27. 
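
For illustration only, here is a minimal sketch of that kind of check. It is 
not the actual TesseractOCRParser code: the class and method names are made 
up, IOException stands in for TikaException, and stderr is merged into stdout 
to keep the example short.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical class name; a sketch, not Tika's implementation.
class ExitValueCheck {

    // Runs the command, collects everything it prints, and throws if the
    // exit value is non-zero so the caller can see what went wrong.
    static String runOrThrow(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Assumption for brevity: merge stderr into stdout so a single
        // reader drains both and the child cannot block on a full pipe.
        pb.redirectErrorStream(true);
        Process process = pb.start();

        StringBuilder errBuilder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                errBuilder.append(line).append('\n');
            }
        }

        int exitValue = process.waitFor();
        if (exitValue != 0) {
            // IOException is a stand-in for TikaException here.
            throw new IOException(
                    "bad exit value " + exitValue + " err msg: " + errBuilder);
        }
        return errBuilder.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ExitValueCheck <command> [args...]");
            return;
        }
        System.out.print(runOrThrow(Arrays.asList(args)));
    }
}
```

With something like this in place, a tesseract invocation that fails 
immediately surfaces its own error text instead of silently producing no 
output file.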

I know that we fixed some Windows-based issues in 2.x, but this _shouldn't_ be 
a problem.  Thank you for your help in debugging this!

Apologies if I've missed this, but have you tried 1.27?

If you set logging to debug, what is the command line used to call tesseract?  
Does it work if you copy and paste it into a cmd.exe window?
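
If the debug log is hard to get at, a small stand-alone sketch like the one 
below can print a command list in a form that pastes into cmd.exe. The class 
name, method name and paths are placeholders, and the quoting is deliberately 
simplistic (it only wraps arguments that contain whitespace).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Tika.
class CommandLineDump {

    // Joins the arguments into one line, wrapping any argument that
    // contains whitespace in double quotes so it survives a paste.
    static String toCommandLine(List<String> command) {
        List<String> parts = new ArrayList<>();
        for (String arg : command) {
            parts.add(arg.matches(".*\\s.*") ? "\"" + arg + "\"" : arg);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Placeholder paths; "tesseract <image> <outputbase>" is the basic
        // form, and tesseract appends ".txt" to the output base itself.
        List<String> command = Arrays.asList(
                "tesseract", "C:\\tmp\\my scan.png", "C:\\tmp\\out");
        System.out.println(toCommandLine(command));
    }
}
```

Running the printed line by hand and then checking whether the output base 
plus ".txt" exists shows whether the problem is in tesseract itself or in how 
it is being launched.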

> Tika 1.26 not Working with Tesseract 4.0 and Higher Version
> -----------------------------------------------------------
>
>                 Key: TIKA-3518
>                 URL: https://issues.apache.org/jira/browse/TIKA-3518
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr, tika-batch, tika-dl, tika-server
>    Affects Versions: 1.26
>            Reporter: Abha
>            Priority: Major
>
> ProcessBuilder not creating tmp file for Tesseract 4.1 and Higher Versions 
> With Tika 1.26 and JDK 1.8
> I am working on a project which integrates Tika and Tesseract OCR. The Tika 
> version is 1.26 and the JDK is 1.8. Any Tesseract version earlier than 4.0 
> works fine and extracts the image/pdf data correctly, but upgrading Tesseract 
> to 4.1.1 or higher results in no data extraction. I debugged the issue and 
> found that the ProcessBuilder is not creating the temporary txt output file 
> from which TesseractOCR extracts the result, resulting in the issue. Any idea 
> whether this is a version compatibility issue, or how to resolve it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
