[ 
https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401093#comment-17401093
 ] 

Tim Allison edited comment on TIKA-3518 at 8/18/21, 2:23 PM:
-------------------------------------------------------------

I think I figured out what is going on: there is a bug.  At least for 
tesseract 5.x, and at least on Windows, the tesseract data path must include 
"tessdata".  In our code, if the tessdata path is not specified via the 
TikaConfig but the tesseract path is, we use the tesseract path as the 
tessdata path (on the untested theory that tesseract would figure it out), 
but this is not right.  We need to append "/tessdata" to the tesseract path.
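
A minimal sketch of the fallback described above, assuming the fix is simply 
to append "tessdata" when only the tesseract path is configured.  The class 
and method names here are hypothetical and are not Tika's actual code:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical illustration only; not the actual TesseractOCRParser logic.
public class TessdataPathSketch {

    // Returns the directory to use as the tessdata path, if any.
    static Optional<Path> resolveTessdataDir(String tesseractPath, String tessdataPath) {
        // An explicitly configured tessdata path always wins.
        if (tessdataPath != null && !tessdataPath.trim().isEmpty()) {
            return Optional.of(Paths.get(tessdataPath.trim()));
        }
        // Otherwise fall back to <tesseractPath>/tessdata rather than
        // the bare tesseract path.
        if (tesseractPath != null && !tesseractPath.trim().isEmpty()) {
            return Optional.of(Paths.get(tesseractPath.trim()).resolve("tessdata"));
        }
        // Nothing configured: leave it to tesseract's own defaults.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Example path is made up for illustration.
        System.out.println(resolveTessdataDir("/opt/tesseract", null));
    }
}
```

Returning an empty Optional when neither path is set keeps the "tesseract is 
on your path" case untouched.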

So, you are setting the tesseract path, and our code is currently using that 
path to set the tessdata environment variable in the spawned process.  Again, 
this is a bug that we need to fix.
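
For illustration, here is roughly how a tessdata directory can be passed to a 
spawned process through its environment.  This is a sketch, not Tika's code: 
the class and method names are hypothetical, and it assumes the variable in 
question is TESSDATA_PREFIX, which is the one tesseract reads:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the actual TesseractOCRParser logic.
public class TesseractEnvSketch {

    // Builds (but does not start) a process whose environment points
    // tesseract at the given tessdata directory.
    static ProcessBuilder buildTesseractProcess(List<String> command, String tessdataDir) {
        ProcessBuilder pb = new ProcessBuilder(command);
        if (tessdataDir != null) {
            // Assumption: TESSDATA_PREFIX is the variable tesseract consults.
            pb.environment().put("TESSDATA_PREFIX", tessdataDir);
        }
        return pb;
    }

    public static void main(String[] args) {
        // Command and paths are made up for illustration.
        ProcessBuilder pb = buildTesseractProcess(
                Arrays.asList("tesseract", "input.png", "output"),
                "/opt/tesseract/tessdata");
        System.out.println(pb.environment().get("TESSDATA_PREFIX"));
    }
}
```

If the value placed in that variable is the bare tesseract path rather than 
the tessdata directory, tesseract may fail to find its language data, which 
matches the behaviour described in this issue.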

If tesseract is on your path and you don't specify a custom tesseract path via 
the config, it should just work.

Let me confirm this works on Linux, and I'll fix this before the 2.1.0 release.


> Tika 1.26 not Working with Tesseract 4.0 and Higher Version
> -----------------------------------------------------------
>
>                 Key: TIKA-3518
>                 URL: https://issues.apache.org/jira/browse/TIKA-3518
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr, tika-batch, tika-dl, tika-server
>    Affects Versions: 1.26
>            Reporter: Abha
>            Priority: Major
>
> ProcessBuilder is not creating the tmp file for Tesseract 4.1 and higher 
> versions with Tika 1.26 and JDK 1.8.
> I am working on a project that integrates Tika and Tesseract OCR. The Tika 
> version is 1.26 and the JDK is 1.8. Any Tesseract version earlier than 4.0 
> works fine and extracts the image/PDF data correctly, but upgrading 
> Tesseract OCR to 4.1.1 or higher results in no data extraction. I debugged 
> the issue and found that the ProcessBuilder is not creating the temporary 
> txt output file from which the Tesseract OCR result is extracted. Is this a 
> version compatibility issue, and how can it be resolved?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
