[ 
https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17401093#comment-17401093
 ] 

Tim Allison commented on TIKA-3518:
-----------------------------------

I think I figured out what is going on: there is a bug. At least for 
Tesseract 5.x, and at least on Windows, the tesseract data path must include 
"tessdata". In our code, if the tessdata path is not specified via the 
TikaConfig but the tesseract path is, we use the tesseract path as 
the tessdata path, which is not right. In that case we need to append 
"/tessdata" to the tesseract path.
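
A minimal sketch of the fallback described above, for illustration only. The 
class and method names (TessdataPathSketch, resolveTessdataPath) are 
hypothetical and are not Tika's actual code; the sketch also assumes that an 
unset path arrives as null or an empty string.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: illustrates the intended fallback only.
class TessdataPathSketch {

    /**
     * Returns the tessdata directory to hand to tesseract.
     * An explicitly configured tessdata path always wins; otherwise
     * "tessdata" is appended to the tesseract path instead of reusing
     * the tesseract path unchanged (the behavior identified as the bug).
     */
    static String resolveTessdataPath(String tesseractPath, String tessdataPath) {
        if (tessdataPath != null && !tessdataPath.trim().isEmpty()) {
            return tessdataPath;
        }
        if (tesseractPath == null || tesseractPath.trim().isEmpty()) {
            // Nothing configured: leave it to tesseract's own defaults.
            return "";
        }
        Path candidate = Paths.get(tesseractPath).resolve("tessdata");
        return candidate.toString();
    }

    public static void main(String[] args) {
        // Only the tesseract path is set: "tessdata" is appended.
        System.out.println(resolveTessdataPath("/opt/tesseract", null));
        // An explicit tessdata path is returned unchanged.
        System.out.println(resolveTessdataPath("/opt/tesseract", "/usr/share/tessdata"));
    }
}
```

Using Path.resolve rather than string concatenation lets the platform pick 
the separator, which matters here since the problem was seen on Windows.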

Let me confirm this works on Linux, and I'll fix this before the 2.1.0 release.

> Tika 1.26 not Working with Tesseract 4.0 and Higher Version
> -----------------------------------------------------------
>
>                 Key: TIKA-3518
>                 URL: https://issues.apache.org/jira/browse/TIKA-3518
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr, tika-batch, tika-dl, tika-server
>    Affects Versions: 1.26
>            Reporter: Abha
>            Priority: Major
>
> ProcessBuilder is not creating the tmp file for Tesseract 4.1 and higher 
> versions with Tika 1.26 and JDK 1.8.
> I am working on a project that integrates Tika and Tesseract OCR. The Tika 
> version is 1.26 and the JDK is 1.8. Any Tesseract version earlier than 4.0 
> works fine and extracts the image/PDF data correctly, but upgrading 
> Tesseract to 4.1.1 or higher results in no data being extracted. I debugged 
> the issue and found that the ProcessBuilder is not creating the temporary 
> txt output file from which TesseractOCR extracts the result, which causes 
> the problem. Is this a version compatibility issue, and how can it be 
> resolved?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
