[jira] [Comment Edited] (TIKA-3518) Tika 1.26 not Working with Tesseract 4.0 and Higher Version

Abha (Jira) Mon, 09 Aug 2021 13:19:06 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396258#comment-17396258
 ]


Abha edited comment on TIKA-3518 at 8/9/21, 8:18 PM:
-----------------------------------------------------

I tested it with JDK 11 and still see the same issue.

The ProcessBuilder class in Java does not create the temporary output file for 
Tesseract 4.0 and higher versions. This is the sample code I am using:

    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

    TesseractOCRConfig config = new TesseractOCRConfig();
    String tPath = "C:\\Program Files (x86)\\Tesseract-OCR"; // Tesseract installation directory
    config.setTesseractPath(tPath);

    config.setPageSegMode("3");
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    // Needed to make sure recursive parsing happens.
    parseContext.set(Parser.class, parser);

    // Location of the file to parse.
    FileInputStream stream = new FileInputStream("E://Env//3.jpg");
    Metadata metadata = new Metadata();
    parser.parse(stream, handler, metadata, parseContext);
    System.out.println(metadata);
    String content = handler.toString();
    System.out.println("===============");
    System.out.println(content);

 

Are there any new config changes that need to be added? Could this problem be 
specific to one machine? Also, this is being tested on Windows.
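
One way to narrow down whether the problem is machine-specific is to confirm 
which Tesseract build the JVM actually launches. The following is only a 
sketch, not part of Tika: the class name TesseractVersionCheck and its methods 
are hypothetical, and it assumes the executable prints a line such as 
"tesseract 4.1.1" when called with --version. Standard error is merged into 
standard output so the line is captured whichever stream the build writes to.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
class TesseractVersionCheck {

    // First "<digits>.<digits>" pair in the line, e.g. "4.1" in "tesseract 4.1.1".
    private static final Pattern VERSION = Pattern.compile("(\\d+)\\.(\\d+)");

    /** Returns the major version found in the line, or -1 if none is found. */
    public static int parseMajorVersion(String line) {
        if (line == null) {
            return -1;
        }
        Matcher m = VERSION.matcher(line);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Runs "<executable> --version" and returns the first line it prints. */
    public static String firstVersionLine(String executable)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(executable, "--version")
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String first = r.readLine();
            while (r.readLine() != null) {
                // drain the remaining output so the process can exit
            }
            p.waitFor();
            return first;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java TesseractVersionCheck.java <path-to-tesseract>");
            return;
        }
        String line = firstVersionLine(args[0]);
        System.out.println(line + " -> major version " + parseMajorVersion(line));
    }
}
```

Passing the full path to the executable under the directory given to 
setTesseractPath shows whether that installation, rather than another one on 
the PATH, is the one being tested.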


was (Author: abha.1012):
I tested it with JDK 11 and still see the same issue.

The ProcessBuilder class in Java does not create the temporary output file for 
Tesseract 4.0 and higher versions. This is the sample code I am using:

    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

    TesseractOCRConfig config = new TesseractOCRConfig();
    String tPath = "C:\\Program Files (x86)\\Tesseract-OCR"; // Tesseract installation directory
    config.setTesseractPath(tPath);

    config.setPageSegMode("3");
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    // Needed to make sure recursive parsing happens.
    parseContext.set(Parser.class, parser);

    // Location of the file to parse.
    FileInputStream stream = new FileInputStream("E://Env//3.jpg");
    Metadata metadata = new Metadata();
    parser.parse(stream, handler, metadata, parseContext);
    System.out.println(metadata);
    String content = handler.toString();
    System.out.println("===============");
    System.out.println(content);

Are there any new config changes that need to be added? Could this problem be 
specific to one machine?

> Tika 1.26 not Working with Tesseract 4.0 and Higher Version
> -----------------------------------------------------------
>
>                 Key: TIKA-3518
>                 URL: https://issues.apache.org/jira/browse/TIKA-3518
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr, tika-batch, tika-dl, tika-server
>    Affects Versions: 1.26
>            Reporter: Abha
>            Priority: Major
>
> ProcessBuilder is not creating the temporary file for Tesseract 4.1 and 
> higher versions with Tika 1.26 and JDK 1.8.
> I am working on a project that integrates Tika and Tesseract OCR. The Tika 
> version is 1.26 and the JDK is 1.8. Any Tesseract version earlier than 4.0 
> works fine and extracts the image/PDF data correctly, but upgrading Tesseract 
> OCR to 4.1.1 or higher results in no data being extracted. I debugged the 
> issue and found that the ProcessBuilder is not creating the temporary .txt 
> output file from which the OCR result is read. Is this a version 
> compatibility issue, and how can it be resolved?
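
To separate a Tika problem from a Tesseract installation problem, the missing 
output file can be checked outside Tika by launching Tesseract directly through 
ProcessBuilder. The following is a minimal sketch under stated assumptions, not 
the command Tika itself builds: the class name TesseractCliCheck and its 
methods are hypothetical, it assumes the "eng" language data is installed, and 
it uses the Tesseract 4.x command-line form 
"tesseract <image> <outputbase> -l <lang> --psm <mode>", which writes its text 
to <outputbase>.txt.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika.
class TesseractCliCheck {

    /** Builds: tesseract <image> <outputBase> -l <lang> --psm <psm>. */
    public static List<String> buildCommand(String tesseractDir, String image,
                                            String outputBase, String lang, String psm) {
        // Use the .exe name on Windows, the plain name elsewhere.
        boolean windows = System.getProperty("os.name").startsWith("Windows");
        String exe = windows ? "tesseract.exe" : "tesseract";

        List<String> cmd = new ArrayList<>();
        cmd.add(new File(tesseractDir, exe).getPath());
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(lang);
        cmd.add("--psm");
        cmd.add(psm);
        return cmd;
    }

    /** Runs the command with the tool's output shown on the console; returns the exit code. */
    public static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java TesseractCliCheck.java <tesseractDir> <image>");
            return;
        }
        // Tesseract appends ".txt" to the output base it is given.
        Path dir = Files.createTempDirectory("tess-check");
        Path outputBase = dir.resolve("out");
        Path expected = dir.resolve("out.txt");

        int exit = run(buildCommand(args[0], args[1], outputBase.toString(), "eng", "3"));
        System.out.println("exit code: " + exit);
        System.out.println(expected + " exists: " + Files.exists(expected));
    }
}
```

If the file is missing here as well, any error message Tesseract prints to the 
console (for example about language data or an unrecognized option) is likely 
a better lead than the Tika configuration.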



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
