[
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843825#comment-16843825
]
ASF GitHub Bot commented on TIKA-2293:
--------------------------------------
changetoblow commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler
Java version of TesseractOCRParser
URL: https://github.com/apache/tika/pull/158#issuecomment-493919876
Hello, I am now using the Tess4J API to run OCR on images in Tika. There is
no problem when I process standalone images, but parsing a Word file fails
because the file contains embedded images. The error output is as follows:
`
17:33:28.423 [main] ERROR net.sourceforge.tess4j.Tesseract - Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:214)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:194)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:397)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:391)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:264)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:206)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:139)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:156)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.Tika.parseToString(Tika.java:608)
    at org.apache.tika.Tika.parseToString(Tika.java:723)
    at com.tika.test.tt.main(tt.java:20)
17:33:28.423 [main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
net.sourceforge.tess4j.TesseractException: java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:245)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:194)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:397)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:391)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:264)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:206)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:139)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:156)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.Tika.parseToString(Tika.java:608)
    at org.apache.tika.Tika.parseToString(Tika.java:723)
    at com.tika.test.tt.main(tt.java:20)
Caused by: java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:214)
    ... 20 common frames omitted
`
I have tried two approaches. First, I added the dependency through Maven:
<!-- https://mvnrepository.com/artifact/com.github.jai-imageio/jai-imageio-core -->
<dependency>
    <groupId>com.github.jai-imageio</groupId>
    <artifactId>jai-imageio-core</artifactId>
    <version>1.4.0</version>
</dependency>
Second, I included the jar directly, i.e. jai-imageio-core-1.4.0.jar.
Both approaches produce the same error. How can I fix this?
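
One way to narrow this down is to check which image formats the JVM's ImageIO registry can actually read at runtime, and whether any registered reader accepts the bytes of the embedded image. The following is a minimal diagnostic sketch using only the standard javax.imageio API. The class name ImageIoCheck and its helper methods are hypothetical and not part of Tika or Tess4J. It also rests on an assumption this thread does not confirm: that the embedded image may be in a format (for example a vector format such as EMF or WMF, which Word files often contain) that no ImageIO plugin on the classpath handles, in which case adding jai-imageio-core would not change the result.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.TreeSet;
import javax.imageio.ImageIO;
import javax.imageio.stream.ImageInputStream;

// Hypothetical diagnostic helper; not part of Tika or Tess4J.
public class ImageIoCheck {

    /** Returns the lower-cased format names that registered ImageIO readers advertise. */
    static TreeSet<String> readableFormats() {
        // Pick up reader plugins (e.g. jai-imageio-core) found on the classpath.
        ImageIO.scanForPlugins();
        TreeSet<String> names = new TreeSet<>();
        for (String name : ImageIO.getReaderFormatNames()) {
            names.add(name.toLowerCase());
        }
        return names;
    }

    /** Returns true if at least one registered ImageReader claims it can decode the bytes. */
    static boolean canDecode(byte[] data) throws IOException {
        // Avoid the on-disk cache so the check does not depend on a temp directory.
        ImageIO.setUseCache(false);
        try (ImageInputStream in =
                 ImageIO.createImageInputStream(new ByteArrayInputStream(data))) {
            return in != null && ImageIO.getImageReaders(in).hasNext();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Readable formats: " + readableFormats());
        if (args.length > 0) {
            // Pass the path of an image extracted from the Word file.
            byte[] data = Files.readAllBytes(Paths.get(args[0]));
            System.out.println(args[0] + " decodable: " + canDecode(data));
        }
    }
}
```

If the format list does not grow after adding jai-imageio-core, the jar is probably not on the runtime classpath. If the list does grow but canDecode still returns false for the extracted image, the image format itself is likely the problem rather than the plugin installation.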
> Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: Thejan Wijesinghe
> Priority: Major
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the
> command line. The intention of this new parser, "Tess4jOCRParser", is to use
> the Tess4J API instead of executing tesseract out of process via Runtime.exec.