[
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843830#comment-16843830
]
ASF GitHub Bot commented on TIKA-2293:
--------------------------------------
changetoblow commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler
Java version of TesseractOCRParser
URL: https://github.com/apache/tika/pull/158#issuecomment-493923095
17:33:28.423 [main] ERROR net.sourceforge.tess4j.Tesseract - Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:214)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:194)
at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:397)
at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:391)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:264)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:206)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:139)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:156)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.Tika.parseToString(Tika.java:608)
at org.apache.tika.Tika.parseToString(Tika.java:723)
at com.tika.test.tt.main(tt.java:20)
17:33:28.423 [main] WARN org.apache.tika.parser.ocr.TesseractOCRParser - java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
net.sourceforge.tess4j.TesseractException: java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:245)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:194)
at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:397)
at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:104)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:391)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart(AbstractOOXMLExtractor.java:264)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:206)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:139)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:156)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.Tika.parseToString(Tika.java:608)
at org.apache.tika.Tika.parseToString(Tika.java:723)
at com.tika.test.tt.main(tt.java:20)
Caused by: java.lang.RuntimeException: Unsupported image format. May need to install JAI Image I/O package.
https://github.com/jai-imageio/jai-imageio-core
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:214)
... 20 common frames omitted
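
One way to narrow this down might be to check whether the JVM has any javax.imageio reader registered for the embedded image before it is handed to Tess4J. The sketch below is only a diagnostic aid and rests on an assumption the trace does not confirm: that the error is raised because no ImageIO reader recognises the image bytes. The class name `ImageReaderCheck` and the methods `hasReader` and `samplePng` are hypothetical names made up for this example; they are not part of Tika or Tess4J.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper, not part of Tika or Tess4J.
public class ImageReaderCheck {

    /** Returns true if some registered ImageIO reader claims it can decode these bytes. */
    public static boolean hasReader(byte[] imageBytes) {
        // Keep the stream cache in memory so no temporary file is needed.
        ImageIO.setUseCache(false);
        try (ImageInputStream in =
                 ImageIO.createImageInputStream(new ByteArrayInputStream(imageBytes))) {
            if (in == null) {
                return false; // no stream provider could wrap the input
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            return readers.hasNext();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory PNG so the example needs no files on disk. */
    public static byte[] samplePng() {
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(img, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Lists the formats this JVM can read; extra plugins on the classpath add entries here.
        System.out.println("Reader formats: " + Arrays.toString(ImageIO.getReaderFormatNames()));
        System.out.println("PNG has reader: " + hasReader(samplePng()));
        byte[] notAnImage = "not an image".getBytes(StandardCharsets.US_ASCII);
        System.out.println("Plain text has reader: " + hasReader(notAnImage));
    }
}
```

If `hasReader` returns false for the bytes of the image embedded in the failing document, the format is probably one the stock JVM readers do not cover, and adding an ImageIO plugin such as the jai-imageio-core package linked in the error message may be worth trying.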
> Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: Thejan Wijesinghe
> Priority: Major
>
> Right now, TesseractOCRParser calls tesseract and ImageMagick from the command
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J
> API instead of executing tesseract out of process via Runtime.exec.