[
https://issues.apache.org/jira/browse/NIFI-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lars Francke reassigned NIFI-15098:
-----------------------------------
Assignee: Lars Francke
> TestExtractMediaMetadata fails when Tesseract ENG data is missing
> -----------------------------------------------------------------
>
> Key: NIFI-15098
> URL: https://issues.apache.org/jira/browse/NIFI-15098
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Lars Francke
> Assignee: Lars Francke
> Priority: Minor
>
> Running TestExtractMediaMetadata on a system that has Tesseract installed
> but NOT the English Tesseract data files (eng.traineddata) fails:
> {noformat}
> [pool-3-thread-1] INFO org.apache.tika.parser.ocr.TesseractOCRParser -
> Tesseract is installed and is being invoked. This can add greatly to
> processing time. If you do not want tesseract to be applied to your files
> see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
> [pool-3-thread-1] ERROR org.apache.nifi.processors.media.ExtractMediaMetadata
> - ExtractMediaMetadata[id=f5217b8a-4ac0-4876-83e0-179346ad855c] Failed to
> extract media metadata from FlowFile[0,16color-10x10.bmp,198B]:
> org.apache.nifi.processor.exception.ProcessException: java.io.IOException:
> org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1
> err msg: Error opening data file /usr/share/tessdata/eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your
> "tessdata" directory.
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
> org.apache.nifi.processor.exception.ProcessException:
> java.io.IOException: org.apache.tika.exception.TikaException:
> TesseractOCRParser bad exit value 1 err msg: Error opening data file
> /usr/share/tessdata/eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your
> "tessdata" directory.
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
> [snip]
> at
> org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:176)
> at
> org.apache.nifi.util.MockProcessSession.read(MockProcessSession.java:633)
> [snip]
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.runOCRProcess(TesseractOCRParser.java:493)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:447)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:334)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:276)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
> at
> org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:106)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:284)
> at
> org.apache.nifi.processors.media.ExtractMediaMetadata.tika_parse(ExtractMediaMetadata.java:200)
> at
> org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:173)
> ... 93 more
> org.opentest4j.AssertionFailedError: Expected all Transferred
> FlowFiles to go to success but 1 were routed to failure
> [snip] {noformat}
> I see multiple options to fix this:
> * Ignore the failure
> * At least document the behavior for the affected tests (testBmp and
> testJpg)
> * Assuming that OCR is not needed to extract metadata here, disable OCR
> entirely
>
> ### Disabling OCR
> This is mentioned in the error message as well:
> [https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr]
>
> That page has this snippet:
> {code:java}
> TesseractOCRConfig config = new TesseractOCRConfig();
> config.setSkipOcr(true);
> ParseContext context = new ParseContext();
> context.set(TesseractOCRConfig.class, config);
>
> Parser parser = new AutoDetectParser();
> parser.parse(inputStream, handler, metadata, context); {code}
> I tried this snippet and it makes the tests pass even without the Tesseract
> data files installed.
> Since the tests only check the extracted metadata, OCR does not seem to be
> needed. I assume skipping it will also speed the tests up, so this is my
> preferred solution. If you agree, I can put up a PR.
>
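The failure condition described above (Tesseract installed, but `eng.traineddata` missing) can be detected up front, which may help when documenting or reproducing the behavior. The following is a minimal sketch only, using nothing beyond the JDK. The class `TessdataCheck` and its methods are hypothetical and not part of NiFi or Tika, and the fallback directory is simply the path that appears in the error message; other installations may use a different location.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of NiFi or Tika.
class TessdataCheck {

    // Assumption: fallback taken from the path shown in the error message.
    static final String FALLBACK_TESSDATA_DIR = "/usr/share/tessdata";

    // Returns the directory named by TESSDATA_PREFIX if it is set,
    // otherwise the fallback directory.
    static Path resolveTessdataDir(String tessdataPrefix) {
        if (tessdataPrefix == null || tessdataPrefix.trim().isEmpty()) {
            return Paths.get(FALLBACK_TESSDATA_DIR);
        }
        return Paths.get(tessdataPrefix);
    }

    // True if "<language>.traineddata" exists as a regular file in the directory.
    static boolean hasLanguageData(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        Path dir = resolveTessdataDir(System.getenv("TESSDATA_PREFIX"));
        System.out.println("tessdata dir: " + dir);
        System.out.println("eng data present: " + hasLanguageData(dir, "eng"));
    }
}
```

A test could pass the result of `hasLanguageData(dir, "eng")` to JUnit 5's `Assumptions.assumeTrue` to skip instead of fail. That only works around the symptom, though, whereas disabling OCR removes the dependency on the data files altogether.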
--
This message was sent by Atlassian Jira
(v8.20.10#820010)