Lars Francke created NIFI-15098:
-----------------------------------
Summary: TestExtractMediaMetadata fails when Tesseract ENG data is
missing
Key: NIFI-15098
URL: https://issues.apache.org/jira/browse/NIFI-15098
Project: Apache NiFi
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lars Francke
Running TestExtractMediaMetadata on a system that has Tesseract installed
but NOT the English Tesseract data files fails:
{noformat}
[pool-3-thread-1] INFO org.apache.tika.parser.ocr.TesseractOCRParser -
Tesseract is installed and is being invoked. This can add greatly to processing
time. If you do not want tesseract to be applied to your files see:
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
[pool-3-thread-1] ERROR org.apache.nifi.processors.media.ExtractMediaMetadata -
ExtractMediaMetadata[id=f5217b8a-4ac0-4876-83e0-179346ad855c] Failed to extract
media metadata from FlowFile[0,16color-10x10.bmp,198B]:
org.apache.nifi.processor.exception.ProcessException: java.io.IOException:
org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1
err msg: Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
org.apache.nifi.processor.exception.ProcessException:
java.io.IOException: org.apache.tika.exception.TikaException:
TesseractOCRParser bad exit value 1 err msg: Error opening data file
/usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
[snip]
at
org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:176)
at org.apache.nifi.util.MockProcessSession.read(MockProcessSession.java:633)
[snip]
at
org.apache.tika.parser.ocr.TesseractOCRParser.runOCRProcess(TesseractOCRParser.java:493)
at
org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:447)
at
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:334)
at
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:276)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
at
org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:106)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:284)
at
org.apache.nifi.processors.media.ExtractMediaMetadata.tika_parse(ExtractMediaMetadata.java:200)
at
org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:173)
... 93 more
org.opentest4j.AssertionFailedError: Expected all Transferred
FlowFiles to go to success but 1 were routed to failure
[snip]
{noformat}
I see multiple options to fix this:
* Ignore the failure
* At least document the behavior for the affected tests (testBmp and
testJpg)
* Assuming OCR is not needed to extract metadata, disable OCR
entirely
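For the first two options, the failing precondition can at least be detected up front. The following is a minimal sketch, not existing NiFi or Tika code: the class and method names are hypothetical, the default directory is simply the path from the error output above, and it assumes TESSDATA_PREFIX points directly at the "tessdata" directory, as the error message suggests. Other Tesseract versions or installations may lay this out differently.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper; not part of NiFi or Tika. */
final class TessdataCheck {

    // Assumption: default taken from the path in the error output above.
    private static final String DEFAULT_TESSDATA_DIR = "/usr/share/tessdata";

    private TessdataCheck() {
    }

    /** Returns true if "<lang>.traineddata" exists as a regular file in tessdataDir. */
    static boolean hasLanguageData(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    /** Uses TESSDATA_PREFIX when it is set and non-empty, otherwise the assumed default. */
    static Path tessdataDir() {
        String prefix = System.getenv("TESSDATA_PREFIX");
        boolean unset = prefix == null || prefix.isEmpty();
        return Paths.get(unset ? DEFAULT_TESSDATA_DIR : prefix);
    }
}
```

A test could call TessdataCheck.hasLanguageData(TessdataCheck.tessdataDir(), "eng") and skip itself when the result is false, for example through a JUnit assumption. That would only hide the problem on affected systems, which is one reason I prefer the third option.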
### Disabling OCR
The error message mentions this as well:
[https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr]
That page has this snippet:
{code:java}
TesseractOCRConfig config = new TesseractOCRConfig();
config.setSkipOcr(true);
ParseContext context = new ParseContext();
context.set(TesseractOCRConfig.class, config);
Parser parser = new AutoDetectParser();
parser.parse(inputStream, handler, metadata, context);
{code}
I tried this snippet, and with it the tests pass even without the Tesseract
data files installed.
Since the tests check the extracted metadata, OCR does not seem to be needed.
I assume skipping OCR will also give a nice speed boost, so this is my
preferred solution. If you agree, I can put up a PR.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)