Lars Francke created NIFI-15098:
-----------------------------------

             Summary: TestExtractMediaMetadata fails when Tesseract ENG data is missing
                 Key: NIFI-15098
                 URL: https://issues.apache.org/jira/browse/NIFI-15098
             Project: Apache NiFi
          Issue Type: Bug
    Affects Versions: 2.6.0
            Reporter: Lars Francke


Running TestExtractMediaMetadata on a system that has Tesseract installed 
but NOT the English Tesseract data files fails:
{noformat}
[pool-3-thread-1] INFO org.apache.tika.parser.ocr.TesseractOCRParser - 
Tesseract is installed and is being invoked. This can add greatly to processing 
time.  If you do not want tesseract to be applied to your files see: 
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
[pool-3-thread-1] ERROR org.apache.nifi.processors.media.ExtractMediaMetadata - 
ExtractMediaMetadata[id=f5217b8a-4ac0-4876-83e0-179346ad855c] Failed to extract 
media metadata from FlowFile[0,16color-10x10.bmp,198B]: 
org.apache.nifi.processor.exception.ProcessException: java.io.IOException: 
org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1 
err msg: Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your 
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
org.apache.nifi.processor.exception.ProcessException: 
java.io.IOException: org.apache.tika.exception.TikaException: 
TesseractOCRParser bad exit value 1 err msg: Error opening data file 
/usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your 
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

[snip]

at 
org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:176)
    at org.apache.nifi.util.MockProcessSession.read(MockProcessSession.java:633)

[snip]

    at 
org.apache.tika.parser.ocr.TesseractOCRParser.runOCRProcess(TesseractOCRParser.java:493)
    at 
org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:447)
    at 
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:334)
    at 
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:276)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at 
org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:106)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:284)
    at 
org.apache.nifi.processors.media.ExtractMediaMetadata.tika_parse(ExtractMediaMetadata.java:200)
    at 
org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:173)
    ... 93 more
org.opentest4j.AssertionFailedError: Expected all Transferred 
FlowFiles to go to success but 1 were routed to failure

[snip] {noformat}
I see multiple options to fix this:
 * Ignore the failure
 * At least document the behavior for the affected tests (testBmp and 
testJpg)
 * Disable OCR entirely, assuming OCR is not needed to extract the metadata 
these tests check
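
For the documentation option, it may help to make the failing precondition explicit. The following is a minimal sketch, not code from NiFi or Tika: the class name TessdataCheck and both methods are hypothetical, and the lookup order (TESSDATA_PREFIX first, then the /usr/share/tessdata path from the error message above) is an assumption, since the real search path depends on how Tesseract was built.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of NiFi or Tika. */
final class TessdataCheck {

    private TessdataCheck() {
    }

    /** Returns true if "<lang>.traineddata" is a regular file in the given directory. */
    static boolean hasLanguage(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    /**
     * Picks the directory to inspect: TESSDATA_PREFIX if it is set, otherwise
     * the path seen in the error message. Both choices are assumptions.
     */
    static Path defaultTessdataDir() {
        String prefix = System.getenv("TESSDATA_PREFIX");
        return (prefix == null || prefix.isEmpty())
                ? Paths.get("/usr/share/tessdata")
                : Paths.get(prefix);
    }

    public static void main(String[] args) {
        Path dir = defaultTessdataDir();
        System.out.println(dir + " has eng data: " + hasLanguage(dir, "eng"));
    }
}
```

A check like this could back a note in the test documentation, or a JUnit assumption in testBmp and testJpg. It only matters on systems where the Tesseract binary itself is installed, because that is the only case in which the tests fail.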

 

### Disabling OCR

This is mentioned in the error message as well: 
[https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr]

 

That page has the following snippet:
{code:java}
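import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;

// Tell the Tesseract parser to skip OCR for this parse call
TesseractOCRConfig config = new TesseractOCRConfig();
config.setSkipOcr(true);
ParseContext context = new ParseContext();
context.set(TesseractOCRConfig.class, config);

// inputStream, handler and metadata are supplied by the caller
Parser parser = new AutoDetectParser();
parser.parse(inputStream, handler, metadata, context); {code}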
I tried this snippet, and with it the tests pass even without the Tesseract 
data files installed.

The tests only check the extracted metadata, so OCR does not seem to be 
needed. I assume skipping it will also give a nice speed boost, which makes 
this my preferred solution. If you agree, I can put up a PR.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
