Konrad Holl created CONNECTORS-1287:
---------------------------------------

             Summary: Additional TikaOCR Configuration Options
                 Key: CONNECTORS-1287
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Tika extractor
    Affects Versions: ManifoldCF 2.3
            Reporter: Konrad Holl
            Priority: Minor


For a client project I needed to enable OCR for images embedded in PDFs.
ManifoldCF currently provides no configuration options for this.
It would be useful if the Tika content extractor exposed the following options:

1.      Enable PDF image extraction for OCR: 
https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
2.      Set default language for tesseract: 
https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
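
For illustration, here is a minimal sketch of how the two settings above might be applied on the Tika side. The class name TikaOcrContextSketch and the method buildContext are hypothetical and not part of ManifoldCF or Tika. The sketch assumes the Tika parsers (including the PDF and OCR parser classes) are on the classpath, and that the two values would come from the connector's configuration.

```java
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;

// Hypothetical helper: builds a ParseContext carrying the two requested options.
public class TikaOcrContextSketch {

    public static ParseContext buildContext(boolean extractInlineImages, String ocrLanguage) {
        // Option 1: let the PDF parser extract inline images so they can be OCR'd.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(extractInlineImages);

        // Option 2: set the language tesseract uses (for example "eng" or "deu").
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage(ocrLanguage);

        // Parsers look up their configuration objects in the ParseContext by class.
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        return context;
    }
}
```

The returned context would then be passed to the parser's parse call. In practice the context usually also needs a Parser registered under Parser.class so that extracted images are handed on to the OCR parser; that step is omitted here to keep the sketch short.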

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)