Konrad Holl created CONNECTORS-1287:
---------------------------------------
Summary: Additional TikaOCR Configuration Options
Key: CONNECTORS-1287
URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
Project: ManifoldCF
Issue Type: Improvement
Components: Tika extractor
Affects Versions: ManifoldCF 2.3
Reporter: Konrad Holl
Priority: Minor
For a client project, I needed to enable OCR for images embedded in PDFs.
Unfortunately, ManifoldCF does not provide configuration options for this.
It would be nice to have the following options for Tika content extraction:
1. Enable PDF image extraction for OCR:
https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
2. Set the default language for Tesseract:
https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
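
To make the request more concrete, here is a rough sketch of how the two options might be carried as connector parameters. The class name TikaOcrOptions and the parameter names "extractInlineImages" and "ocrLanguage" are hypothetical and not part of ManifoldCF. The defaults (false and "eng") are meant to mirror what I understand Tika's own defaults to be. The sketch uses only the JDK so it runs without Tika on the classpath; the Tika calls from the two links above are shown as comments.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical holder for the two proposed options.
 * Nothing here is existing ManifoldCF API; names are for illustration only.
 */
final class TikaOcrOptions {
    // Assumed parameter names (made up for this sketch).
    static final String PARAM_EXTRACT_INLINE_IMAGES = "extractInlineImages";
    static final String PARAM_OCR_LANGUAGE = "ocrLanguage";

    final boolean extractInlineImages;
    final String ocrLanguage;

    TikaOcrOptions(boolean extractInlineImages, String ocrLanguage) {
        this.extractInlineImages = extractInlineImages;
        this.ocrLanguage = ocrLanguage;
    }

    /** Reads both options from a parameter map, falling back to defaults. */
    static TikaOcrOptions fromParameters(Map<String, String> params) {
        boolean extract = Boolean.parseBoolean(
                params.getOrDefault(PARAM_EXTRACT_INLINE_IMAGES, "false"));
        String language = params.getOrDefault(PARAM_OCR_LANGUAGE, "eng");
        return new TikaOcrOptions(extract, language);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(PARAM_EXTRACT_INLINE_IMAGES, "true");
        params.put(PARAM_OCR_LANGUAGE, "deu");

        TikaOcrOptions options = fromParameters(params);
        System.out.println("extractInlineImages=" + options.extractInlineImages
                + ", ocrLanguage=" + options.ocrLanguage);

        // With Tika on the classpath, the values could then be applied
        // roughly like this before parsing (not compiled here):
        //
        //   PDFParserConfig pdfConfig = new PDFParserConfig();
        //   pdfConfig.setExtractInlineImages(options.extractInlineImages);
        //   TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        //   ocrConfig.setLanguage(options.ocrLanguage);
        //   parseContext.set(PDFParserConfig.class, pdfConfig);
        //   parseContext.set(TesseractOCRConfig.class, ocrConfig);
    }
}
```

Leaving both parameters unset would keep the current behaviour (no inline image extraction), so existing jobs should not be affected.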
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)