[
https://issues.apache.org/jira/browse/CONNECTORS-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karl Wright updated CONNECTORS-1287:
------------------------------------
Fix Version/s: ManifoldCF 2.5
(was: ManifoldCF 2.4)
> Additional TikaOCR Configuration Options
> ----------------------------------------
>
> Key: CONNECTORS-1287
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.3
> Reporter: Konrad Holl
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 2.5
>
>
> For a client project I needed to enable OCR for images inside PDFs.
> Unfortunately ManifoldCF does not provide configuration options to handle
> this. It would be nice to have these options for the Tika content extraction:
> 1. Enable PDF image extraction for OCR:
> https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
> 2. Set default language for tesseract:
> https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
> Tika OCR is based on tesseract, an open source OCR library initially developed
> by Hewlett-Packard and later continued by Google. It is available from
> https://github.com/tesseract-ocr/tesseract . It needs to be installed with
> the tesseract binary available in the PATH environment variable;
> alternatively, the path can be set using a Tika API method. Once it is installed
> and Tika is configured correctly, it works like a charm.
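
The PATH requirement described above can be checked up front. The following is only a rough sketch using the Java standard library, not anything taken from ManifoldCF or Tika: the class name TesseractLocator and the helper findOnPath are hypothetical, and the executable names "tesseract" and "tesseract.exe" are assumptions about a typical install.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper: looks for an executable in a PATH-style string.
public class TesseractLocator {

    /** Returns the first executable file matching one of the names, if any. */
    public static Optional<Path> findOnPath(String pathValue, String... names) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // PATH entries are separated by ':' on Unix and ';' on Windows.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed executable names; adjust for the local installation.
        Optional<Path> hit = findOnPath(System.getenv("PATH"), "tesseract", "tesseract.exe");
        System.out.println(hit.map(p -> "tesseract found at " + p)
                              .orElse("tesseract not found on PATH"));
    }
}
```

A check like this could run before a crawl starts, so that a missing binary is reported once rather than discovered per document.
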
> When indexing images or PDFs containing images instead of real text, OCR is
> necessary for making those documents searchable.
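
For reference, the two options requested above map onto plain Tika calls roughly as follows. This is a minimal sketch that assumes the Tika parser jars are on the classpath; the class OcrParseContextSketch and the method buildContext are hypothetical names, and nothing here reflects how the ManifoldCF Tika extractor would actually expose these settings.

```java
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;

// Hypothetical sketch: builds a ParseContext carrying the two requested options.
public class OcrParseContextSketch {

    public static ParseContext buildContext(Parser parser, String ocrLanguage) {
        // Option 1: extract inline images from PDFs so they can be handed to OCR.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        // Option 2: set the tesseract language (e.g. "eng" or "deu").
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage(ocrLanguage);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Registering the parser lets embedded documents (the extracted
        // images) be parsed recursively.
        context.set(Parser.class, parser);
        return context;
    }

    public static void main(String[] args) {
        ParseContext context = buildContext(new AutoDetectParser(), "deu");
        System.out.println("OCR language: "
                + context.get(TesseractOCRConfig.class).getLanguage());
    }
}
```

The resulting context would then be passed as the last argument of Parser.parse(stream, handler, metadata, context). Inline image extraction is off by default in Tika, so a connector option for it would presumably default to off as well.
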
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)