[ 
https://issues.apache.org/jira/browse/CONNECTORS-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1287:
------------------------------------
    Fix Version/s: ManifoldCF 2.4

> Additional TikaOCR Configuration Options
> ----------------------------------------
>
>                 Key: CONNECTORS-1287
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Konrad Holl
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.4
>
>
> For a client project I needed to enable OCR for images embedded in PDFs. 
> Unfortunately, ManifoldCF does not expose configuration options for 
> this. It would be nice to have the following options for Tika content extraction:
> 1.    Enable PDF image extraction for OCR: 
> https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
> 2.    Set default language for Tesseract: 
> https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
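
For illustration only, here is a minimal sketch of how the two requested options might be read from connector parameters before being handed to Tika. The class name TikaOcrOptions and the parameter keys "extractInlineImages" and "ocrLanguage" are hypothetical and are not taken from ManifoldCF. The defaults (false and "eng") are assumed to mirror Tika's own defaults. The Tika calls appear only in comments so that the sketch compiles without Tika on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical holder for the two options requested in this issue.
 * Not part of ManifoldCF or Tika; all names here are illustrative.
 */
final class TikaOcrOptions {
    // Would be passed to PDFParserConfig.setExtractInlineImages(boolean).
    final boolean extractInlineImages;
    // Would be passed to TesseractOCRConfig.setLanguage(String).
    final String ocrLanguage;

    private TikaOcrOptions(boolean extractInlineImages, String ocrLanguage) {
        this.extractInlineImages = extractInlineImages;
        this.ocrLanguage = ocrLanguage;
    }

    /** Reads the options from string parameters, falling back to assumed defaults. */
    static TikaOcrOptions fromParameters(Map<String, String> params) {
        boolean extract =
            Boolean.parseBoolean(params.getOrDefault("extractInlineImages", "false"));
        String language = params.getOrDefault("ocrLanguage", "eng").trim();
        if (language.isEmpty()) {
            throw new IllegalArgumentException("ocrLanguage must not be empty");
        }
        return new TikaOcrOptions(extract, language);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("extractInlineImages", "true");
        params.put("ocrLanguage", "deu");
        TikaOcrOptions options = fromParameters(params);
        System.out.println("extractInlineImages=" + options.extractInlineImages
            + ", ocrLanguage=" + options.ocrLanguage);

        // With Tika on the classpath, the values would be applied roughly as:
        //   PDFParserConfig pdfConfig = new PDFParserConfig();
        //   pdfConfig.setExtractInlineImages(options.extractInlineImages);
        //   TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        //   ocrConfig.setLanguage(options.ocrLanguage);
        //   ParseContext context = new ParseContext();
        //   context.set(PDFParserConfig.class, pdfConfig);
        //   context.set(TesseractOCRConfig.class, ocrConfig);
    }
}
```

Keeping the parsed values in a small immutable holder separates parameter validation from the Tika wiring, so a bad value fails early. How the Tika extractor would actually store and expose these settings is left open here.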



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
