[ https://issues.apache.org/jira/browse/CONNECTORS-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187480#comment-15187480 ]
Karl Wright commented on CONNECTORS-1287:
-----------------------------------------
It looks like the Tika Javadoc carries serious warnings about the inline image
extraction that OCR on PDFs relies on:
{code}
public void setExtractInlineImages(boolean extractInlineImages)
If true, extract inline embedded OBXImages. Beware: some PDF documents of
modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB.
Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory
consumption and/or out of memory errors. Set to true with caution.
The default is false.
{code}
It looks like there is also native third-party (JNI?) code involved for OCR in
general:
{code}
public void setTesseractPath(String tesseractPath)
Set tesseract installation folder, needed if it is not on system path.
{code}
To date, we have not included any code in the connector world that would
require JNI libraries. I'd like to hear a more detailed description of what
exactly Tika requires to do OCR -- i.e. where does Tesseract come from?
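
The setTesseractPath note quoted above implies that Tesseract has to be either
resolvable through the system path or pointed to explicitly. A minimal sketch of
a pre-flight check a deployment could run before enabling OCR follows. The class
TesseractLocator and the helper findOnPath are hypothetical names used only for
illustration; they are not part of Tika or ManifoldCF, and the sketch makes no
claim about how Tika itself locates or invokes Tesseract.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Tika or ManifoldCF.
public class TesseractLocator {

    /**
     * Looks in each directory of a PATH-style string for an executable
     * regular file with the given name and returns the first match.
     */
    public static Optional<Path> findOnPath(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, executable);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip PATH entries that are not valid paths on this platform.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: the binary is named "tesseract" ("tesseract.exe" on Windows).
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        String name = windows ? "tesseract.exe" : "tesseract";

        Optional<Path> found = findOnPath(name, System.getenv("PATH"));
        if (found.isPresent()) {
            System.out.println("tesseract found at " + found.get());
        } else {
            System.out.println("tesseract is not on the system path; "
                    + "an explicit setTesseractPath(...) value would be needed");
        }
    }
}
```

If the check comes back empty, the installation folder would have to be supplied
through the connector configuration and passed on to setTesseractPath.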
> Additional TikaOCR Configuration Options
> ----------------------------------------
>
> Key: CONNECTORS-1287
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Tika extractor
> Affects Versions: ManifoldCF 2.3
> Reporter: Konrad Holl
> Priority: Minor
>
> For a client project, I needed to enable OCR for images inside PDFs.
> Unfortunately, ManifoldCF does not provide configuration options to handle
> this. It would be nice to have these options for the Tika content extraction:
> 1. Enable PDF image extraction for OCR:
> https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
> 2. Set default language for tesseract:
> https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29