[jira] [Commented] (CONNECTORS-1287) Additional TikaOCR Configuration Options

Karl Wright (JIRA) Wed, 09 Mar 2016 09:39:03 -0800

    [ https://issues.apache.org/jira/browse/CONNECTORS-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187480#comment-15187480 ]


Karl Wright commented on CONNECTORS-1287:
-----------------------------------------

It looks like the Tika settings needed for OCR come with strong warnings:

{code}
public void setExtractInlineImages(boolean extractInlineImages)
If true, extract inline embedded OBXImages. Beware: some PDF documents of 
modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. 
Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory 
consumption and/or out of memory errors. Set to true with caution.
The default is false.
{code}

OCR in general also appears to depend on native third-party code (possibly via 
JNI):

{code}
public void setTesseractPath(String tesseractPath)
Set tesseract installation folder, needed if it is not on system path.
{code}

To date, we have not included any code in the connector world that would 
require JNI libraries.  I'd like to hear a more detailed description of exactly 
what Tika requires to do OCR; in particular, where does Tesseract come from?
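
For what it is worth, the Javadoc wording above ("installation folder, needed if it is not on system path") reads like a lookup of a separately installed executable rather than a library bundled in a jar. A minimal sketch of that kind of lookup, using only the JDK, might look like the following. The class `TesseractLocator` and its methods are hypothetical names for illustration and are not Tika's code; how Tika actually locates and invokes Tesseract is an assumption here and would need to be confirmed.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Tika or ManifoldCF.
final class TesseractLocator {

    /** File name of the executable; assumed to be tesseract.exe on Windows. */
    public static String executableName() {
        String os = System.getProperty("os.name", "").toLowerCase();
        return os.startsWith("windows") ? "tesseract.exe" : "tesseract";
    }

    /** True if the path is a regular file that the JVM is allowed to execute. */
    private static boolean isUsable(Path candidate) {
        return Files.isRegularFile(candidate) && Files.isExecutable(candidate);
    }

    /**
     * Looks for the executable first in an explicitly configured folder
     * (may be null or empty), then in each entry of the given search path
     * (for example the value of the PATH environment variable).
     */
    public static Optional<Path> locate(String configuredFolder, String searchPath) {
        if (configuredFolder != null && !configuredFolder.isEmpty()) {
            Path candidate = Paths.get(configuredFolder, executableName());
            if (isUsable(candidate)) {
                return Optional.of(candidate);
            }
        }
        if (searchPath != null) {
            for (String dir : searchPath.split(File.pathSeparator)) {
                if (dir.isEmpty()) {
                    continue; // skip empty PATH entries
                }
                Path candidate = Paths.get(dir, executableName());
                if (isUsable(candidate)) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String folder = args.length > 0 ? args[0] : null;
        Optional<Path> found = locate(folder, System.getenv("PATH"));
        System.out.println(found
            .map(p -> "tesseract found at " + p)
            .orElse("tesseract not found; OCR would be unavailable"));
    }
}
```

If that reading is right, a connector option for OCR would amount to an optional folder setting plus a check like this at configuration time, so that a missing installation is reported up front instead of failing during a crawl.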


> Additional TikaOCR Configuration Options
> ----------------------------------------
>
>                 Key: CONNECTORS-1287
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Konrad Holl
>            Priority: Minor
>
> For a client project I needed to enable OCR for images inside PDFs. 
> Unfortunately, ManifoldCF does not provide configuration options for this. 
> It would be nice to have these options for Tika content extraction:
> 1. Enable PDF image extraction for OCR: 
> https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
> 2. Set the default language for Tesseract: 
> https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
