[jira] [Commented] (CONNECTORS-1287) Additional TikaOCR Configuration Options

Karl Wright (JIRA) Mon, 04 Apr 2016 22:38:36 -0700

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225715#comment-15225715
 ]


Karl Wright commented on CONNECTORS-1287:
-----------------------------------------

I'm pushing this on to MCF 2.5.

I think that as long as Tesseract is properly installed on all machines in a 
cluster, it's OK to have a JNI dependency as a requirement.  Model files, 
however, need to be worked out.  Specifically, if there is any need to select a 
model for the OCR configuration, the model files should be handled in a manner 
similar to how the OpenNLP integration does it: there's a well-known and 
configured folder that these model files must be found in.  I don't know enough 
about Tesseract to know if this is going to be a problem or not though.
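
A minimal sketch of the "well-known folder" idea, not an actual ManifoldCF or 
Tika API. It assumes Tesseract's usual naming, where the model for language 
"eng" is a file called eng.traineddata and several languages are combined as 
"eng+deu". The class name OcrModelLocator and the single-folder layout are 
hypothetical, and language names containing a subfolder are deliberately 
rejected to keep the lookup inside the configured folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of ManifoldCF or Tika.
final class OcrModelLocator {
  private OcrModelLocator() {}

  /**
   * Resolves the model files for a language spec such as "eng" or "eng+deu"
   * inside one configured folder. Fails if any requested model is missing,
   * so a misconfigured cluster node is detected before OCR is attempted.
   */
  static List<Path> resolve(Path modelDir, String languageSpec) throws IOException {
    List<Path> models = new ArrayList<>();
    for (String lang : languageSpec.split("\\+")) {
      // Allow only simple names so the lookup cannot leave modelDir.
      if (!lang.matches("[A-Za-z0-9_]+")) {
        throw new IllegalArgumentException("Invalid language code: '" + lang + "'");
      }
      Path model = modelDir.resolve(lang + ".traineddata");
      if (!Files.isRegularFile(model)) {
        throw new IOException("Missing OCR model file: " + model);
      }
      models.add(model);
    }
    return models;
  }
}
```

A connector could call OcrModelLocator.resolve(configuredDir, "eng+deu") once 
at configuration time and refuse to save the settings if it throws.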


> Additional TikaOCR Configuration Options
> ----------------------------------------
>
>                 Key: CONNECTORS-1287
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Konrad Holl
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.5
>
>
> For a client project I needed to enable OCR for images inside PDFs. 
> Unfortunately ManifoldCF does not provide configuration options to handle 
> this. It would be nice to have these options for the Tika content extraction:
> 1.    Enable PDF image extraction for OCR: 
> https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
> 2.    Set default language for tesseract: 
> https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
> Tika OCR is based on tesseract, an Open Source OCR library initially developed 
> by Hewlett-Packard and later continued by Google. It is available from 
> https://github.com/tesseract-ocr/tesseract . It needs to be installed with 
> the tesseract binary available in the PATH environment variable - 
> alternatively it can be set using a Tika API method. Once it is installed 
> and Tika is configured correctly, it works like a charm.
> When indexing images or PDFs containing images instead of real text, OCR is 
> necessary for making those documents searchable.
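
The PATH requirement described in the quoted text can be checked up front. 
The following is a rough sketch using only the Java standard library; the 
class name TesseractPathCheck is hypothetical, and the binary name is passed 
in by the caller because it differs by platform (for example "tesseract" on 
Linux, "tesseract.exe" on Windows).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of ManifoldCF or Tika.
final class TesseractPathCheck {
  private TesseractPathCheck() {}

  /**
   * Searches each directory of a PATH-style string for an executable file
   * with the given name and returns the first match, if any.
   */
  static Optional<Path> findOnPath(String pathValue, String binaryName) {
    if (pathValue == null || pathValue.isEmpty()) {
      return Optional.empty();
    }
    for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
      if (dir.isEmpty()) {
        continue;
      }
      try {
        Path candidate = Paths.get(dir, binaryName);
        if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
          return Optional.of(candidate);
        }
      } catch (InvalidPathException e) {
        // Skip PATH entries that are not valid paths on this platform.
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    String name = args.length > 0 ? args[0] : "tesseract";
    Optional<Path> hit = findOnPath(System.getenv("PATH"), name);
    System.out.println(hit.map(p -> "Found " + p).orElse(name + " not found on PATH"));
  }
}
```

If nothing is found, the alternative mentioned in the quoted text applies: 
set the binary location explicitly through the Tika OCR configuration.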



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
