[ 
https://issues.apache.org/jira/browse/CONNECTORS-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konrad Holl updated CONNECTORS-1287:
------------------------------------
    Description: 
For a client project I needed to enable OCR for images inside PDFs. 
Unfortunately ManifoldCF does not provide configuration options to handle this. 
It would be nice to have these options for the Tika content extraction:

1.      Enable PDF image extraction for OCR: 
https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
2.      Set default language for tesseract: 
https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
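
As a rough illustration of the two options above, here is a minimal sketch of how a connector might read them from its configuration. The class name TikaOcrOptions, the parameter keys, and the "eng" fallback are assumptions made for this example and are not existing ManifoldCF names; the resulting values would then be handed to the two Tika setters linked above.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical holder for the two OCR-related settings requested in this issue.
 * Nothing here is existing ManifoldCF API; names are illustrative only.
 */
final class TikaOcrOptions {
    // Hypothetical configuration keys (assumptions for this sketch).
    static final String KEY_EXTRACT_INLINE_IMAGES = "extractInlineImages";
    static final String KEY_OCR_LANGUAGE = "ocrLanguage";

    // Intended for PDFParserConfig.setExtractInlineImages(boolean).
    final boolean extractInlineImages;
    // Intended for TesseractOCRConfig.setLanguage(String).
    final String ocrLanguage;

    TikaOcrOptions(boolean extractInlineImages, String ocrLanguage) {
        this.extractInlineImages = extractInlineImages;
        this.ocrLanguage = ocrLanguage;
    }

    /** Builds the options from plain string parameters, applying defaults. */
    static TikaOcrOptions fromParameters(Map<String, String> params) {
        // Image extraction stays off unless the value is exactly "true" (case-insensitive).
        boolean extract = Boolean.parseBoolean(params.get(KEY_EXTRACT_INLINE_IMAGES));

        // Fall back to "eng" when no language is configured (assumed default for this sketch).
        String lang = params.get(KEY_OCR_LANGUAGE);
        if (lang == null || lang.trim().isEmpty()) {
            lang = "eng";
        } else {
            lang = lang.trim();
        }
        return new TikaOcrOptions(extract, lang);
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(KEY_EXTRACT_INLINE_IMAGES, "true");
        params.put(KEY_OCR_LANGUAGE, "deu");
        TikaOcrOptions options = fromParameters(params);
        System.out.println(options.extractInlineImages + " " + options.ocrLanguage);
    }
}
```

Keeping the parsing separate from the Tika calls would let both options default to the current behaviour (no inline image extraction) when they are not configured.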

Tika OCR is based on tesseract, an open-source OCR library initially developed 
by Hewlett-Packard and later continued by Google. It is available from 
https://github.com/tesseract-ocr/tesseract



  was:
For a client project I needed to enable OCR for images inside PDFs. 
Unfortunately ManifoldCF does not provide configuration options to handle this. 
It would be nice to have these options for the Tika content extraction:

1.      Enable PDF image extraction for OCR: 
https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
2.      Set default language for tesseract: 
https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29

Tika OCR is based on tesseract, an Open Source OCR library intially developed 
by Hewlett-Packard and continued


> Additional TikaOCR Configuration Options
> ----------------------------------------
>
>                 Key: CONNECTORS-1287
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Konrad Holl
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.4
>
>
> For a client project I needed to enable OCR for images inside PDFs. 
> Unfortunately ManifoldCF does not provide configuration options to handle 
> this. It would be nice to have these options for the Tika content extraction:
> 1.    Enable PDF image extraction for OCR: 
> https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
> 2.    Set default language for tesseract: 
> https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
> Tika OCR is based on tesseract, an open-source OCR library initially developed 
> by Hewlett-Packard and later continued by Google. It is available from 
> https://github.com/tesseract-ocr/tesseract



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
