[
https://issues.apache.org/jira/browse/NIFI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248878#comment-15248878
]
Jeremy Dyer commented on NIFI-1718:
-----------------------------------
[~dgoldenberg] I came to create a jira for a NiFi Tesseract processor today and
stumbled across this jira. Seems I'm a few days late. I have already created a
pure Tesseract processor that accounts for all of the bullet points you listed
(plus the ability to pass in raw configuration key/values), but it doesn't use
Tika as you have described here. I would be glad to contribute what I have but
wanted to run it by you first, since you specifically called out Tika and I'm not
using that. Would it be a big deal if my implementation didn't use Tika
explicitly, or do you need that for something else?
Just for reference, here is a quick screen recording of what I have so far:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
> Processor(s) to perform OCR
> ---------------------------
>
> Key: NIFI-1718
> URL: https://issues.apache.org/jira/browse/NIFI-1718
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Reporter: Dmitry Goldenberg
>
> This ticket is a follow-up to NIFI-1717.
> Apache Tika by default performs OCR on image files such as PNG, BMP, JPEG,
> GIF, etc. using Tesseract, assuming that it is installed and properly
> configured.
> Design issue: should ExtractMediaAttributes processor allow Tika to perform
> OCR or should OCR be handled elsewhere, whether by a processor or by a
> service? Could both models be allowed, where ExtractMediaAttributes supports
> OCR but there's also a separate PerformOCR processor and/or service?
> If OCR is supported on the ExtractMediaAttributes processor, it'd be best if
> it supported the following OCR-related options (which are exposed by Tika's
> TesseractOCRConfig class):
> * tesseractPath - Path to tesseract installation folder, if not on system
> path.
> * language - Language ID (e.g. "eng"); language dictionary to be used.
> * pageSegMode - Tesseract page segmentation mode, defaults to 1.
> * minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> * maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to
> Integer.MAX_VALUE.
> * timeout - Maximum time (in seconds) to wait for the OCR process
> termination; defaults to 120.
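
For illustration only, the options listed above could be carried on a processor
as a small value holder along these lines. This is a rough sketch, not Tika's
TesseractOCRConfig and not the processor mentioned in the comment: the class
name OcrOptions, the shouldOcr helper, the "eng" default and the inclusive size
bounds are all assumptions; the other defaults are taken from the list above.

```java
// Hypothetical holder for the OCR options listed in the ticket description.
// NOT Tika's TesseractOCRConfig; field names mirror the bullet list above.
final class OcrOptions {
    // Empty string means "rely on the system path" (assumption).
    String tesseractPath = "";
    // The ticket gives "eng" only as an example; using it as a default is an assumption.
    String language = "eng";
    // Defaults below are the ones stated in the ticket description.
    int pageSegMode = 1;
    long minFileSizeToOcr = 0L;
    long maxFileSizeToOcr = Integer.MAX_VALUE;
    int timeoutSeconds = 120;

    // Returns true when a file of the given size falls inside the configured
    // bounds. Treating both bounds as inclusive is an assumption.
    boolean shouldOcr(long fileSizeBytes) {
        return fileSizeBytes >= minFileSizeToOcr
            && fileSizeBytes <= maxFileSizeToOcr;
    }
}
```

A processor could check shouldOcr against the flow file size before handing the
content to Tesseract, and skip or route the file elsewhere when it returns false.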