[jira] [Commented] (NIFI-1815) Tesseract OCR Processor

Jeremy Dyer (JIRA) Mon, 16 May 2016 05:58:08 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284460#comment-15284460
 ]


Jeremy Dyer commented on NIFI-1815:
-----------------------------------

Your changes look good to me. I like the idea of using the hashmap instead of 
an array. However, could we change the default PAGE_SEGMENTATION_MODE to "3" 
instead of "0"? "Fully automatic page segmentation, but no OSD" is the true 
Tesseract default page segmentation mode, so it would make sense to stay 
aligned with that. Other than that it looks good! Thanks! Once you make the 
change to the patch, I'll merge your patch into the existing Git PR, resolve 
conflicts, and push all the changes up to Git.
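
To illustrate the suggestion, here is a minimal sketch of a map-backed list of page segmentation modes with "3" as the default. The class and member names (PageSegModes, DEFAULT_MODE, describe) are hypothetical and not taken from the patch; only modes 0 through 3 are listed, with descriptions as printed by the Tesseract command-line help.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; names are illustrative and not taken from the patch.
final class PageSegModes {

    // Default proposed in this comment: Tesseract's own default mode.
    static final String DEFAULT_MODE = "3";

    // Mode ID -> description, kept in insertion order for display.
    static final Map<String, String> MODES;
    static {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("0", "Orientation and script detection (OSD) only");
        m.put("1", "Automatic page segmentation with OSD");
        m.put("2", "Automatic page segmentation, but no OSD, or OCR");
        m.put("3", "Fully automatic page segmentation, but no OSD");
        MODES = Collections.unmodifiableMap(m);
    }

    // Returns the description for a mode ID, falling back to the
    // default mode's description when the ID is unknown.
    static String describe(String mode) {
        return MODES.getOrDefault(mode, MODES.get(DEFAULT_MODE));
    }
}
```

A map keyed by mode ID avoids relying on array positions, so adding or reordering entries does not change what a given ID means.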

> Tesseract OCR Processor
> -----------------------
>
>                 Key: NIFI-1815
>                 URL: https://issues.apache.org/jira/browse/NIFI-1815
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Jeremy Dyer
>            Assignee: Jeremy Dyer
>         Attachments: 0006-changes-to-the-OCR-processor.patch
>
>
> This ticket is a follow-up to NIFI-1718, minus the use of the Tika library.
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency, care must 
> be taken to ensure that the overall NiFi cluster continues to function 
> properly in the absence of the Tesseract system dependency, even though the 
> OCR processor itself will be unable to perform its duties. In the event that 
> the system dependencies are not detected, the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties exposed to configure Tesseract:
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.
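
The requirement quoted above, that a missing Tesseract installation should produce a warning rather than a failure, could be sketched roughly as follows. This is plain Java under stated assumptions, not the processor's actual code: the class name TesseractLocator, the method check, and the warning text are hypothetical, and wiring the result into NiFi's validation is not shown.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper; names are illustrative and not taken from the patch.
final class TesseractLocator {

    // Returns an empty Optional when an executable named "tesseract" (or
    // "tesseract.exe") is found, otherwise a warning message. A missing
    // installation never raises an exception, so the caller can show a
    // warning and keep running.
    static Optional<String> check(String tesseractPath, String systemPath) {
        List<String> dirs = new ArrayList<>();
        if (tesseractPath != null && !tesseractPath.trim().isEmpty()) {
            // An explicit installation folder takes precedence over the system path.
            dirs.add(tesseractPath.trim());
        } else if (systemPath != null) {
            for (String dir : systemPath.split(File.pathSeparator)) {
                if (!dir.isEmpty()) {
                    dirs.add(dir);
                }
            }
        }
        for (String dir : dirs) {
            for (String name : new String[] {"tesseract", "tesseract.exe"}) {
                File candidate = new File(dir, name);
                if (candidate.isFile() && candidate.canExecute()) {
                    return Optional.empty();
                }
            }
        }
        return Optional.of("Tesseract executable not found; OCR will be unavailable");
    }
}
```

A caller might invoke TesseractLocator.check(configuredPath, System.getenv("PATH")) and report any returned message as a validation warning instead of throwing.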



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
