[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265779#comment-15265779
 ] 

Karthik Narayanan commented on NIFI-1815:
-----------------------------------------

Jeremy, one quick thing I noticed is that you use static fields for 
SUPPORTED_LANGUAGES. Would that cause issues if we have two flows and a user 
is trying to test the next version of Tesseract, installed in a different 
path, maybe with a new language? Would that affect the other flow? Would it be 
better to declare these as non-static? This is a question more for learning 
purposes than a correction.
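
To make the static-field question concrete, here is a minimal sketch of how a 
static collection is shared across instances while an instance field is not. 
The class OcrProcessorSketch and its field names are hypothetical stand-ins, 
not the code from the patch, and the sketch assumes both instances are loaded 
by the same classloader.

```java
import java.util.ArrayList;
import java.util.List;

public class StaticFieldDemo {
    // Hypothetical stand-in for a processor class; not the actual NiFi processor.
    public static class OcrProcessorSketch {
        // One list shared by every instance created from this class.
        public static final List<String> SHARED_LANGUAGES = new ArrayList<>();
        // One list per instance.
        public final List<String> ownLanguages = new ArrayList<>();

        // Records a discovered language in both the shared and the per-instance list.
        public void discover(String language) {
            SHARED_LANGUAGES.add(language);
            ownLanguages.add(language);
        }
    }

    public static void main(String[] args) {
        OcrProcessorSketch flowA = new OcrProcessorSketch();
        OcrProcessorSketch flowB = new OcrProcessorSketch();
        flowA.discover("eng");
        flowB.discover("new-lang");
        // The static list now holds entries from both instances.
        System.out.println("shared: " + OcrProcessorSketch.SHARED_LANGUAGES);
        // Each instance list holds only what that instance discovered.
        System.out.println("flowA:  " + flowA.ownLanguages);
        System.out.println("flowB:  " + flowB.ownLanguages);
    }
}
```

If the supported languages depend on a per-processor Tesseract path, the 
per-instance form keeps one flow's discovery from leaking into another.
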
Also, the PAGE_SEG_MODE values are held in an array. Could they be a map 
instead, like so:
Map<String,Integer> PAGE_SEGMENTATION_MODES = new HashMap<String,Integer>();
PAGE_SEGMENTATION_MODES.put("0 = Orientation and script detection (OSD) only", 0);
PAGE_SEGMENTATION_MODES.put("1 = Automatic page segmentation with OSD", 1);
If we could then pass this map to the PropertyDescriptor builder, it could 
automatically use the key for display and the value for setting the property 
value.
If this is not already implemented, I can implement it with some guidance.
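
As a rough illustration of the map idea, the sketch below turns a 
label-to-value map into a list of value/display-name pairs. Choice is a 
hypothetical stand-in so the example runs without NiFi on the classpath. In 
NiFi itself, AllowableValue already pairs a value with a display name and can 
be passed to PropertyDescriptor.Builder via allowableValues(...), so a 
conversion like this may be all that is needed. LinkedHashMap is used instead 
of HashMap on the assumption that the modes should be listed in insertion 
order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PageSegModeSketch {
    // Hypothetical stand-in for a value/display-name pair; not a NiFi class.
    public static final class Choice {
        public final String value;
        public final String displayName;

        public Choice(String value, String displayName) {
            this.value = value;
            this.displayName = displayName;
        }
    }

    // Builds the label-to-mode map; LinkedHashMap preserves insertion order.
    public static Map<String, Integer> pageSegmentationModes() {
        Map<String, Integer> modes = new LinkedHashMap<>();
        modes.put("0 = Orientation and script detection (OSD) only", 0);
        modes.put("1 = Automatic page segmentation with OSD", 1);
        return modes;
    }

    // Map key becomes the display name, map value becomes the stored property value.
    public static List<Choice> toChoices(Map<String, Integer> modes) {
        List<Choice> choices = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : modes.entrySet()) {
            choices.add(new Choice(String.valueOf(entry.getValue()), entry.getKey()));
        }
        return choices;
    }

    public static void main(String[] args) {
        for (Choice choice : toChoices(pageSegmentationModes())) {
            System.out.println(choice.value + " -> " + choice.displayName);
        }
    }
}
```
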

> Tesseract OCR Processor
> -----------------------
>
>                 Key: NIFI-1815
>                 URL: https://issues.apache.org/jira/browse/NIFI-1815
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Jeremy Dyer
>            Assignee: Jeremy Dyer
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties exposed to configure Tesseract:
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
