[ 
https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15339892#comment-15339892
 ] 

Oleg Zhurakousky commented on NIFI-1815:
----------------------------------------

Guys,
We may need to rethink this a bit.
While the current PR looks good and the processor does what it is supposed to do, 
tess4j pulls in several libraries under licenses that are not compatible with ASF 
policies. Below are the libraries in question and their licenses:
{code}
itext - GNU Affero General Public License v3
ghost4j - GNU Lesser General Public License
jai* - BSD 3-clause License
rococoa - LGPL 3
{code}
One possible solution is to employ the same model that is used by the JMS and 
Spring components of NiFi, where the end user is responsible for pointing to the 
libraries as part of the configuration, so the distribution does not need to 
bundle them.
[~jeremy.dyer] ping me and we can discuss how to accomplish this. As I said, 
we're already doing it in both the JMS and Spring components, so it is not the 
most complex issue.
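
To make the "user points to the libraries" idea concrete, here is a minimal sketch. It assumes the processor would expose a property holding a directory of jars that the user downloaded themselves. The class and method names (ExternalLibraryLoader, listJars, forDirectory) are hypothetical and are not taken from NiFi, tess4j, or the existing JMS and Spring components, which may do this differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of NiFi or tess4j.
final class ExternalLibraryLoader {

    private ExternalLibraryLoader() {}

    /** Returns the *.jar files directly inside dir, sorted by path. */
    static List<Path> listJars(Path dir) {
        if (!Files.isDirectory(dir)) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }
        List<Path> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : stream) {
                jars.add(jar);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(jars);
        return jars;
    }

    /** Builds a class loader over the user-supplied jars; parent is consulted first. */
    static URLClassLoader forDirectory(Path dir, ClassLoader parent) {
        List<Path> jars = listJars(dir);
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < urls.length; i++) {
            try {
                urls[i] = jars.get(i).toUri().toURL();
            } catch (MalformedURLException e) {
                throw new UncheckedIOException(e);
            }
        }
        return new URLClassLoader(urls, parent);
    }
}
```

With something along these lines, the restrictively licensed jars never ship in the distribution. They are only resolved at runtime from wherever the user placed them.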

> Tesseract OCR Processor
> -----------------------
>
>                 Key: NIFI-1815
>                 URL: https://issues.apache.org/jira/browse/NIFI-1815
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Jeremy Dyer
>            Assignee: Jeremy Dyer
>         Attachments: 0006-changes-to-the-OCR-processor.patch, 
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718 minus the use of the Tika library.
> Expose OCR capabilities through a new processor which uses the Tesseract 
> library. Use of this processor would require that Tesseract be installed on 
> the NiFi host. Since the processor will have a system dependency, care must be 
> taken to ensure that the overall NiFi cluster continues to function properly 
> in the absence of the Tesseract system dependency even though the OCR 
> processor itself will be unable to perform its duties. In the event that the 
> system dependencies are not detected the processor should display a 
> validation warning rather than failing or preventing the NiFi instance from 
> booting properly.
> Properties exposed to configure Tesseract:
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination; 
> defaults to 120.
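
As an illustration of the "validation warning rather than failing" behaviour described in the ticket, the following is a rough sketch. It assumes the check is a simple file lookup for the Tesseract binary. The class name TesseractSettings and its methods are hypothetical and are not the code in the attached patch. The default values are the ones listed in the ticket.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; names are illustrative and not taken from the patch.
final class TesseractSettings {

    // Defaults as listed in the ticket description.
    static final int DEFAULT_PAGE_SEG_MODE = 1;
    static final long DEFAULT_MIN_FILE_SIZE = 0L;
    static final long DEFAULT_MAX_FILE_SIZE = Integer.MAX_VALUE;
    static final int DEFAULT_TIMEOUT_SECONDS = 120;

    private TesseractSettings() {}

    /**
     * Looks for a "tesseract" binary in tesseractPath if one is given,
     * otherwise in each directory of searchPath (e.g. the PATH variable).
     * This only checks that the file exists, not that it actually runs.
     */
    static Optional<Path> findExecutable(String tesseractPath, String searchPath) {
        List<String> dirs = new ArrayList<>();
        if (tesseractPath != null && !tesseractPath.isEmpty()) {
            dirs.add(tesseractPath);
        } else if (searchPath != null) {
            for (String dir : searchPath.split(File.pathSeparator)) {
                if (!dir.isEmpty()) {
                    dirs.add(dir);
                }
            }
        }
        for (String dir : dirs) {
            for (String name : new String[] {"tesseract", "tesseract.exe"}) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip malformed search path entries.
                }
            }
        }
        return Optional.empty();
    }

    /** Collects warnings instead of throwing, so a missing binary cannot break startup. */
    static List<String> validate(String tesseractPath, String searchPath,
                                 long minSize, long maxSize, int timeoutSeconds) {
        List<String> warnings = new ArrayList<>();
        if (!findExecutable(tesseractPath, searchPath).isPresent()) {
            warnings.add("Tesseract executable not found; OCR will be unavailable");
        }
        if (minSize < 0 || maxSize < minSize) {
            warnings.add("File size range is invalid");
        }
        if (timeoutSeconds <= 0) {
            warnings.add("Timeout must be positive");
        }
        return warnings;
    }
}
```

A caller could pass System.getenv("PATH") as searchPath and surface any returned strings as validation messages on the processor, leaving the rest of the instance unaffected.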



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
