[ https://issues.apache.org/jira/browse/NIFI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303264#comment-15303264 ]
ASF GitHub Bot commented on NIFI-1815:
--------------------------------------
Github user jdye64 commented on a diff in the pull request:
https://github.com/apache/nifi/pull/397#discussion_r64843410
--- Diff: nifi-nar-bundles/nifi-ocr-bundle/nifi-ocr-processors/src/main/java/org/apache/nifi/processors/ocr/TesseractOCRProcessor.java ---
@@ -0,0 +1,361 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nifi.processors.ocr;
+
+import net.sourceforge.tess4j.ITesseract;
+import net.sourceforge.tess4j.Tesseract;
+import net.sourceforge.tess4j.TesseractException;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.nifi.annotation.behavior.InputRequirement;
+import org.apache.nifi.annotation.lifecycle.OnScheduled;
+import org.apache.nifi.components.PropertyDescriptor;
+import org.apache.nifi.components.AllowableValue;
+import org.apache.nifi.components.ValidationContext;
+import org.apache.nifi.components.ValidationResult;
+import org.apache.nifi.components.Validator;
+import org.apache.nifi.flowfile.FlowFile;
+import org.apache.nifi.annotation.documentation.CapabilityDescription;
+import org.apache.nifi.annotation.documentation.Tags;
+
+import org.apache.nifi.processor.AbstractProcessor;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.nifi.processor.io.StreamCallback;
+import org.apache.nifi.processor.util.StandardValidators;
+import org.apache.nifi.processor.Relationship;
+import org.apache.nifi.processor.ProcessorInitializationContext;
+import org.apache.nifi.processor.ProcessContext;
+import org.apache.nifi.processor.ProcessSession;
+
+import javax.imageio.ImageIO;
+import java.awt.image.BufferedImage;
+import java.io.InputStream;
+import java.io.IOException;
+import java.io.OutputStream;
+import java.io.File;
+import java.io.FileFilter;
+import java.util.Collections;
+import java.util.List;
+import java.util.Set;
+import java.util.Map;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.ArrayList;
+import java.util.concurrent.atomic.AtomicBoolean;
+
+@Tags({"ocr", "tesseract", "image", "text"})
+@InputRequirement(InputRequirement.Requirement.INPUT_REQUIRED)
+@CapabilityDescription("Extracts text from images using Optical Character Recognition (OCR). The images are pulled from the incoming" +
+ " Flowfile's content. Supported image types are TIFF, JPEG, GIF, PNG, BMP, and PDF. Any Flowfile that doesn't contain" +
+ " a supported image type in its content body will be routed to the 'unsupported image format' relationship and no OCR will be performed." +
+ " This processor uses Tesseract to perform its duties and part of that requires that a valid Tesseract data (Tessdata) directory" +
+ " be specified in the 'Tessdata Directory' Property. This processor considers a valid Tessdata directory to be an existing directory on the" +
+ " local NiFi instance that contains one or more files ending with the '.traineddata' extension. The list of supported languages" +
+ " is built from the Tessdata directory configured by listing all files ending with '.traineddata' and considering those" +
+ " Tesseract language models. You can create your own Tesseract language models and place them in your Tessdata directory" +
+ " and the processor will display them in the dropdown list of languages available. All valid Tesseract configuration values" +
+ " may be passed to this processor by use of the 'Tesseract configuration values' which accepts a comma separated list" +
+ " of key=value pairs representing Tesseract configurations. 'Tesseract configuration values' is where all of your tuning" +
+ " values can be passed in to help increase the accuracy of your OCR operations based on your expected input images." +
+ " TesseractOCRProcessor only supports installations of Tesseract version 3.0 and greater.")
+public class TesseractOCRProcessor extends AbstractProcessor {
+
+ public static Set<String> SUPPORTED_LANGUAGES;
+ private static final String TESS_LANG_EXTENSION = ".traineddata";
+ private static List<AllowableValue> PAGE_SEGMENTATION_MODES;
+ private static ITesseract tessInstance;
+ private List<PropertyDescriptor> descriptors;
+ private Set<Relationship> relationships;
+
+ static {
+ SUPPORTED_LANGUAGES = new HashSet<String>();
+ SUPPORTED_LANGUAGES.add("eng"); //Since this is the default value we need to ensure it is present in the allowableValues.
+
+ PAGE_SEGMENTATION_MODES = new ArrayList<AllowableValue>();
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("0","0 = Orientation and script detection (OSD) only"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("1","1 = Automatic page segmentation with OSD"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("2","2 = Automatic page segmentation, but no OSD, or OCR"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("3","3 = Fully automatic page segmentation, but no OSD"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("4","4 = Assume a single column of text of variable sizes"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("5","5 = Assume a single uniform block of vertically aligned text"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("6","6 = Assume a single uniform block of text"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("7","7 = Treat the image as a single text line"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("8","8 = Treat the image as a single word"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("9","9 = Treat the image as a single word in a circle"));
+ PAGE_SEGMENTATION_MODES.add(new AllowableValue("10","10 = Treat the image as a single character"));
+ }
+
+ public static final PropertyDescriptor TESS_DATA_PATH = new PropertyDescriptor
+ .Builder().name("Tessdata Directory")
+ .description("Directory on the local NiFi instance where the Tesseract languages and configurations are installed.")
+ .required(true)
+ .expressionLanguageSupported(true)
+ .defaultValue("/usr/local/Cellar/tesseract/3.04.00/share/tessdata")
+ .addValidator(StandardValidators.createDirectoryExistsValidator(true, false))
+ .addValidator(new TessdataDirectoryValidator())
+ .build();
+
+ /**
+ * Validates the TessData directory by ensuring that the specified directory exists and also that at least
+ * one language is present. A language file ends with TESS_LANG_EXTENSION
+ */
+ public static class TessdataDirectoryValidator implements Validator {
+
+ @Override
+ public ValidationResult validate(final String subject, final String value, final ValidationContext context) {
+ if (context.isExpressionLanguageSupported(subject) && context.isExpressionLanguagePresent(value)) {
+ return new ValidationResult.Builder()
+ .subject(subject).input(value).explanation("Expression Language Present").valid(true).build();
+ }
+
+ String reason = null;
+ try {
+ //There must be languages present to ensure the Tessdata directory is valid.
+ File[] languages = getTesseractLanguages(value);
+ if (languages == null || languages.length == 0) {
+ reason = "No valid languages found in directory. Languages end with '" + TESS_LANG_EXTENSION + "'";
+ }
+ } catch (final Exception e) {
+ reason = "Value is not a valid directory name";
+ }
+
+ return new ValidationResult.Builder().subject(subject).input(value).explanation(reason).valid(reason == null).build();
+ }
+ }
+
+ public static final PropertyDescriptor TESSERACT_LANGUAGE = new PropertyDescriptor
+ .Builder().name("Tesseract Language")
+ .description("Language that Tesseract will use to perform OCR on the image in the incoming FlowFile's content")
+ .required(true)
+ .defaultValue(SUPPORTED_LANGUAGES.iterator().next())
+ .allowableValues(SUPPORTED_LANGUAGES)
+ .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+ .build();
+
+ public static final PropertyDescriptor TESSERACT_PAGE_SEG_MODE = new PropertyDescriptor
+ .Builder().name("Tesseract Page Segmentation Mode")
+ .description("Set Tesseract to only run a subset of layout analysis and assume a certain form of image.")
+ .required(true)
+ .defaultValue(PAGE_SEGMENTATION_MODES.get(3).getValue())
+ .allowableValues(PAGE_SEGMENTATION_MODES)
+ .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+ .build();
+
+ public static final PropertyDescriptor TESSERACT_CONFIGS = new PropertyDescriptor
+ .Builder().name("Tesseract configuration values")
+ .description("Comma separated list of key=value pairs that will be used to configure the Tesseract instance." +
+ " If a Tesseract configuration file is specified that will take precedence over these configurations. Values" +
+ " placed into this property will not be validated so take care to pass only valid Tesseract configuration values." +
+ " EX: textord_min_linesize=3.25,tessedit_write_images=true")
+ .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+ .build();
+
+ public static final Relationship REL_SUCCESS = new Relationship.Builder()
+ .name("success")
+ .description("successfully completed OCR on image")
+ .build();
+
+ public static final Relationship REL_UNSUPPORTED_IMAGE_FORMAT = new Relationship.Builder()
+ .name("unsupported image format")
+ .description("The image format in the FlowFile content is not supported by Tesseract")
+ .build();
+
+ public static final Relationship REL_ORIGINAL = new Relationship.Builder()
+ .name("original")
+ .description("The original image that OCR was performed on")
+ .build();
+
+ public static final Relationship REL_FAILURE = new Relationship.Builder()
+ .name("failure")
+ .description("Failed to attempt OCR on input image")
+ .build();
+
+
+ @Override
+ protected void init(final ProcessorInitializationContext context) {
+ final List<PropertyDescriptor> descriptors = new ArrayList<>();
+ descriptors.add(TESS_DATA_PATH);
+ descriptors.add(TESSERACT_LANGUAGE);
+ descriptors.add(TESSERACT_PAGE_SEG_MODE);
+ descriptors.add(TESSERACT_CONFIGS);
+ this.descriptors = Collections.unmodifiableList(descriptors);
+
+ final Set<Relationship> relationships = new HashSet<>();
+ relationships.add(REL_SUCCESS);
+ relationships.add(REL_FAILURE);
+ relationships.add(REL_UNSUPPORTED_IMAGE_FORMAT);
+ relationships.add(REL_ORIGINAL);
+ this.relationships = Collections.unmodifiableSet(relationships);
+
+ }
+
+ @Override
+ public Set<Relationship> getRelationships() {
+ return this.relationships;
+ }
+
+ @Override
+ public final List<PropertyDescriptor> getSupportedPropertyDescriptors() {
+
+ List<PropertyDescriptor> descriptorsNew = new ArrayList<>();
+
+ descriptorsNew.add(TESS_DATA_PATH);
+ descriptorsNew.add(new PropertyDescriptor.Builder()
+ .fromPropertyDescriptor(TESSERACT_LANGUAGE)
+ .allowableValues(SUPPORTED_LANGUAGES)
+ .build());
+ descriptorsNew.add(TESSERACT_PAGE_SEG_MODE);
+ descriptorsNew.add(TESSERACT_CONFIGS);
+
+ return descriptorsNew;
+ }
--- End diff --
@olegz the intention here was to dynamically load the languages installed in
the Tesseract data directory whenever the "configure" dialog is displayed in
the NiFi UI. Since the property is a list of allowable values, the idea was to
reload them from the Tesseract data dir each time "configure" is clicked, so
that the user always sees the latest languages installed on the OS.
Does that make sense?
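
For illustration only, here is a minimal sketch of the directory scan described
above, written against the plain JDK rather than the NiFi API. The class
TessdataLanguages and its methods are hypothetical names, not part of the pull
request (the diff calls a getTesseractLanguages helper whose body is not shown
in this excerpt). It assumes a language ID is simply the file name with the
'.traineddata' suffix removed.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper for illustration; not code from the pull request.
class TessdataLanguages {

    static final String TESS_LANG_EXTENSION = ".traineddata";

    // Derives language IDs from file names, e.g. "eng.traineddata" -> "eng".
    // Names without the extension, or consisting only of it, are skipped.
    static SortedSet<String> fromFileNames(Collection<String> fileNames) {
        SortedSet<String> languages = new TreeSet<>();
        for (String name : fileNames) {
            if (name.endsWith(TESS_LANG_EXTENSION)
                    && name.length() > TESS_LANG_EXTENSION.length()) {
                languages.add(name.substring(0, name.length() - TESS_LANG_EXTENSION.length()));
            }
        }
        return languages;
    }

    // Re-reads the directory on every call, so models installed since the
    // previous call are picked up. Returns an empty set if the path is not
    // a readable directory.
    static SortedSet<String> list(File tessdataDir) {
        String[] names = tessdataDir.list(); // null when not a directory or on I/O error
        if (names == null) {
            return new TreeSet<>();
        }
        return fromFileNames(Arrays.asList(names));
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println(list(dir));
    }
}
```

Calling list() each time the allowable values are built trades a small amount
of file-system I/O for an always-current language list, which is the behaviour
the comment above is aiming for.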
> Tesseract OCR Processor
> -----------------------
>
> Key: NIFI-1815
> URL: https://issues.apache.org/jira/browse/NIFI-1815
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Jeremy Dyer
> Assignee: Jeremy Dyer
> Attachments: 0006-changes-to-the-OCR-processor.patch,
> nifi_1815_1.x_patch.zip
>
>
> This ticket is a follow-up to NIFI-1718, minus the use of the Tika library.
> Expose OCR capabilities through a new processor which uses the Tesseract
> library. Use of this processor would require that Tesseract be installed on
> the NiFi host. Since the processor will have a system dependency care must be
> taken to ensure that the overall NiFi cluster continues to function properly
> in the absence of the Tesseract system dependency even though the OCR
> processor itself will be unable to perform its duties. In the event that the
> system dependencies are not detected the processor should display a
> validation warning rather than failing or preventing the NiFi instance from
> booting properly.
> Properties exposed to configure Tesseract:
> tesseractPath - Path to tesseract installation folder, if not on system path.
> language - Language ID (e.g. "eng"); language dictionary to be used.
> pageSegMode - Tesseract page segmentation mode, defaults to 1.
> minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to
> Integer.MAX_VALUE.
> timeout - Maximum time (in seconds) to wait for the OCR process termination;
> defaults to 120.