[
https://issues.apache.org/jira/browse/NIFI-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545378#comment-16545378
]
ASF GitHub Bot commented on NIFI-5406:
--------------------------------------
Github user markap14 commented on a diff in the pull request:
https://github.com/apache/nifi/pull/2876#discussion_r202720481
--- Diff:
nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListedEntityTracker.java
---
@@ -0,0 +1,322 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.processor.util.list;
+
+import com.fasterxml.jackson.core.type.TypeReference;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.nifi.components.AllowableValue;
+import org.apache.nifi.components.PropertyDescriptor;
+import org.apache.nifi.components.ValidationContext;
+import org.apache.nifi.components.ValidationResult;
+import org.apache.nifi.components.state.Scope;
+import org.apache.nifi.distributed.cache.client.Deserializer;
+import org.apache.nifi.distributed.cache.client.DistributedMapCacheClient;
+import org.apache.nifi.distributed.cache.client.Serializer;
+import org.apache.nifi.expression.ExpressionLanguageScope;
+import org.apache.nifi.flowfile.FlowFile;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.ProcessContext;
+import org.apache.nifi.processor.ProcessSession;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.nifi.processor.util.StandardValidators;
+import org.apache.nifi.util.StringUtils;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Consumer;
+import java.util.function.Function;
+import java.util.function.Supplier;
+import java.util.stream.Collectors;
+import java.util.zip.GZIPInputStream;
+import java.util.zip.GZIPOutputStream;
+
+import static java.lang.String.format;
+import static
org.apache.nifi.processor.util.list.AbstractListProcessor.REL_SUCCESS;
+
+public class ListedEntityTracker<T extends ListableEntity> {
+
+ private ObjectMapper objectMapper = new ObjectMapper();
--- End diff --
Should probably mark member variable as `final` if not changing.
> Add new listing strategy by tracking listed entities to ListXXXX processors
> ---------------------------------------------------------------------------
>
> Key: NIFI-5406
> URL: https://issues.apache.org/jira/browse/NIFI-5406
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Koji Kawamura
> Assignee: Koji Kawamura
> Priority: Major
>
> Current List processors (ListFile, ListFTP, ListSFTP ... etc) implementation
> relies on file last modified timestamp to pick new or updated files. This
> approach is efficient and lightweight in terms of state management, because
> it only tracks latest modified timestamp and last executed timestamp.
> However, timestamps do not work as expected in some file systems, causing
> List processors missing files periodically. See NIFI-3332 comments for
> details.
> In order to pick every entity that has not seen before or has been updated
> since it had seen last time, we need another set of processors using
> different approach, that is by tracking listed entities:
> * Add new abstract processor AbstractWatchEntries similar to
> AbstractListProcessor but uses different approach
> * Target entities have: name (path), size and last-modified-timestamp
> * Implementation Processors have following properties:
> ** 'Watch Time Window' to limit the maximum time period to hold the already
> listed entries. E.g. if set as '30min', the processor keeps entities listed
> in the last 30 mins.
> ** 'Minimum File Age' to defer listing entities potentially being written
> * Any entity added but not listed ever having last-modified-timestamp older
> than configured 'Watch Time Window' will not be listed. If user needs to pick
> these items, they have to make 'Watch Time Window' longer. It also increases
> the size of data the processor has to persist in the K/V store. Efficiency vs
> reliability trade-off.
> * The already-listed entities are persisted into one of supported K/V store
> through DistributedMapCacheClient service. User can chose what KVS to use
> from HBase, Redis, Couchbase and File (DistributedMapCacheServer with
> persistence file).
> * The reason to use KVS instead of ManagedState is, to avoid hammering
> Zookeeper too much with frequently updating Zk node with large amount of
> data. The number of already-listed entries can be huge depending on
> use-cases. Also, we can compress entities with DistributedMapCacheClient as
> it supports putting byte array, while ManagedState only supports Map<String,
> String>.
> * On each onTrigger:
> ** Processor performs listing. Listed entries meeting any of the following
> condition will be written to the 'success' output FlowFile:
> *** Not exists in the already-listed entities
> *** Having newer last-modified-timestamp
> *** Having different size
> ** Already listed entries those are old enough compared to 'Watch Time
> Window' are discarded from the already-listed entries.
> * Initial supporting target is Local file system, FTP and SFTP
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)