[ https://issues.apache.org/jira/browse/NIFI-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546068#comment-16546068 ]
ASF GitHub Bot commented on NIFI-5406:
--------------------------------------

Github user ijokarumawak commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2876#discussion_r202901813

    --- Diff: nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListedEntityTracker.java ---
    @@ -0,0 +1,322 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nifi.processor.util.list;
    +
    +import com.fasterxml.jackson.core.type.TypeReference;
    +import com.fasterxml.jackson.databind.ObjectMapper;
    +import org.apache.nifi.components.AllowableValue;
    +import org.apache.nifi.components.PropertyDescriptor;
    +import org.apache.nifi.components.ValidationContext;
    +import org.apache.nifi.components.ValidationResult;
    +import org.apache.nifi.components.state.Scope;
    +import org.apache.nifi.distributed.cache.client.Deserializer;
    +import org.apache.nifi.distributed.cache.client.DistributedMapCacheClient;
    +import org.apache.nifi.distributed.cache.client.Serializer;
    +import org.apache.nifi.expression.ExpressionLanguageScope;
    +import org.apache.nifi.flowfile.FlowFile;
    +import org.apache.nifi.logging.ComponentLog;
    +import org.apache.nifi.processor.ProcessContext;
    +import org.apache.nifi.processor.ProcessSession;
    +import org.apache.nifi.processor.exception.ProcessException;
    +import org.apache.nifi.processor.util.StandardValidators;
    +import org.apache.nifi.util.StringUtils;
    +
    +import java.io.ByteArrayInputStream;
    +import java.io.IOException;
    +import java.nio.charset.StandardCharsets;
    +import java.util.Collection;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.concurrent.TimeUnit;
    +import java.util.function.Consumer;
    +import java.util.function.Function;
    +import java.util.function.Supplier;
    +import java.util.stream.Collectors;
    +import java.util.zip.GZIPInputStream;
    +import java.util.zip.GZIPOutputStream;
    +
    +import static java.lang.String.format;
    +import static org.apache.nifi.processor.util.list.AbstractListProcessor.REL_SUCCESS;
    +
    +public class ListedEntityTracker<T extends ListableEntity> {
    +
    +    private ObjectMapper objectMapper = new ObjectMapper();
    +    private volatile Map<String, ListedEntity> alreadyListedEntities;
    --- End diff --

    Updated to use ConcurrentHashMap. Thanks.

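As a rough illustration of the point behind this review comment (not the code from the pull request), the sketch below shows why a ConcurrentHashMap is the safer choice for a map that several threads may update: `volatile` only publishes reassignment of the field, not changes made inside a plain HashMap. The class name `ConcurrentTrackingSketch`, the `track` method and the use of a bare `Long` timestamp as the value are assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the tracker's state; not taken from the pull request. */
final class ConcurrentTrackingSketch {

    // A volatile reference would only make reassignment of the field visible to other
    // threads. ConcurrentHashMap makes the individual updates themselves thread-safe.
    private final Map<String, Long> alreadyListed = new ConcurrentHashMap<>();

    /** Records the entity and returns true if it is new or has a newer timestamp. */
    boolean track(String identifier, long lastModified) {
        final boolean[] updated = {false};
        // compute() runs atomically per key on a ConcurrentHashMap.
        alreadyListed.compute(identifier, (key, previous) -> {
            if (previous == null || lastModified > previous) {
                updated[0] = true;
                return lastModified;
            }
            return previous;
        });
        return updated[0];
    }

    int size() {
        return alreadyListed.size();
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentTrackingSketch sketch = new ConcurrentTrackingSketch();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int offset = t * 1000;
            // Each worker tracks 1000 distinct identifiers concurrently.
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    sketch.track("file-" + (offset + i), 1L);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(sketch.size());
    }
}
```

Running `main` should report 4000 tracked entries, since no concurrent insert is lost; the same loop against a plain HashMap carries no such guarantee.
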
> Add new listing strategy by tracking listed entities to ListXXXX processors
> ---------------------------------------------------------------------------
>
>                 Key: NIFI-5406
>                 URL: https://issues.apache.org/jira/browse/NIFI-5406
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Koji Kawamura
>            Assignee: Koji Kawamura
>            Priority: Major
>
> The current List processors (ListFile, ListFTP, ListSFTP, etc.) rely on the
> file last-modified timestamp to pick new or updated files. This approach is
> efficient and lightweight in terms of state management, because it only
> tracks the latest modified timestamp and the last executed timestamp.
> However, timestamps do not work as expected in some file systems, causing
> List processors to miss files periodically. See NIFI-3332 comments for
> details.
> In order to pick every entity that has not been seen before or has been
> updated since it was last seen, we need another set of processors using a
> different approach, that is, tracking listed entities:
> * Add new abstract processor AbstractWatchEntries, similar to
> AbstractListProcessor but using a different approach
> * Target entities have: name (path), size and last-modified-timestamp
> * Implementing processors have the following properties:
> ** 'Watch Time Window' to limit the maximum time period to hold the already
> listed entries. E.g. if set to '30min', the processor keeps entities listed
> in the last 30 mins.
> ** 'Minimum File Age' to defer listing entities that are potentially still
> being written
> * Any entity that has never been listed and whose last-modified-timestamp is
> older than the configured 'Watch Time Window' will not be listed. If users
> need to pick up these items, they have to make the 'Watch Time Window'
> longer, which also increases the size of the data the processor has to
> persist in the K/V store. This is an efficiency vs reliability trade-off.
> * The already-listed entities are persisted into one of the supported K/V
> stores through the DistributedMapCacheClient service. Users can choose which
> KVS to use from HBase, Redis, Couchbase and File (DistributedMapCacheServer
> with persistence file).
> * A KVS is used instead of ManagedState to avoid hammering Zookeeper with
> frequent updates of a Zk node holding a large amount of data. The number of
> already-listed entries can be huge depending on use-cases. Also, we can
> compress entities with DistributedMapCacheClient as it supports putting a
> byte array, while ManagedState only supports Map<String, String>.
> * On each onTrigger:
> ** The processor performs a listing. Listed entries meeting any of the
> following conditions will be written to the 'success' output FlowFile:
> *** Does not exist in the already-listed entities
> *** Has a newer last-modified-timestamp
> *** Has a different size
> ** Already-listed entries that are older than the 'Watch Time Window' are
> discarded from the already-listed entries.
> * The initial supported targets are the local file system, FTP and SFTP

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
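
The per-trigger logic in the issue description can be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions, not the implementation from the pull request: the names `EntityTrackingSketch`, `Entity` and `watchWindowMillis` are hypothetical, the 'Minimum File Age' property is left out, and a plain map stands in for the state that the description says would be persisted through DistributedMapCacheClient.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the entity tracking strategy; names are not from NiFi. */
final class EntityTrackingSketch {

    /** Entity shape from the description: name (path), size, last-modified timestamp. */
    static final class Entity {
        final String name;
        final long size;
        final long lastModified;

        Entity(String name, long size, long lastModified) {
            this.name = name;
            this.size = size;
            this.lastModified = lastModified;
        }
    }

    // In-memory stand-in for the already-listed entities kept in the K/V store.
    private final Map<String, Entity> alreadyListed = new HashMap<>();
    private final long watchWindowMillis;

    EntityTrackingSketch(long watchWindowMillis) {
        this.watchWindowMillis = watchWindowMillis;
    }

    /** Returns the entities that would go to 'success' for this listing. */
    List<Entity> onTrigger(List<Entity> listing, long now) {
        final long oldestAllowed = now - watchWindowMillis;
        final List<Entity> toEmit = new ArrayList<>();
        for (Entity entity : listing) {
            if (entity.lastModified < oldestAllowed) {
                continue; // older than the watch time window: not listed
            }
            Entity seen = alreadyListed.get(entity.name);
            boolean changed = seen == null
                    || entity.lastModified > seen.lastModified
                    || entity.size != seen.size;
            if (changed) {
                toEmit.add(entity);
                alreadyListed.put(entity.name, entity);
            }
        }
        // Discard tracked entries that have fallen out of the watch time window.
        alreadyListed.values().removeIf(e -> e.lastModified < oldestAllowed);
        return toEmit;
    }

    int trackedCount() {
        return alreadyListed.size();
    }
}
```

Keeping the eviction step at the end of each trigger bounds the tracked state by the window length, which is the efficiency vs reliability trade-off the description mentions.
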