[ https://issues.apache.org/jira/browse/NIFI-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546070#comment-16546070 ]

ASF GitHub Bot commented on NIFI-5406:
--------------------------------------

Github user ijokarumawak commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2876#discussion_r202902116
  
    --- Diff: nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListedEntity.java ---
    @@ -0,0 +1,44 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nifi.processor.util.list;
    +
    +public class ListedEntity {
    +    /**
    +     * Milliseconds.
    +     */
    +    private long timestamp;
    +    /**
    +     * Bytes.
    +     */
    +    private long size;
    +
    +    public void setTimestamp(long timestamp) {
    --- End diff --
    
    You are right. I originally used final fields and a constructor, but
deserialization failed with a Jackson exception, so I added setters. I've now
added Jackson annotations so that a custom constructor is used for
deserialization, which makes it clear that these fields are immutable.
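
For readers following along, here is a minimal sketch of what constructor-based
Jackson deserialization of an immutable class can look like. It is not the code
from the pull request: the class name ImmutableListedEntity, the JSON property
names and the fromJson helper are assumptions made for illustration, and it
assumes jackson-databind is on the classpath.

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for the class under review, not the PR code. */
final class ImmutableListedEntity {
    /** Milliseconds. */
    private final long timestamp;
    /** Bytes. */
    private final long size;

    // @JsonCreator tells Jackson to build the object through this constructor,
    // and @JsonProperty binds each JSON field to a constructor parameter,
    // so no setters are needed and the fields can stay final.
    @JsonCreator
    public ImmutableListedEntity(@JsonProperty("timestamp") long timestamp,
                                 @JsonProperty("size") long size) {
        this.timestamp = timestamp;
        this.size = size;
    }

    public long getTimestamp() {
        return timestamp;
    }

    public long getSize() {
        return size;
    }

    /** Illustrative helper: parses a JSON object into an entity. */
    static ImmutableListedEntity fromJson(String json) {
        try {
            return new ObjectMapper().readValue(json, ImmutableListedEntity.class);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, a call such as
ImmutableListedEntity.fromJson("{\"timestamp\":1,\"size\":2}") goes through the
annotated constructor rather than through setters.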


> Add new listing strategy by tracking listed entities to ListXXXX processors
> ---------------------------------------------------------------------------
>
>                 Key: NIFI-5406
>                 URL: https://issues.apache.org/jira/browse/NIFI-5406
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Koji Kawamura
>            Assignee: Koji Kawamura
>            Priority: Major
>
> The current List processors (ListFile, ListFTP, ListSFTP, etc.) rely on the
> last-modified timestamp of files to pick up new or updated files. This
> approach is efficient and lightweight in terms of state management, because
> it only tracks the latest modified timestamp and the last executed
> timestamp. However, timestamps do not work as expected on some file systems,
> causing List processors to miss files periodically. See the NIFI-3332
> comments for details.
> In order to pick up every entity that has not been seen before, or that has
> been updated since it was last seen, we need another set of processors that
> use a different approach, namely tracking listed entities:
>  * Add a new abstract processor, AbstractWatchEntries, similar to
> AbstractListProcessor but using a different approach
>  * Target entities have: name (path), size and last-modified-timestamp
>  * Implementation processors have the following properties:
>  ** 'Watch Time Window' to limit the maximum time period for which
> already-listed entries are held. E.g. if set to '30min', the processor keeps
> entities listed in the last 30 minutes.
>  ** 'Minimum File Age' to defer listing entities that are potentially still
> being written
>  * Any entity that was added but never listed, and whose
> last-modified-timestamp is older than the configured 'Watch Time Window',
> will not be listed. If users need to pick up such items, they have to make
> 'Watch Time Window' longer, which also increases the size of the data the
> processor has to persist in the K/V store. This is an efficiency vs
> reliability trade-off.
>  * The already-listed entities are persisted into one of the supported K/V
> stores through the DistributedMapCacheClient service. Users can choose which
> KVS to use from HBase, Redis, Couchbase and File (DistributedMapCacheServer
> with a persistence file).
>  * The reason to use a KVS instead of ManagedState is to avoid hammering
> ZooKeeper by frequently updating a ZK node with a large amount of data. The
> number of already-listed entries can be huge depending on the use case.
> Also, we can compress entities with DistributedMapCacheClient, as it
> supports putting byte arrays, while ManagedState only supports Map<String,
> String>.
>  * On each onTrigger:
>  ** The processor performs a listing. Listed entries meeting any of the
> following conditions will be written to the 'success' output FlowFile:
>  *** Not present in the already-listed entities
>  *** Having a newer last-modified-timestamp
>  *** Having a different size
>  ** Already-listed entries that are old enough compared to 'Watch Time
> Window' are discarded from the already-listed entries.
>  * The initial supported targets are the local file system, FTP and SFTP
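
The onTrigger steps described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the proposed implementation:
the class and method names are hypothetical, an in-memory HashMap stands in for
the K/V store behind DistributedMapCacheClient, and 'Minimum File Age' is left
out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the entity-tracking strategy; all names are illustrative. */
final class EntityTrackerSketch {

    /** A listed entity: name (path), last-modified timestamp (ms), size (bytes). */
    static final class Entity {
        final String name;
        final long timestamp;
        final long size;

        Entity(String name, long timestamp, long size) {
            this.name = name;
            this.timestamp = timestamp;
            this.size = size;
        }
    }

    // Assumption: a plain map replaces the external K/V store.
    private final Map<String, Entity> alreadyListed = new HashMap<>();
    private final long watchTimeWindowMillis;

    EntityTrackerSketch(long watchTimeWindowMillis) {
        this.watchTimeWindowMillis = watchTimeWindowMillis;
    }

    /** Returns the entities that would be written to 'success' for this listing. */
    List<Entity> onTrigger(List<Entity> listing, long nowMillis) {
        final long oldestTracked = nowMillis - watchTimeWindowMillis;
        final List<Entity> toEmit = new ArrayList<>();
        for (Entity listed : listing) {
            if (listed.timestamp < oldestTracked) {
                continue; // older than the window: not listed
            }
            Entity seen = alreadyListed.get(listed.name);
            boolean isNew = seen == null;
            boolean isUpdated = !isNew
                    && (listed.timestamp > seen.timestamp || listed.size != seen.size);
            if (isNew || isUpdated) {
                toEmit.add(listed);
                alreadyListed.put(listed.name, listed);
            }
        }
        // Discard tracked entries that have fallen out of the window.
        alreadyListed.values().removeIf(e -> e.timestamp < oldestTracked);
        return toEmit;
    }

    int trackedCount() {
        return alreadyListed.size();
    }
}
```

Repeating the same listing emits nothing the second time, while a change in
size or a newer timestamp causes the entity to be emitted again. A longer
window keeps more entries in the map, which is the efficiency vs reliability
trade-off mentioned in the description.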



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
