[ https://issues.apache.org/jira/browse/NIFI-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539707#comment-16539707 ]
ASF GitHub Bot commented on NIFI-5406:
--------------------------------------

GitHub user ijokarumawak opened a pull request:

    https://github.com/apache/nifi/pull/2876

    NIFI-5406: [WIP] Adding WatchEntities processors.

Although it has many TODO items, I thought it's a good idea to share the new listing model early so that everyone can comment on it. Thanks.

----

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

### For all changes:
- [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [ ] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [ ] Has your PR been rebased against the latest commit within the target branch (typically master)?
- [ ] Is your initial contribution a single, squashed commit?

### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
- [ ] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in which it is rendered?
### Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ijokarumawak/nifi nifi-5406

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/2876.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2876

----
commit 9577b6985ab49ee5c0bbaa9dc42df1b3bc6987a3
Author: Koji Kawamura <ijokarumawak@...>
Date:   2018-07-10T08:00:38Z

    WIP: Adding WatchEntities processors.

----

> Add processors to list new or updated files by tracking listed entities
> -----------------------------------------------------------------------
>
>                 Key: NIFI-5406
>                 URL: https://issues.apache.org/jira/browse/NIFI-5406
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Koji Kawamura
>            Assignee: Koji Kawamura
>            Priority: Major
>
> The current List processors (ListFile, ListFTP, ListSFTP, etc.) rely on the
> file last-modified timestamp to pick new or updated files. This approach is
> efficient and lightweight in terms of state management, because it only
> tracks the latest modified timestamp and the last executed timestamp.
> However, timestamps do not work as expected in some file systems, causing
> List processors to miss files periodically. See NIFI-3332 comments for
> details.
> In order to pick every entity that has not been seen before or has been
> updated since it was last seen, we need another set of processors using a
> different approach, namely tracking listed entities:
> * Add a new abstract processor AbstractWatchEntries, similar to
> AbstractListProcessor but using a different approach
> * Target entities have: name (path), size and last-modified-timestamp
> * Implementation processors have the following properties:
> ** 'Watch Time Window' to limit the maximum time period for which
> already-listed entries are held. E.g. if set to '30min', the processor keeps
> entities listed in the last 30 mins.
> ** 'Minimum File Age' to defer listing entities that are potentially still
> being written
> * Any entity that was added but never listed, and whose
> last-modified-timestamp is older than the configured 'Watch Time Window',
> will not be listed. If users need to pick up these items, they have to make
> 'Watch Time Window' longer. That also increases the size of the data the
> processor has to persist in the K/V store. This is an efficiency vs
> reliability trade-off.
> * The already-listed entities are persisted into one of the supported K/V
> stores through the DistributedMapCacheClient service. Users can choose which
> KVS to use from HBase, Redis, Couchbase and File (DistributedMapCacheServer
> with a persistence file).
> * The reason to use a KVS instead of ManagedState is to avoid hammering
> Zookeeper by frequently updating a Zk node with a large amount of data. The
> number of already-listed entries can be huge depending on the use case.
> Also, we can compress entities with DistributedMapCacheClient as it supports
> putting a byte array, while ManagedState only supports Map<String, String>.
> * On each onTrigger:
> ** The processor performs a listing. Listed entries meeting any of the
> following conditions will be written to the 'success' output FlowFile:
> *** Does not exist in the already-listed entities
> *** Has a newer last-modified-timestamp
> *** Has a different size
> ** Already-listed entries that are older than the 'Watch Time Window' are
> discarded from the already-listed entries.
> * The initial supported targets are the local file system, FTP and SFTP

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
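
The per-trigger tracking logic described in the issue above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the class names `EntityTracker` and `Entity` are hypothetical, an in-memory `HashMap` stands in for the external K/V store, and both time settings are passed as plain milliseconds. It also assumes that any entity older than the watch window is skipped, whether or not it was tracked before.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of entity tracking; not the classes from the pull request. */
class EntityTracker {

    /** An entity as described in the issue: name (path), size, last-modified timestamp. */
    static final class Entity {
        final String name;
        final long size;
        final long lastModified; // epoch milliseconds

        Entity(String name, long size, long lastModified) {
            this.name = name;
            this.size = size;
            this.lastModified = lastModified;
        }
    }

    private final long watchWindowMillis; // stands in for 'Watch Time Window'
    private final long minAgeMillis;      // stands in for 'Minimum File Age'

    // Assumption: an in-memory map replaces the external K/V store.
    private final Map<String, Entity> alreadyListed = new HashMap<>();

    EntityTracker(long watchWindowMillis, long minAgeMillis) {
        this.watchWindowMillis = watchWindowMillis;
        this.minAgeMillis = minAgeMillis;
    }

    /** Returns the entities from one listing that are new or updated. */
    List<Entity> onTrigger(List<Entity> listing, long now) {
        final long oldestTracked = now - watchWindowMillis;
        List<Entity> toEmit = new ArrayList<>();

        for (Entity e : listing) {
            if (e.lastModified > now - minAgeMillis) {
                continue; // too young: may still be written, defer to a later run
            }
            if (e.lastModified < oldestTracked) {
                continue; // older than the watch window: not tracked, not listed
            }
            Entity seen = alreadyListed.get(e.name);
            boolean isNew = seen == null;
            boolean isUpdated = seen != null
                    && (e.lastModified > seen.lastModified || e.size != seen.size);
            if (isNew || isUpdated) {
                toEmit.add(e);
                alreadyListed.put(e.name, e);
            }
        }

        // Discard tracked entries that have fallen out of the watch window.
        alreadyListed.values().removeIf(e -> e.lastModified < oldestTracked);
        return toEmit;
    }
}
```

With a 30-minute window, a file first seen on one run is emitted once, ignored on later runs while unchanged, and emitted again if its size or timestamp changes. Making the window longer catches older files at the cost of a larger tracked set, which is the trade-off the issue describes.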