[
https://issues.apache.org/jira/browse/NIFI-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539707#comment-16539707
]
ASF GitHub Bot commented on NIFI-5406:
--------------------------------------
GitHub user ijokarumawak opened a pull request:
https://github.com/apache/nifi/pull/2876
NIFI-5406: [WIP] Adding WatchEntities processors.
Although it still has many TODO items, I thought it would be a good idea to
share the new listing model early so that everyone can comment on it. Thanks.
----
Thank you for submitting a contribution to Apache NiFi.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
### For all changes:
- [ ] Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
- [ ] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number
you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [ ] Has your PR been rebased against the latest commit within the target
branch (typically master)?
- [ ] Is your initial contribution a single, squashed commit?
### For code changes:
- [ ] Have you ensured that the full suite of tests is executed via mvn
-Pcontrib-check clean install at the root nifi folder?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main
LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main
NOTICE file found under nifi-assembly?
- [ ] If adding new Properties, have you added .displayName in addition to
.name (programmatic access) for each of the new properties?
### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in
which it is rendered?
### Note:
Please ensure that once the PR is submitted, you check travis-ci for build
issues and submit an update to your PR as soon as possible.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ijokarumawak/nifi nifi-5406
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nifi/pull/2876.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2876
----
commit 9577b6985ab49ee5c0bbaa9dc42df1b3bc6987a3
Author: Koji Kawamura <ijokarumawak@...>
Date: 2018-07-10T08:00:38Z
WIP: Adding WatchEntities processors.
----
> Add processors to list new or updated files by tracking listed entities
> -----------------------------------------------------------------------
>
> Key: NIFI-5406
> URL: https://issues.apache.org/jira/browse/NIFI-5406
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Koji Kawamura
> Assignee: Koji Kawamura
> Priority: Major
>
> The current List processors (ListFile, ListFTP, ListSFTP, etc.) rely on the
> file last-modified timestamp to pick new or updated files. This approach is
> efficient and lightweight in terms of state management, because it only
> tracks the latest modified timestamp and the last executed timestamp.
> However, timestamps do not work as expected on some file systems, causing
> List processors to miss files periodically. See the NIFI-3332 comments for
> details.
> In order to pick up every entity that has not been seen before or has been
> updated since it was last seen, we need another set of processors using a
> different approach, namely tracking listed entities:
> * Add a new abstract processor, AbstractWatchEntries, similar to
> AbstractListProcessor but using a different approach
> * Target entities have: name (path), size and last-modified-timestamp
> * Implementing processors have the following properties:
> ** 'Watch Time Window' to limit the maximum time period for which
> already-listed entries are held. E.g. if set to '30min', the processor keeps
> entities listed in the last 30 minutes.
> ** 'Minimum File Age' to defer listing entities that may still be being
> written
> * Any entity that has been added but never listed, and whose
> last-modified-timestamp is older than the configured 'Watch Time Window',
> will not be listed. If users need to pick up these items, they have to make
> 'Watch Time Window' longer, which also increases the size of the data the
> processor has to persist in the K/V store. This is an efficiency vs
> reliability trade-off.
> * The already-listed entities are persisted into one of the supported K/V
> stores through the DistributedMapCacheClient service. Users can choose which
> KVS to use from HBase, Redis, Couchbase and File (DistributedMapCacheServer
> with a persistence file).
> * The reason to use a KVS instead of ManagedState is to avoid hammering
> ZooKeeper by frequently updating a ZK node with a large amount of data. The
> number of already-listed entries can be huge depending on the use case.
> Also, we can compress entities with DistributedMapCacheClient as it supports
> putting a byte array, while ManagedState only supports Map<String, String>.
> * On each onTrigger:
> ** The processor performs a listing. Listed entries meeting any of the
> following conditions will be written to the 'success' output FlowFile:
> *** Do not exist in the already-listed entities
> *** Have a newer last-modified-timestamp
> *** Have a different size
> ** Already-listed entries that are older than the 'Watch Time Window' are
> discarded from the already-listed entries.
> * The initial supported targets are the local file system, FTP and SFTP
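
The per-trigger tracking rules in the description above could be sketched roughly as follows. This is only an illustration, not code from the pull request: the `EntityTracker` and `Entity` names are hypothetical, an in-memory `HashMap` stands in for the K/V store behind DistributedMapCacheClient, and 'Minimum File Age' is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the entity-tracking rules; not the PR's classes. */
class EntityTracker {

    /** Attributes named in the issue: name (path), size, last-modified timestamp. */
    static final class Entity {
        final String name;
        final long size;
        final long lastModified;

        Entity(String name, long size, long lastModified) {
            this.name = name;
            this.size = size;
            this.lastModified = lastModified;
        }
    }

    // Assumption: stands in for the external K/V store used in the proposal.
    private final Map<String, Entity> alreadyListed = new HashMap<>();
    private final long watchWindowMillis;

    EntityTracker(long watchWindowMillis) {
        this.watchWindowMillis = watchWindowMillis;
    }

    /** Returns the entities one listing pass would write to 'success'. */
    List<Entity> track(List<Entity> listing, long nowMillis) {
        final long oldestTracked = nowMillis - watchWindowMillis;
        List<Entity> toEmit = new ArrayList<>();

        for (Entity e : listing) {
            if (e.lastModified < oldestTracked) {
                // Older than the watch window: never listed.
                continue;
            }
            Entity seen = alreadyListed.get(e.name);
            boolean isNew = seen == null;
            boolean changed = !isNew
                    && (e.lastModified > seen.lastModified || e.size != seen.size);
            if (isNew || changed) {
                toEmit.add(e);
                alreadyListed.put(e.name, e);
            }
        }

        // Discard tracked entries that have aged out of the watch window.
        alreadyListed.values().removeIf(e -> e.lastModified < oldestTracked);
        return toEmit;
    }

    /** Number of entries currently held, i.e. the state that must be persisted. */
    int trackedCount() {
        return alreadyListed.size();
    }
}
```

A longer watch window makes `trackedCount()` grow, which is the efficiency vs reliability trade-off mentioned in the description.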