[
https://issues.apache.org/jira/browse/NIFI-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838039#comment-16838039
]
Jasper Knulst commented on NIFI-6286:
-------------------------------------
Since this is not the first time a similar proposal has been made (
https://mail-archives.apache.org/mod_mbox/nifi-dev/201803.mbox/%[email protected]%3E
) I suggest creating a completely separate +ScanHDFS+ processor that accepts
inbound flowfiles to dynamically list HDFS folders, but does not keep state.
Keeping state could work fine, even with tens of thousands of permutations of
directory|regex|Filter_type|recursive etc., but only if an HBase or other
Distributed Cache (DC) backs it. However, it seems that since 1.8 the
community has moved away from supporting a DC for the ListXXX processors.
There must be good reasons for that, and I don't want to reverse it.
Using ZooKeeper as the state store has limitations, so the only sensible
option is to drop the state. Users could still do their own state
check in the flow after ScanHDFS.
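
A user-side state check after a stateless listing step could look roughly like
the sketch below. It is plain Java, not NiFi API: the class name
ListingStateFilter, the method names, and the "filename" filter type value are
all hypothetical, and the in-memory map only stands in for whatever store a
flow would actually use. It keeps one "latest modification time" entry per
directory|regex|Filter_type|recursive permutation, which is the state growth
described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical downstream state check; not part of NiFi. */
final class ListingStateFilter {

    // Newest modification time (epoch millis) already emitted, per listing key.
    // Assumption: an in-memory map stands in for a real, durable state store.
    private final Map<String, Long> latestSeen = new HashMap<>();

    /** Builds one state key per permutation of the listing settings. */
    static String key(String directory, String regex, String filterType, boolean recursive) {
        return directory + "|" + regex + "|" + filterType + "|" + recursive;
    }

    /**
     * Returns the paths whose modification time is newer than anything
     * previously emitted for this key, then advances the stored timestamp.
     * Simplification: a file that shows up later with a timestamp equal to
     * (or older than) the stored one is skipped.
     */
    List<String> filterNew(String key, Map<String, Long> listing) {
        long previous = latestSeen.getOrDefault(key, Long.MIN_VALUE);
        long newest = previous;
        List<String> fresh = new ArrayList<>();
        for (Map.Entry<String, Long> entry : listing.entrySet()) {
            if (entry.getValue() > previous) {
                fresh.add(entry.getKey());
                newest = Math.max(newest, entry.getValue());
            }
        }
        latestSeen.put(key, newest);
        return fresh;
    }
}
```

Each distinct combination of settings gets its own entry, so listing the same
directory with a different regex starts from an empty state. That is the
intended trade-off in this sketch, not a statement about how ListHDFS itself
tracks state.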
> Make listHDFS work as INPUT_ALLOWED processor
> ---------------------------------------------
>
> Key: NIFI-6286
> URL: https://issues.apache.org/jira/browse/NIFI-6286
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Core Framework
> Affects Versions: 1.9.2
> Reporter: Jasper Knulst
> Priority: Major
> Labels: features
>
> Currently the listHDFS processor has a 'Directory' property (the root to
> start the listing from, recursively or not) which only allows a single
> static value.
> There are many use cases where you would want to crawl many roots in
> sequence. There are 2 ways to do it:
> # Allow the 'Directory' property to take multiple comma-separated values
> # Refactor listHDFS as an INPUT_ALLOWED processor and make the 'Directory'
> property take Expression Language (EL) to get directory roots from upstream
> Option 1 has serious restrictions, since the other configuration (such as
> recursive, filter type and regex) would still be static, and it may get very
> complex, non-intuitive and require frequent re-configuration.
> Option 2 is the way to go.
> Some things to consider:
> -The way listHDFS behaves now should be preserved
> -It makes sense to dynamically set 'Directory', 'Recursiveness', 'Regex' and
> 'Filter type' in tandem to be able to detail the way each root directory
> is crawled
> -Switching 'Directory' also requires that not just 1 state is stored, but
> a separate state for each directory that has ever been passed in
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)