I think Bryan’s point is a good one, and when I first saw this question (and 
thought of the previous times it’s been asked), my initial response was to 
propose a second processor.

Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates differently 
from ListSFTP — it does not maintain state, and performs a one-time 
tabulation/chronicling of the state of that directory at the given point in 
time.

The responsibility to maintain and compare state across time is no longer a 
requirement. There could even be a setting in the processor to allow for 
“individual flowfile output” (i.e. act the same as ListSFTP and output one 
flowfile per item listed) or “summary flowfile output” where a single flowfile 
is generated containing the directory listing information for all the items 
there. (Another option is to output both, on two different relationships.)
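
To make the two output modes concrete, here is a rough sketch in plain Java. It is not NiFi code: it lists a local directory with java.nio rather than talking to SFTP, and the class name, the OutputMode enum, and the snapshot method are all made up for illustration. The only point is that nothing is remembered between calls, and the same listing can be emitted either per item or as one summary.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; none of these names exist in NiFi.
public class SnapshotListing {

    // The two output styles suggested above (names invented for this sketch).
    public enum OutputMode { INDIVIDUAL, SUMMARY }

    // One-time listing of a directory. No state is kept between calls,
    // so listing the same directory twice returns the same entries twice.
    public static List<String> snapshot(Path directory, OutputMode mode) throws IOException {
        List<String> names;
        try (Stream<Path> entries = Files.list(directory)) {
            names = entries.map(p -> p.getFileName().toString())
                           .sorted()
                           .collect(Collectors.toList());
        }
        if (mode == OutputMode.SUMMARY) {
            // A single "flowfile" whose content is the whole listing.
            return List.of(String.join("\n", names));
        }
        // One "flowfile" per listed item.
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("snapshot-demo");
        Files.createFile(dir.resolve("a.txt"));
        Files.createFile(dir.resolve("b.txt"));
        System.out.println(snapshot(dir, OutputMode.INDIVIDUAL));
        System.out.println(snapshot(dir, OutputMode.SUMMARY));
    }
}
```

Because the directory is just a method argument here, taking it from an incoming flowfile attribute would not add anything to stored state, which is the property the List* processors cannot offer.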

I think this would enable the types of workflows that users have asked about in 
the past without compromising the mechanism by which List* processors work and 
adding undue complexity to those processors.

Absolutely crystal clear documentation (and a standard verb for the new 
processor family) would be necessary (not only because these processors solve 
different problems, but to avoid a million variants of “I used ScanSFTP 
processor and it’s not tracking state”/“How do I provide a directory in an 
attribute to ListSFTP” mailing list questions).


Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 27, 2018, at 8:33 AM, Andrew Grande <apere...@gmail.com> wrote:
> 
> The key here is that a ListXXX processor maintains state. A directory is part
> of such state. Allowing arbitrary directories via an expression would
> create a never-ending stream of new entries in the state storage, effectively
> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
> (for when state is stored in there).
> 
> Maybe if we focus on thinking about assumptions and restrictions the
> processor should make to contain that risk...
> 
> Andrew
> 
> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbe...@gmail.com> wrote:
> 
>> I'm not sure that would solve the problem because you'd still be
>> limited to one directory. What most people are asking for is the
>> ability to use a dynamic directory from an incoming flow file.
>> 
>> I think we might be trying to fit two different use-cases into one
>> processor which might not make sense.
>> 
>> Scenario #1... There is a directory that is constantly receiving new
>> data and has a significant amount of files, and I want to periodically
>> find new files. This is what the current processors are optimized for.
>> 
>> Scenario #2... There is a directory that is mostly static with a
>> moderate/small number of files, and at points in my flow I want to
>> dynamically perform a listing of this directory and retrieve the
>> files. This is more geared towards the mentality of running a
>> job/workflow.
>> 
>> 
>> 
>> 
>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ottobackwa...@gmail.com>
>> wrote:
>>> What if the changes were ‘on top of’ some base set of properties, like
>>> directory?
>>> Like a filter, where if present from the incoming file will have the LIST*
>>> list only things that match a name or attribute?
>>> 
>>> 
>>> 
>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.w...@gmail.com) wrote:
>>> 
>>> Scott
>>> 
>>> This idea has come up a couple of times and there is definitely
>>> something intriguing to it. Where I think this idea stalls out though
>>> is in implementation.
>>> 
>>> While I agree that the other List* processors might similarly benefit,
>>> let's focus on ListFile. Today you tell ListFile what directory to
>>> start looking for files in. It goes off scanning that directory for
>>> hits and stores state about what it has already searched/seen. And it
>>> is important to keep track of how much it has already scanned because
>>> at times the search directory can be massive (hundreds of thousands or
>>> more files and directories to scan, for example).
>>> 
>>> In the proposed model the directory to be scanned could be provided
>>> dynamically by looking at an attribute of an incoming flowfile (or
>>> other criteria can be provided - not just the directory to scan). In
>>> this case the ListFile processor goes on scanning against that now.
>>> What about the previous directory (or directories) it was told to
>>> scan? Does it still track those too? What if it starts scanning the
>>> newly provided directory, hasn't finished pulling all the data or new
>>> data is continually arriving, and it is told to switch to another
>>> directory?
>>> 
>>> I think if those questions can get solid answers and someone invests
>>> time in creating a PR then this could be pretty powerful. Would be
>>> good to see a written description of the use case(s) for this too.
>>> 
>>> Thanks
>>> Joe
>>> 
>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8...@gmail.com> wrote:
>>>> Hello Devs,
>>>> 
>>>> I would like to request a feature for a major processor, ListSFTP. But
>>>> before I go down the official road, I wanted to ask if anyone thought it
>>>> was a terrible idea or impossible, etc. The request is to add support for
>>>> an incoming relationship to the ListSFTP processor specifically, but I
>>>> could see it added to many of the commonly used head processes, such as
>>>> ListFile. I would envision functionality more like InvokeHTTP or
>>>> ExecuteSQL, where an incoming flow file could initiate the action, and
>>>> the attributes in the incoming flow file could be used to configure the
>>>> processor actions. It's the configuration aspect that most appeals to me,
>>>> because it opens it up to being centrally or dynamically configured.
>>>> 
>>>> Thanks,
>>>> 
>>>> Scott
>>>> 
>> 
