+1 to the Bryan/AndyL recommendation.

"Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
differently from ListSFTP — it does not maintain state, and performs a
one-time tabulation/chronicling of the state of that directory at the
given point in time."
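
To make the quoted idea concrete, here is a minimal sketch under stated assumptions, not a NiFi implementation. It uses a local directory as a stand-in for the SFTP server, and `Entry`, `snapshot`, and `emit` are hypothetical names chosen for illustration. The listing keeps no state between calls, and the two output modes mirror the "individual flowfile output" and "summary flowfile output" options described in the quoted message below.

```python
import json
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Entry:
    """One item found in the directory at listing time (hypothetical type)."""
    name: str
    size: int
    mtime: float


def snapshot(directory):
    """List the regular files in `directory` once; nothing is remembered afterwards."""
    entries = []
    with os.scandir(directory) as it:
        for item in it:
            if item.is_file():
                info = item.stat()
                entries.append(Entry(item.name, info.st_size, info.st_mtime))
    # Sort by name so repeated snapshots of an unchanged directory compare equal.
    return sorted(entries, key=lambda e: e.name)


def emit(entries, mode="individual"):
    """Return the payloads that would be written out, one string per output item.

    "individual": one JSON document per listed file.
    "summary":    a single JSON document holding the whole listing.
    """
    if mode == "individual":
        return [json.dumps(asdict(e)) for e in entries]
    if mode == "summary":
        return [json.dumps([asdict(e) for e in entries])]
    raise ValueError(f"unknown mode: {mode!r}")
```

Because `snapshot` takes the directory as an argument and stores nothing, pointing it at a different directory on every call adds no entries to any state store.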

On Tue, Mar 27, 2018 at 5:05 PM, Andy LoPresto <[email protected]> wrote:
> I think Bryan’s point is a good one and when I first saw this question (and
> thought of the previous times it’s been asked), my initial response was to
> propose a second processor.
>
> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> differently from ListSFTP — it does not maintain state, and performs a
> one-time tabulation/chronicling of the state of that directory at the given
> point in time.
>
> The responsibility to maintain and compare state across time is no longer a
> requirement. There could even be a setting in the processor to allow for
> “individual flowfile output” (i.e. act the same as ListSFTP and output one
> flowfile per item listed) or “summary flowfile output” where a single
> flowfile is generated containing the directory listing information for all
> the items there. (Another option is to output both on two different
> relationships).
>
> I think this would enable the types of workflows that users have asked about
> in the past without compromising the mechanism by which List* processors
> work and adding undue complexity to those processors.
>
> Absolutely crystal clear documentation (and a standard verb for the new
> processor family) would be necessary (not only because these processors solve
> different problems, but to avoid a million variants of “I used ScanSFTP
> processor and it’s not tracking state”/“How do I provide a directory in an
> attribute to ListSFTP” mailing list questions).
>
>
> Andy LoPresto
> [email protected]
> [email protected]
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Mar 27, 2018, at 8:33 AM, Andrew Grande <[email protected]> wrote:
>
> The key here is that ListXXX processor maintains state. A directory is part
> of such state. Allowing arbitrary directories via an expression would
> create a never-ending stream of new entries in the state storage, effectively
> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
> (for when state is stored in there).
>
> Maybe if we focus on thinking about assumptions and restrictions the
> processor should make to contain that risk...
>
> Andrew
>
> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <[email protected]> wrote:
>
> I'm not sure that would solve the problem because you'd still be
> limited to one directory. What most people are asking for is the
> ability to use a dynamic directory from an incoming flow file.
>
> I think we might be trying to fit two different use-cases into one
> processor which might not make sense.
>
> Scenario #1... There is a directory that is constantly receiving new
> data and has a significant number of files, and I want to periodically
> find new files. This is what the current processors are optimized for.
>
> Scenario #2... There is a directory that is mostly static with a
> moderate/small number of files, and at points in my flow I want to
> dynamically perform a listing of this directory and retrieve the
> files. This is more geared towards the mentality of running a
> job/workflow.
>
>
>
>
> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <[email protected]>
> wrote:
>
> What if the changes were ‘on top of’ some base set of properties, like
> directory?
> Like a filter which, if present on the incoming flow file, makes the LIST*
> processor list only things that match a name or attribute?
>
>
>
> On March 27, 2018 at 00:08:41, Joe Witt ([email protected]) wrote:
>
> Scott
>
> This idea has come up a couple of times and there is definitely
> something intriguing to it. Where I think this idea stalls out though
> is in implementation.
>
> While I agree that the other List* processors might similarly benefit,
> let's focus on ListFile. Today you tell ListFile what directory to
> start looking for files in. It goes off scanning that directory for
> hits and stores state about what it has already searched/seen. And it
> is important to keep track of how much it has already scanned because
> at times the search directory can be massive (hundreds of thousands or
> more files and directories to scan, for example).
>
> In the proposed model the directory to be scanned could be provided
> dynamically by looking at an attribute of an incoming flowfile (or
> other criteria can be provided - not just the directory to scan). In
> this case the ListFile processor goes on scanning against that now.
> What about the previous directory (or directories) it was told to
> scan? Does it still track those too? What if it starts scanning the
> newly provided directory, hasn't finished pulling all the data or new
> data is continually arriving, and it is told to switch to another
> directory?
>
> I think if those questions can get solid answers and someone invests
> time in creating a PR then this could be pretty powerful. Would be
> good to see a written description of the use case(s) for this too.
>
> Thanks
> Joe
>
> On Mon, Mar 26, 2018 at 11:58 PM, scott <[email protected]> wrote:
>
> Hello Devs,
>
> I would like to request a feature for a major processor, ListSFTP. But
> before I go down the official road, I wanted to ask if anyone thought it
> was a terrible idea or impossible, etc. The request is to add support for
> an incoming relationship to the ListSFTP processor specifically, but I
> could see it added to many of the commonly used head processors, such as
> ListFile. I would envision functionality more like InvokeHTTP or
> ExecuteSQL, where an incoming flow file could initiate the action, and the
> attributes in the incoming flow file could be used to configure the
> processor actions. It's the configuration aspect that most appeals to me,
> because it opens it up to being centrally or dynamically configured.
>
> Thanks,
>
> Scott
>
>
>
