+1 to the Bryan/AndyL recommendation:

"Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates differently from ListSFTP: it does not maintain state, and performs a one-time tabulation/chronicling of the state of that directory at the given point in time."
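To make the recommendation concrete, here is a rough sketch of what a stateless "snapshot" listing with the two output modes might look like. It is only an illustration under stated assumptions: it lists a local directory instead of an SFTP server, it does not use any NiFi API, and the names snapshot_directory, emit, and the "individual"/"summary" mode strings are made up for this example.

```python
import json
import os
import tempfile
from typing import Dict, List


def snapshot_directory(path: str) -> List[Dict[str, object]]:
    """One-time listing of `path`. Nothing is remembered between calls."""
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                info = entry.stat()
                entries.append({
                    "filename": entry.name,
                    "path": path,
                    "size": info.st_size,
                    "last_modified": info.st_mtime,
                })
    # Sort so that two snapshots of an unchanged directory compare equal.
    return sorted(entries, key=lambda e: e["filename"])


def emit(entries: List[Dict[str, object]], mode: str = "individual") -> List[str]:
    """Return the hypothetical flowfile payloads for the chosen output mode."""
    if mode == "individual":
        return [json.dumps(e) for e in entries]  # one payload per listed item
    if mode == "summary":
        return [json.dumps(entries)]  # a single payload for the whole listing
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # The directory is passed in on every call, e.g. from a flowfile attribute.
    with tempfile.TemporaryDirectory() as directory:
        for name in ("a.txt", "b.txt"):
            with open(os.path.join(directory, name), "w") as handle:
                handle.write("x")
        listing = snapshot_directory(directory)
        print(len(emit(listing, "individual")), "individual payloads")
        print(len(emit(listing, "summary")), "summary payload")
```

Because the directory is an argument and no state is kept, calling this repeatedly with different directories leaves nothing behind to track or clean up, which is the property that separates it from the List* processors.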
On Tue, Mar 27, 2018 at 5:05 PM, Andy LoPresto <[email protected]> wrote:
> I think Bryan’s point is a good one and when I first saw this question (and
> thought of the previous times it’s been asked), my initial response was to
> propose a second processor.
>
> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> differently from ListSFTP: it does not maintain state, and performs a
> one-time tabulation/chronicling of the state of that directory at the given
> point in time.
>
> The responsibility to maintain and compare state across time is no longer a
> requirement. There could even be a setting in the processor to allow for
> “individual flowfile output” (i.e. act the same as ListSFTP and output one
> flowfile per item listed) or “summary flowfile output” where a single
> flowfile is generated containing the directory listing information for all
> the items there. (Another option is to output both on two different
> relationships).
>
> I think this would enable the types of workflows that users have asked about
> in the past without compromising the mechanism by which List* processors
> work and adding undue complexity to those processors.
>
> Absolutely crystal clear documentation (and a standard verb for the new
> processor family) would be necessary (not only because these processors solve
> different problems, but to avoid a million variants of “I used ScanSFTP
> processor and it’s not tracking state”/“How do I provide a directory in an
> attribute to ListSFTP” mailing list questions).
>
> Andy LoPresto
> [email protected]
> [email protected]
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>
> On Mar 27, 2018, at 8:33 AM, Andrew Grande <[email protected]> wrote:
>
> The key here is that ListXXX processor maintains state. A directory is part
> of such state.
> Allowing arbitrary directories via an expression would
> create a never-ending stream of new entries in the state storage, effectively
> engineering a distributed DoS attack on the NiFi node or shared ZK quorum
> (for when state is stored in there).
>
> Maybe if we focus on thinking about assumptions and restrictions the
> processor should make to contain that risk...
>
> Andrew
>
> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <[email protected]> wrote:
>
> I'm not sure that would solve the problem because you'd still be
> limited to one directory. What most people are asking for is the
> ability to use a dynamic directory from an incoming flow file.
>
> I think we might be trying to fit two different use-cases into one
> processor which might not make sense.
>
> Scenario #1... There is a directory that is constantly receiving new
> data and has a significant number of files, and I want to periodically
> find new files. This is what the current processors are optimized for.
>
> Scenario #2... There is a directory that is mostly static with a
> moderate/small number of files, and at points in my flow I want to
> dynamically perform a listing of this directory and retrieve the
> files. This is more geared towards the mentality of running a
> job/workflow.
>
> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <[email protected]> wrote:
>
> What if the changes were ‘on top of’ some base set of properties, like
> directory? Like a filter that, if present on the incoming file, will have
> the LIST* list only things that match a name or attribute?
>
> On March 27, 2018 at 00:08:41, Joe Witt ([email protected]) wrote:
>
> Scott
>
> This idea has come up a couple of times and there is definitely
> something intriguing to it. Where I think this idea stalls out though
> is in implementation.
>
> While I agree that the other List* processors might similarly benefit
> let's focus on ListFile. Today you tell ListFile what directory to
> start looking for files in.
> It goes off scanning that directory for
> hits and stores state about what it has already searched/seen. And it
> is important to keep track of how much it has already scanned because
> at times the search directory can be massive (hundreds of thousands or
> more files and directories to scan, for example).
>
> In the proposed model the directory to be scanned could be provided
> dynamically by looking at an attribute of an incoming flowfile (or
> other criteria can be provided - not just the directory to scan). In
> this case the ListFile processor goes on scanning against that now.
> What about the previous directory (or directories) it was told to
> scan? Does it still track those too? What if it starts scanning the
> newly provided directory, hasn't finished pulling all the data or new
> data is continually arriving, and it is told to switch to another
> directory?
>
> I think if those questions can get solid answers and someone invests
> time in creating a PR then this could be pretty powerful. Would be
> good to see a written description of the use case(s) for this too.
>
> Thanks
> Joe
>
> On Mon, Mar 26, 2018 at 11:58 PM, scott <[email protected]> wrote:
>
> Hello Devs,
>
> I would like to request a feature for a major processor, ListSFTP. But
> before I go down the official road, I wanted to ask if anyone thought it
> was a terrible idea or impossible, etc. The request is to add support for
> an incoming relationship to the ListSFTP processor specifically, but I
> could see it added to many of the commonly used head processors, such as
> ListFile. I would envision functionality more like InvokeHTTP or
> ExecuteSQL, where an incoming flow file could initiate the action, and the
> attributes in the incoming flow file could be used to configure the
> processor actions. It's the configuration aspect that most appeals to me,
> because it opens it up to being centrally or dynamically configured.
>
> Thanks,
> Scott
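
For reference, the state-growth concern raised earlier in the thread (one state entry per directory, unbounded if the directory comes from an expression) can be modeled roughly as follows. This is a deliberately simplified sketch, not how NiFi actually stores or scopes processor state: the StatefulLister class, its list_new method, and the "newest modification time per directory" rule are assumptions made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


class StatefulLister:
    """Toy model: remembers, per directory, the newest mtime already listed."""

    def __init__(self) -> None:
        # directory -> newest modification time seen so far (hypothetical layout)
        self.state: Dict[str, float] = {}

    def list_new(self, directory: str, files: Iterable[Tuple[str, float]]) -> List[str]:
        """Return names of files newer than what was last seen in `directory`."""
        last_seen = self.state.get(directory, float("-inf"))
        new = [(name, mtime) for name, mtime in files if mtime > last_seen]
        if new:
            self.state[directory] = max(mtime for _, mtime in new)
        return [name for name, _ in new]


if __name__ == "__main__":
    lister = StatefulLister()

    # Scenario #1: one fixed directory, listed repeatedly. State stays at one entry.
    lister.list_new("/data", [("a", 1.0), ("b", 2.0)])
    lister.list_new("/data", [("a", 1.0), ("b", 2.0), ("c", 3.0)])
    print("entries after fixed directory:", len(lister.state))

    # Dynamic directories: every distinct value adds an entry that is never removed.
    for job in range(1000):
        lister.list_new(f"/jobs/{job}", [("x", 1.0)])
    print("entries after dynamic directories:", len(lister.state))
```

With a fixed directory the state stays at a single entry no matter how often the listing runs, while directories supplied per flowfile add an entry for each distinct value, which is why a separate stateless processor avoids the problem rather than having to bound it.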
