I think Bryan’s point is a good one, and when I first saw this question (and thought of the previous times it’s been asked), my initial response was to propose a second processor.
Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP”, which operates differently from ListSFTP: it does not maintain state, and it performs a one-time tabulation/chronicling of the state of that directory at the given point in time. The responsibility to maintain and compare state across time is no longer a requirement. There could even be a setting in the processor to allow for “individual flowfile output” (i.e. act the same as ListSFTP and output one flowfile per item listed) or “summary flowfile output”, where a single flowfile is generated containing the directory listing information for all the items there. (Another option is to output both on two different relationships.)

I think this would enable the types of workflows that users have asked about in the past without compromising the mechanism by which List* processors work or adding undue complexity to those processors. Absolutely crystal clear documentation (and a standard verb for the new processor family) would be necessary, not only because these processors solve different problems, but also to avoid a million variants of “I used ScanSFTP processor and it’s not tracking state”/“How do I provide a directory in an attribute to ListSFTP” mailing list questions.

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69

> On Mar 27, 2018, at 8:33 AM, Andrew Grande <apere...@gmail.com> wrote:
>
> The key here is that the ListXXX processor maintains state. A directory
> is part of such state. Allowing arbitrary directories via an expression
> would create a never-ending stream of new entries in the state storage,
> effectively engineering a distributed DoS attack on the NiFi node or
> shared ZK quorum (for when state is stored in there).
>
> Maybe if we focus on thinking about assumptions and restrictions the
> processor should make to contain that risk...
>
> Andrew
>
> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbe...@gmail.com> wrote:
>
>> I'm not sure that would solve the problem because you'd still be
>> limited to one directory. What most people are asking for is the
>> ability to use a dynamic directory from an incoming flow file.
>>
>> I think we might be trying to fit two different use-cases into one
>> processor which might not make sense.
>>
>> Scenario #1... There is a directory that is constantly receiving new
>> data and has a significant amount of files, and I want to periodically
>> find new files. This is what the current processors are optimized for.
>>
>> Scenario #2... There is a directory that is mostly static with a
>> moderate/small number of files, and at points in my flow I want to
>> dynamically perform a listing of this directory and retrieve the
>> files. This is more geared towards the mentality of running a
>> job/workflow.
>>
>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ottobackwa...@gmail.com>
>> wrote:
>>> What if the changes were ‘on top of’ some base set of properties, like
>>> directory?
>>> Like a filter, where if present from the incoming file will have the
>>> LIST* list only things that match a name or attribute?
>>>
>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.w...@gmail.com) wrote:
>>>
>>> Scott
>>>
>>> This idea has come up a couple of times and there is definitely
>>> something intriguing to it. Where I think this idea stalls out though
>>> is in implementation.
>>>
>>> While I agree that the other List* processors might similarly benefit,
>>> let's focus on ListFile. Today you tell ListFile what directory to
>>> start looking for files in. It goes off scanning that directory for
>>> hits and stores state about what it has already searched/seen. And it
>>> is important to keep track of how much it has already scanned because
>>> at times the search directory can be massive (hundreds of thousands or
>>> more files and directories to scan, for example).
>>>
>>> In the proposed model the directory to be scanned could be provided
>>> dynamically by looking at an attribute of an incoming flowfile (or
>>> other criteria can be provided - not just the directory to scan). In
>>> this case the ListFile processor goes on scanning against that now.
>>> What about the previous directory (or directories) it was told to
>>> scan? Does it still track those too? What if it starts scanning the
>>> newly provided directory, hasn't finished pulling all the data or new
>>> data is continually arriving, and it is told to switch to another
>>> directory?
>>>
>>> I think if those questions can get solid answers and someone invests
>>> time in creating a PR then this could be pretty powerful. Would be
>>> good to see a written description of the use case(s) for this too.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8...@gmail.com> wrote:
>>>> Hello Devs,
>>>>
>>>> I would like to request a feature to a major processor, ListSFTP. But
>>>> before I go down the official road, I wanted to ask if anyone thought
>>>> it was a terrible idea or impossible, etc. The request is to add
>>>> support for an incoming relationship to the ListSFTP processor
>>>> specifically, but I could see it added to many of the commonly used
>>>> head processors, such as ListFile. I would envision functionality more
>>>> like InvokeHTTP or ExecuteSQL, where an incoming flow file could
>>>> initiate the action, and the attributes in the incoming flow file
>>>> could be used to configure the processor actions. It's the
>>>> configuration aspect that most appeals to me, because it opens it up
>>>> to being centrally or dynamically configured.
>>>>
>>>> Thanks,
>>>>
>>>> Scott
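
For what it's worth, the stateless "snapshot" idea with its two output modes could be sketched roughly as below. This is only an illustration in Python, not NiFi code: the `FlowFile` class, the function names, the `output_mode` values, and the attribute names are all hypothetical, and a local directory read with `os.scandir` stands in for the SFTP listing.

```python
import json
import os
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Hypothetical stand-in for a flowfile: string attributes plus content."""
    attributes: dict = field(default_factory=dict)
    content: str = ""


def snapshot_directory(directory):
    """List the files in a directory once; nothing is remembered between calls."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file():
                info = entry.stat()
                entries.append({
                    "filename": entry.name,
                    "path": directory,
                    "size": info.st_size,
                    "last_modified": info.st_mtime,
                })
    # Sort by name so the listing order is deterministic.
    return sorted(entries, key=lambda e: e["filename"])


def to_flowfiles(entries, output_mode="individual"):
    """Turn one listing into flowfiles according to the chosen output mode."""
    if output_mode == "individual":
        # One flowfile per listed item, with the listing data as attributes.
        return [FlowFile(attributes={k: str(v) for k, v in e.items()})
                for e in entries]
    if output_mode == "summary":
        # A single flowfile whose content is the whole listing as JSON.
        return [FlowFile(attributes={"listing.count": str(len(entries))},
                         content=json.dumps(entries))]
    raise ValueError(f"unknown output mode: {output_mode}")
```

Because `snapshot_directory` keeps nothing between calls, the directory could come from an incoming flowfile attribute without adding entries to any state store, which is the concern Andrew raised about the List* processors.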