Okay. I guess I didn't realize how the NiFi devs felt about risk tolerance. I think I can work around it by adding duplicate filtering or implementing some other state management solution. So, what's the next step?
Scott On Thu, Mar 29, 2018, 10:46 AM Bryan Bende <[email protected]> wrote: > Scott, > > You are correct that the overall discussion is about allowing incoming > flow files to ListSFTP. > > However, the previous discussion on this thread highlighted that the > main reason ListSFTP currently doesn't allow incoming flow files is > because of challenges when storing state. > > This led to the proposal of a new processor that allowed incoming flow > files, but did not store state, thus avoiding the challenges mentioned > above. If we were going to store state in this new processor, then > we'd be back to the exact same challenges. > > Providing an option to turn on state also doesn't really help, because > if there is an option provided to users,then the option will be used, > and it needs to work when it is used. > > If we can come up with something that stores state and works well for > all scenarios, then we aren't against it, we just need to handle the > challenges highlighted by Joe's original email. > > Regarding some of the other ideas... > > The current output of ListSFTP already includes flow file attributes > for each listing that include the full path, filename, last update > time, owner, group, permissions, and file size.... were you thinking > of something different than that? > > See the "Writes Attributes" section here: > > https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html > > Thanks, > > Bryan > > > > On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <[email protected]> > wrote: > > Scott, > > > > I think there are two conversations going on here. You are finding the > > requirements for your specific use case, and that’s great. But I echo > > Bryan’s point that a community processor for this scenario should not > store > > state at all. 
Sivaprasanna’s point that given dynamic directory input, > > storing state based on that can cause massive data ingestion problems > still > > stands. > > > > For your specific use case, you can prototype (or possibly even get to a > > stable and robust-enough point) using ExecuteScript to model the behavior > > you need. > > > > In regards to the desired output format, I would suggest a few items: > > > > * Avro requires a schema to be defined, and this raises the barrier to > use > > of the processor. Also, unless being sent to a processor that understands > > Avro, the result will need to be converted anyway using Record* > processors. > > * If the output is individual flowfiles on a 1:1 basis, each should have > as > > many attributes populated with the parsed information as possible (i.e. > > file.name, file.path, file.size, file.owner, file.permissions, etc.). > This > > allows for easily-consumable and routable flowfiles. > > * If the output is a full directory listing, I would suggest `ls -al` > type > > raw text output, or JSON (arbitrary human-readable and machine-readable > > format with many consuming/transforming processors). > > > > > > Andy LoPresto > > [email protected] > > [email protected] > > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69 > > > > On Mar 29, 2018, at 9:34 AM, scott <[email protected]> wrote: > > > > Sorry Bryan, but I disagree with you. Not storing state is NOT the main > > point of this new processor. The main point is to allow an incoming > > relationship flowfile to trigger the action, and allow variables to be > used > > from the attributes therein. > > > > I agree that if the NiFi community deems it too risky to distribute this > > processor with state keeping optionally available, even if the default > is to > > disable it, then so be it. If state is not included optionally, then how > > about making the output flowfile content include more than just the file > > names? 
Have it include last updated time along with the filename. If it > > searches recursively, you'll want to include the path to the file also. > > Maybe it would be best to output the results into a structured format, > such > > as AVRO? Or, maybe it would just be best to output one flowfile per > remote > > file found, and include updated time and fully qualified path as > attributes? > > > > Scott > > > > > > On 03/29/2018 04:32 AM, Bryan Bende wrote: > > > > The main point of the new processor is to NOT store state so that it > > becomes more reasonable to allow incoming flow files. > > > > You could probably implement your own custom processor that does both > > because you can make assumptions about how you are going to use it, but > if > > the NiFi community provides one then it needs to work well for all > > situations, such as dynamically listing hundreds of directories, which is > > problematic when state is involved. > > > > On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <[email protected]> > > wrote: > > > > Should we really have to have an optional state saving functionality? If > > the user is unaware of the implications and proceed to store the state > then > > what Andrew Grande mentioned will happen - possibilities of never ending > > stream of state information being stored. If we still go with the > optional > > state management approach, documentation have to be clear in explaining > the > > implications. > > > > Sivaprasanna > > > > On Thu, 29 Mar 2018 at 9:28 AM, scott <[email protected]> wrote: > > > > Okay. So, a new processor called "ScanSFTP", allow incoming relationship > > where the content of the flow file is replaced with the list of matching > > files from the remote directory, then the list is filtered by the usual > > regex parameters like today. Optional state information is kept to > > additionally filter the list of files older than the newest file > > observed during the last run. Does that sound okay to everyone? 
If so, > > what's the next step? > > > > Scott > > > > > > On 03/27/2018 06:21 PM, scott wrote: > > > > This is a great discussion, and appreciate the interest in my problem. > > I think there are workarounds if you decide not to store state, but > > I'd recommend keeping it. I think state should be kept optionally, > > even turned off by default. Several times I've had issues where the > > state has cause me to miss files, because files get moved into the > > source folder out of order, and I've wished I could turn the state > > feature off. > > > > In my current use-case, I would not be frequently, dynamically > > changing the source directory, though I can see the use-cases where it > > would be. In my current use-case, I want to use an external database > > table to control the configuration of all my flows. I do this by first > > reading the content of the table for this particular flow ID, then > > assign the result as attributes to the flowfile, essentially creating > > variables I can use throughout the flow to control its behavior. This > > works great with flows that initiate with HTTP or SQL, but not > > ListSFTP or ListFile. > > > > Scott > > > > > > On 03/27/2018 02:05 PM, Andy LoPresto wrote: > > > > I think Bryan’s point is a good one and when I first saw this > > question (and thought of the previous times it’s been asked), my > > initial response is to propose a second processor. > > > > Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates > > differently from ListSFTP — it does not maintain state, and performs > > a one-time tabulation/chronicling of the state of that directory at > > the given point in time. > > > > The responsibility to maintain and compare state across time is no > > longer a requirement. There could even be a setting in the processor > > to allow for “individual flowfile output” (i.e. 
act the same as > > ListSFTP and output one flowfile per item listed) or “summary > > flowfile output” where a single flowfile is generated containing the > > directory listing information for all the items there. (Another > > option is to output both on two different relationships). > > > > I think this would enable the types of workflows that users have > > asked about in the past without compromising the mechanism by which > > List* processors work and adding undue complexity to those processors. > > > > Absolutely crystal clear documentation (and a standard verb for the > > new processor family) would be necessary (not only because these > > processor solve different problems, but to avoid a million variants > > of “I used ScanSFTP processor and it’s not tracking state”/“How do I > > provide a directory in an attribute to ListSFTP” mailing list > > questions). > > > > > > Andy LoPresto > > [email protected] <mailto:[email protected]> > > /[email protected] <mailto:[email protected]>/ > > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69 > > > > On Mar 27, 2018, at 8:33 AM, Andrew Grande <[email protected] > > <mailto:[email protected]>> wrote: > > > > The key here is that ListXXX processor maintains state. A directory > > is part > > of such state. Allowing arbitrary directories via an expression would > > create never ending stream of new entries in the state storage, > > effectively > > engineering a distributed DoS attack on the NiFi node or shared ZK > > quorum > > (for when state is stored in there). > > > > Maybe if we focus on thinking about assumptions and restrictions the > > processor should make to contain that risk... > > > > Andrew > > > > On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <[email protected] > > <mailto:[email protected]>> wrote: > > > > I'm not sure that would solve the problem because you'd still be > > limited to one directory. 
What most people are asking for is the > > ability to use a dynamic directory from an incoming flow file. > > > > I think we might be trying to fit two different use-cases into one > > processor which might not make sense. > > > > Scenario #1... There is a directory that is constantly receiving new > > data and has a significant amount of files, and I want to > > > > periodically > > > > find new files. This is what the current processors are optimized > > > > for. > > > > Scenario #2... There is a directory that is mostly static with a > > moderate/small number of files, and at points in my flow I want to > > dynamically perform a listing of this directory and retrieve the > > files. This is more geared towards the mentality of running a > > job/workflow. > > > > > > > > > > On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler > > <[email protected] <mailto:[email protected]>> > > wrote: > > > > What if the changes where ‘on top of’ some base set of properties, > > like > > directory? > > Like a filter, where if present from the incoming file will have > > > > the > > > > LIST* > > > > list only things > > that match a name or attribute? > > > > > > > > On March 27, 2018 at 00:08:41, Joe Witt ([email protected] > > <mailto:[email protected]>) wrote: > > > > Scott > > > > This idea has come up a couple of times and there is definitely > > something intriguing to it. Where I think this idea stalls out > > > > though > > > > is in implementation. > > > > While I agree that the other List* processors might similarly > > > > benefit > > > > lets focus on ListFile. Today you tell ListFile what directory to > > start looking for files in. It goes off scanning that directory for > > hits and stores state about what it has already searched/seen. 
And > > > > it > > > > is important to keep track of how much it has already scanned > > > > because > > > > at times the search directory can be massive (100,000s of thousands > > > > or > > > > more files and directories to scan for example). > > > > In the proposed model the directory to be scanned could be provided > > dynamically by looking at an attribute of an incoming flowfile (or > > other criteria can be provided - not just the directory to scan). > > > > In > > > > this case the ListFile processor goes on scanning against that now. > > What about the previous directory (or directories) it was told to > > scan? Does it still track those too? What if it starts scanning the > > newly provided directory, hasn't finished pulling all the data or > > > > new > > > > data is continually arriving, and it is told to switch to another > > directory. > > > > I think if those questions can get solid answers and someone > > > > invests > > > > time in creating a PR then this could be pretty powerful. Would be > > good to see a written description of the use case(s) for this too. > > > > Thanks > > Joe > > > > On Mon, Mar 26, 2018 at 11:58 PM, scott <[email protected] > > <mailto:[email protected]>> wrote: > > > > Hello Devs, > > > > I would like to request a feature to a major processor, ListSFTP. > > > > But > > > > before > > > > I do down the official road, I wanted to ask if anyone thought it > > was a > > terrible idea or impossible, etc. The request is to add support > > for an > > incoming relationship to the ListSFTP processor specifically, but > > > > I > > > > could > > > > see it added to many of the commonly used head processes, such as > > > > ListFile. > > > > I would envision functionality more like InvokeHTTP or > > ExecuteSQL, where > > > > an > > > > incoming flow file could initiate the action, and the attributes > > in the > > incoming flow file could be used to configure the processor > > > > actions. 
> > > > It's the configuration aspect that most appeals to me, because it opens it up to being centrally or dynamically configured. > > > > Thanks, > > > > Scott
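For the duplicate-filtering workaround mentioned at the top of this message, a minimal sketch might look like the following. It is plain Python rather than NiFi API: SeenFilter, filter_new, max_entries, and the "path"/"mtime" dictionary keys are all hypothetical names chosen for illustration. The cache is bounded so that listing many dynamic directories cannot grow the remembered state without limit, which was the concern raised earlier in the thread.

```python
from collections import OrderedDict


class SeenFilter:
    """Remember recently seen (path, mtime) pairs and drop repeats.

    Hypothetical helper for illustration only; not part of NiFi.
    """

    def __init__(self, max_entries=10000):
        # Upper bound on remembered keys, so state cannot grow forever.
        self.max_entries = max_entries
        self._seen = OrderedDict()

    def filter_new(self, listing):
        """Return only the entries whose (path, mtime) was not seen before."""
        fresh = []
        for entry in listing:
            key = (entry["path"], entry["mtime"])
            if key in self._seen:
                # Already reported: refresh its recency and skip it.
                self._seen.move_to_end(key)
                continue
            self._seen[key] = True
            fresh.append(entry)
            if len(self._seen) > self.max_entries:
                # Evict the least recently seen key.
                self._seen.popitem(last=False)
        return fresh
```

The trade-off is that an entry evicted from the cache is reported as new if it is listed again, so max_entries would need to be sized for the directories involved.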
