Bryan,

No, that was exactly what I was referencing regarding the attribute output. It would have been clearer if I had said it the way you did. Thanks.

Andy LoPresto
[email protected]
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69

> On Mar 29, 2018, at 10:46 AM, Bryan Bende <[email protected]> wrote:
>
> Scott,
>
> You are correct that the overall discussion is about allowing incoming
> flow files to ListSFTP.
>
> However, the previous discussion on this thread highlighted that the
> main reason ListSFTP currently doesn't allow incoming flow files is
> because of challenges when storing state.
>
> This led to the proposal of a new processor that allowed incoming flow
> files, but did not store state, thus avoiding the challenges mentioned
> above. If we were going to store state in this new processor, then
> we'd be back to the exact same challenges.
>
> Providing an option to turn on state also doesn't really help, because
> if there is an option provided to users, then the option will be used,
> and it needs to work when it is used.
>
> If we can come up with something that stores state and works well for
> all scenarios, then we aren't against it, we just need to handle the
> challenges highlighted by Joe's original email.
>
> Regarding some of the other ideas...
>
> The current output of ListSFTP already includes flow file attributes
> for each listing that include the full path, filename, last update
> time, owner, group, permissions, and file size.... were you thinking
> of something different than that?
>
> See the "Writes Attributes" section here:
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
>
> Thanks,
>
> Bryan
>
> On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <[email protected]> wrote:
>> Scott,
>>
>> I think there are two conversations going on here. You are finding the
>> requirements for your specific use case, and that’s great. But I echo
>> Bryan’s point that a community processor for this scenario should not
>> store state at all.
>> Sivaprasanna’s point that given dynamic directory input,
>> storing state based on that can cause massive data ingestion problems
>> still stands.
>>
>> For your specific use case, you can prototype (or possibly even get to a
>> stable and robust-enough point) using ExecuteScript to model the behavior
>> you need.
>>
>> In regards to the desired output format, I would suggest a few items:
>>
>> * Avro requires a schema to be defined, and this raises the barrier to
>>   use of the processor. Also, unless being sent to a processor that
>>   understands Avro, the result will need to be converted anyway using
>>   Record* processors.
>> * If the output is individual flowfiles on a 1:1 basis, each should have
>>   as many attributes populated with the parsed information as possible
>>   (i.e. file.name, file.path, file.size, file.owner, file.permissions,
>>   etc.). This allows for easily-consumable and routable flowfiles.
>> * If the output is a full directory listing, I would suggest `ls -al`
>>   type raw text output, or JSON (arbitrary human-readable and
>>   machine-readable format with many consuming/transforming processors).
>>
>>
>> Andy LoPresto
>> [email protected]
>> [email protected]
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>>
>> On Mar 29, 2018, at 9:34 AM, scott <[email protected]> wrote:
>>
>> Sorry Bryan, but I disagree with you. Not storing state is NOT the main
>> point of this new processor. The main point is to allow an incoming
>> relationship flowfile to trigger the action, and allow variables to be
>> used from the attributes therein.
>>
>> I agree that if the NiFi community deems it too risky to distribute this
>> processor with state keeping optionally available, even if the default is
>> to disable it, then so be it. If state is not included optionally, then
>> how about making the output flowfile content include more than just the
>> file names? Have it include last updated time along with the filename.
>> If it searches recursively, you'll want to include the path to the file
>> also. Maybe it would be best to output the results into a structured
>> format, such as Avro? Or, maybe it would just be best to output one
>> flowfile per remote file found, and include updated time and fully
>> qualified path as attributes?
>>
>> Scott
>>
>>
>> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>>
>> The main point of the new processor is to NOT store state so that it
>> becomes more reasonable to allow incoming flow files.
>>
>> You could probably implement your own custom processor that does both
>> because you can make assumptions about how you are going to use it, but
>> if the NiFi community provides one then it needs to work well for all
>> situations, such as dynamically listing hundreds of directories, which is
>> problematic when state is involved.
>>
>> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <[email protected]>
>> wrote:
>>
>> Do we really have to have an optional state saving functionality? If
>> the user is unaware of the implications and proceeds to store the state,
>> then what Andrew Grande mentioned will happen: the possibility of a
>> never-ending stream of state information being stored. If we still go
>> with the optional state management approach, the documentation has to be
>> clear in explaining the implications.
>>
>> Sivaprasanna
>>
>> On Thu, 29 Mar 2018 at 9:28 AM, scott <[email protected]> wrote:
>>
>> Okay. So, a new processor called "ScanSFTP", allowing an incoming
>> relationship, where the content of the flow file is replaced with the
>> list of matching files from the remote directory, then the list is
>> filtered by the usual regex parameters like today. Optional state
>> information is kept to additionally filter the list of files older than
>> the newest file observed during the last run. Does that sound okay to
>> everyone? If so, what's the next step?
>>
>> Scott
>>
>>
>> On 03/27/2018 06:21 PM, scott wrote:
>>
>> This is a great discussion, and I appreciate the interest in my problem.
>> I think there are workarounds if you decide not to store state, but
>> I'd recommend keeping it. I think state should be kept optionally,
>> even turned off by default. Several times I've had issues where the
>> state has caused me to miss files, because files get moved into the
>> source folder out of order, and I've wished I could turn the state
>> feature off.
>>
>> In my current use-case, I would not be frequently, dynamically
>> changing the source directory, though I can see the use-cases where it
>> would be. In my current use-case, I want to use an external database
>> table to control the configuration of all my flows. I do this by first
>> reading the content of the table for this particular flow ID, then
>> assigning the result as attributes to the flowfile, essentially creating
>> variables I can use throughout the flow to control its behavior. This
>> works great with flows that initiate with HTTP or SQL, but not
>> ListSFTP or ListFile.
>>
>> Scott
>>
>>
>> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>>
>> I think Bryan’s point is a good one and when I first saw this
>> question (and thought of the previous times it’s been asked), my
>> initial response was to propose a second processor.
>>
>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
>> differently from ListSFTP: it does not maintain state, and performs
>> a one-time tabulation/chronicling of the state of that directory at
>> the given point in time.
>>
>> The responsibility to maintain and compare state across time is no
>> longer a requirement. There could even be a setting in the processor
>> to allow for “individual flowfile output” (i.e.
>> act the same as ListSFTP and output one flowfile per item listed) or
>> “summary flowfile output” where a single flowfile is generated
>> containing the directory listing information for all the items there.
>> (Another option is to output both on two different relationships).
>>
>> I think this would enable the types of workflows that users have
>> asked about in the past without compromising the mechanism by which
>> List* processors work and adding undue complexity to those processors.
>>
>> Absolutely crystal clear documentation (and a standard verb for the
>> new processor family) would be necessary (not only because these
>> processors solve different problems, but to avoid a million variants
>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
>> provide a directory in an attribute to ListSFTP” mailing list
>> questions).
>>
>>
>> Andy LoPresto
>> [email protected]
>> [email protected]
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>>
>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <[email protected]> wrote:
>>
>> The key here is that the ListXXX processor maintains state. A directory
>> is part of such state. Allowing arbitrary directories via an expression
>> would create a never-ending stream of new entries in the state storage,
>> effectively engineering a distributed DoS attack on the NiFi node or
>> shared ZK quorum (for when state is stored in there).
>>
>> Maybe if we focus on thinking about assumptions and restrictions the
>> processor should make to contain that risk...
>>
>> Andrew
>>
>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <[email protected]> wrote:
>>
>> I'm not sure that would solve the problem because you'd still be
>> limited to one directory. What most people are asking for is the
>> ability to use a dynamic directory from an incoming flow file.
>>
>> I think we might be trying to fit two different use-cases into one
>> processor which might not make sense.
>>
>> Scenario #1... There is a directory that is constantly receiving new
>> data and has a significant number of files, and I want to periodically
>> find new files. This is what the current processors are optimized for.
>>
>> Scenario #2... There is a directory that is mostly static with a
>> moderate/small number of files, and at points in my flow I want to
>> dynamically perform a listing of this directory and retrieve the
>> files. This is more geared towards the mentality of running a
>> job/workflow.
>>
>>
>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <[email protected]> wrote:
>>
>> What if the changes were ‘on top of’ some base set of properties,
>> like directory?
>> Like a filter, where if present from the incoming file will have the
>> LIST* list only things that match a name or attribute?
>>
>>
>> On March 27, 2018 at 00:08:41, Joe Witt ([email protected]) wrote:
>>
>> Scott
>>
>> This idea has come up a couple of times and there is definitely
>> something intriguing to it. Where I think this idea stalls out though
>> is in implementation.
>>
>> While I agree that the other List* processors might similarly benefit,
>> let's focus on ListFile. Today you tell ListFile what directory to
>> start looking for files in. It goes off scanning that directory for
>> hits and stores state about what it has already searched/seen. And it
>> is important to keep track of how much it has already scanned because
>> at times the search directory can be massive (hundreds of thousands or
>> more files and directories to scan, for example).
>>
>> In the proposed model the directory to be scanned could be provided
>> dynamically by looking at an attribute of an incoming flowfile (or
>> other criteria can be provided - not just the directory to scan). In
>> this case the ListFile processor goes on scanning against that now.
>> What about the previous directory (or directories) it was told to
>> scan? Does it still track those too? What if it starts scanning the
>> newly provided directory, hasn't finished pulling all the data or new
>> data is continually arriving, and it is told to switch to another
>> directory?
>>
>> I think if those questions can get solid answers and someone invests
>> time in creating a PR then this could be pretty powerful. Would be
>> good to see a written description of the use case(s) for this too.
>>
>> Thanks
>> Joe
>>
>> On Mon, Mar 26, 2018 at 11:58 PM, scott <[email protected]> wrote:
>>
>> Hello Devs,
>>
>> I would like to request a feature to a major processor, ListSFTP. But
>> before I go down the official road, I wanted to ask if anyone thought
>> it was a terrible idea or impossible, etc. The request is to add
>> support for an incoming relationship to the ListSFTP processor
>> specifically, but I could see it added to many of the commonly used
>> head processes, such as ListFile. I would envision functionality more
>> like InvokeHTTP or ExecuteSQL, where an incoming flow file could
>> initiate the action, and the attributes in the incoming flow file
>> could be used to configure the processor actions. It's the
>> configuration aspect that most appeals to me, because it opens it up
>> to being centrally or dynamically configured.
>>
>> Thanks,
>>
>> Scott
