Re: ListSFTP incoming relationship

Andy LoPresto Thu, 29 Mar 2018 12:45:10 -0700

Bryan,

No, that was exactly what I was referencing regarding the attribute output. It 
would have been clearer if I had said it like you did. Thanks.


Andy LoPresto
[email protected]
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 29, 2018, at 10:46 AM, Bryan Bende <[email protected]> wrote:
> 
> Scott,
> 
> You are correct that the overall discussion is about allowing incoming
> flow files to ListSFTP.
> 
> However, the previous discussion on this thread highlighted that the
> main reason ListSFTP currently doesn't allow incoming flow files is
> because of challenges when storing state.
> 
> This led to the proposal of a new processor that allowed incoming flow
> files, but did not store state, thus avoiding the challenges mentioned
> above. If we were going to store state in this new processor, then
> we'd be back to the exact same challenges.
> 
> Providing an option to turn on state also doesn't really help, because
> if an option is provided to users, then the option will be used,
> and it needs to work when it is used.
> 
> If we can come up with something that stores state and works well for
> all scenarios, then we aren't against it, we just need to handle the
> challenges highlighted by Joe's original email.
> 
> Regarding some of the other ideas...
> 
> The current output of ListSFTP already includes flow file attributes
> for each listing that include the full path, filename, last update
> time, owner, group, permissions, and file size. Were you thinking
> of something different than that?
> 
> See the "Writes Attributes" section here:
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
> 
> Thanks,
> 
> Bryan
> 
> 
> 
> On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <[email protected]> wrote:
>> Scott,
>> 
>> I think there are two conversations going on here. You are finding the
>> requirements for your specific use case, and that’s great. But I echo
>> Bryan’s point that a community processor for this scenario should not store
>> state at all. Sivaprasanna’s point that given dynamic directory input,
>> storing state based on that can cause massive data ingestion problems still
>> stands.
>> 
>> For your specific use case, you can prototype (or possibly even get to a
>> stable and robust-enough point) using ExecuteScript to model the behavior
>> you need.
>> 
>> In regards to the desired output format, I would suggest a few items:
>> 
>> * Avro requires a schema to be defined, and this raises the barrier to use
>> of the processor. Also, unless being sent to a processor that understands
>> Avro, the result will need to be converted anyway using Record* processors.
>> * If the output is individual flowfiles on a 1:1 basis, each should have as
>> many attributes populated with the parsed information as possible (i.e.
>> file.name, file.path, file.size, file.owner, file.permissions, etc.). This
>> allows for easily-consumable and routable flowfiles.
>> * If the output is a full directory listing, I would suggest `ls -al` type
>> raw text output, or JSON (arbitrary human-readable and machine-readable
>> format with many consuming/transforming processors).
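
The two output shapes described in the list above can be sketched as follows.
This is a minimal Python illustration, not NiFi code: the RemoteFile class and
both helper functions are hypothetical, and the attribute keys follow the
examples named in the message rather than the exact attribute names that
ListSFTP writes.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RemoteFile:
    # Hypothetical stand-in for one entry of a remote directory listing.
    name: str
    path: str
    size: int
    owner: str
    permissions: str


def to_attributes(entry):
    """Per-file output: build the attribute map for one flowfile."""
    return {
        "file.name": entry.name,
        "file.path": entry.path,
        "file.size": str(entry.size),  # attributes are string-valued
        "file.owner": entry.owner,
        "file.permissions": entry.permissions,
    }


def to_summary_json(entries):
    """Summary output: one JSON document describing the whole listing."""
    return json.dumps([asdict(e) for e in entries], indent=2)


listing = [
    RemoteFile("a.csv", "/data/in", 10, "scott", "rw-r--r--"),
    RemoteFile("b.csv", "/data/in", 20, "scott", "rw-r--r--"),
]
per_file = [to_attributes(e) for e in listing]
summary = to_summary_json(listing)
```

The per-file form lends itself to attribute-based routing, while the summary
form keeps the whole listing in a single piece of content.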
>> 
>> 
>> Andy LoPresto
>> [email protected]
>> [email protected]
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>> On Mar 29, 2018, at 9:34 AM, scott <[email protected]> wrote:
>> 
>> Sorry Bryan, but I disagree with you. Not storing state is NOT the main
>> point of this new processor. The main point is to allow an incoming
>> relationship flowfile to trigger the action, and allow variables to be used
>> from the attributes therein.
>> 
>> I agree that if the NiFi community deems it too risky to distribute this
>> processor with state keeping optionally available, even if the default is to
>> disable it, then so be it. If state is not included optionally, then how
>> about making the output flowfile content include more than just the file
>> names? Have it include last updated time along with the filename. If it
>> searches recursively, you'll want to include the path to the file also.
>> Maybe it would be best to output the results into a structured format, such
>> as AVRO? Or, maybe it would just be best to output one flowfile per remote
>> file found, and include updated time and fully qualified path as attributes?
>> 
>> Scott
>> 
>> 
>> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>> 
>> The main point of the new processor is to NOT store state so that it
>> becomes more reasonable to allow incoming flow files.
>> 
>> You could probably implement your own custom processor that does both
>> because you can make assumptions about how you are going to use it, but if
>> the NiFi community provides one then it needs to work well for all
>> situations, such as dynamically listing hundreds of directories, which is
>> problematic when state is involved.
>> 
>> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <[email protected]>
>> wrote:
>> 
>> Do we really need optional state-saving functionality? If the user is
>> unaware of the implications and proceeds to store the state, then what
>> Andrew Grande mentioned will happen: a potentially never-ending stream of
>> state information being stored. If we still go with the optional state
>> management approach, the documentation has to be clear in explaining the
>> implications.
>> 
>> Sivaprasanna
>> 
>> On Thu, 29 Mar 2018 at 9:28 AM, scott <[email protected]> wrote:
>> 
>> Okay. So, a new processor called "ScanSFTP" that allows an incoming
>> relationship, where the content of the flow file is replaced with the
>> list of matching files from the remote directory, and the list is then
>> filtered by the usual regex parameters as it is today. Optional state
>> information is kept to additionally filter out files older than the
>> newest file observed during the last run. Does that sound okay to
>> everyone? If so, what's the next step?
>> 
>> Scott
>> 
>> 
>> On 03/27/2018 06:21 PM, scott wrote:
>> 
>> This is a great discussion, and I appreciate the interest in my problem.
>> I think there are workarounds if you decide not to store state, but
>> I'd recommend keeping it. I think state should be kept optionally,
>> even turned off by default. Several times I've had issues where the
>> state has caused me to miss files, because files get moved into the
>> source folder out of order, and I've wished I could turn the state
>> feature off.
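
The missed-file problem described above can be reproduced with a simplified
model of timestamp-based state. The sketch below is an assumption about the
general technique (remember the newest modification time seen and only emit
files newer than it), not the actual ListSFTP implementation; the function
name and the sample data are hypothetical.

```python
def list_new_files(listing, last_seen_ts):
    """Return (names newer than the watermark, updated watermark).

    listing is a list of (name, mtime) pairs; last_seen_ts is the stored
    state from the previous run.
    """
    new_names = [name for name, mtime in listing if mtime > last_seen_ts]
    newest = max([mtime for _, mtime in listing], default=last_seen_ts)
    return new_names, max(newest, last_seen_ts)


# Run 1: two files are present and both are picked up.
first_listing = [("a.csv", 100), ("b.csv", 200)]
picked_first, watermark = list_new_files(first_listing, 0)

# Between runs, c.csv is moved into the directory but keeps its older
# modification time (150), which is below the stored watermark (200).
second_listing = first_listing + [("c.csv", 150)]
picked_second, watermark = list_new_files(second_listing, watermark)

# picked_second is empty: c.csv is never emitted.
```

With state switched off, every run would return the full listing and the
out-of-order file would be seen, at the cost of re-listing old files.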
>> 
>> In my current use-case, I would not be frequently, dynamically
>> changing the source directory, though I can see the use-cases where it
>> would be. In my current use-case, I want to use an external database
>> table to control the configuration of all my flows. I do this by first
>> reading the content of the table for this particular flow ID, then
>> assign the result as attributes to the flowfile, essentially creating
>> variables I can use throughout the flow to control its behavior. This
>> works great with flows that initiate with HTTP or SQL, but not
>> ListSFTP or ListFile.
>> 
>> Scott
>> 
>> 
>> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>> 
>> I think Bryan’s point is a good one, and when I first saw this
>> question (and thought of the previous times it’s been asked), my
>> initial response was to propose a second processor.
>> 
>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
>> differently from ListSFTP — it does not maintain state, and performs
>> a one-time tabulation/chronicling of the state of that directory at
>> the given point in time.
>> 
>> The responsibility to maintain and compare state across time is no
>> longer a requirement. There could even be a setting in the processor
>> to allow for “individual flowfile output” (i.e. act the same as
>> ListSFTP and output one flowfile per item listed) or “summary
>> flowfile output” where a single flowfile is generated containing the
>> directory listing information for all the items there. (Another
>> option is to output both on two different relationships).
>> 
>> I think this would enable the types of workflows that users have
>> asked about in the past without compromising the mechanism by which
>> List* processors work and adding undue complexity to those processors.
>> 
>> Absolutely crystal clear documentation (and a standard verb for the
>> new processor family) would be necessary (not only because these
>> processors solve different problems, but to avoid a million variants
>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
>> provide a directory in an attribute to ListSFTP” mailing list
>> questions).
>> 
>> 
>> Andy LoPresto
>> [email protected]
>> [email protected]
>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>> 
>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <[email protected]> wrote:
>> 
>> The key here is that the ListXXX processor maintains state. A directory
>> is part of such state. Allowing arbitrary directories via an expression
>> would create a never-ending stream of new entries in the state storage,
>> effectively engineering a distributed DoS attack on the NiFi node or the
>> shared ZK quorum (for when state is stored in there).
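
The growth concern described above can be illustrated with a toy model. This
is a sketch under stated assumptions, not how NiFi persists state: the
dictionary, the record_listing helper, and the directory names are all
hypothetical. It only shows that a store keyed by directory gains one entry
per distinct directory value that arrives.

```python
def record_listing(state, directory, newest_ts):
    """Remember the newest timestamp seen for a directory (toy state store)."""
    state[directory] = max(newest_ts, state.get(directory, 0))


state = {}

# A fixed directory, listed repeatedly: the store stays at one entry.
for ts in range(1000):
    record_listing(state, "/data/in", ts)
fixed_size = len(state)

# Directories supplied dynamically (for example from flowfile attributes):
# every new value adds an entry, and nothing ever removes one.
for i in range(1000):
    record_listing(state, f"/data/job-{i}", i)
dynamic_size = len(state)
```

Any design that keeps state for dynamic directories would need some bound or
expiry on these entries to contain that growth.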
>> 
>> Maybe if we focus on thinking about assumptions and restrictions the
>> processor should make to contain that risk...
>> 
>> Andrew
>> 
>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <[email protected]> wrote:
>> 
>> I'm not sure that would solve the problem because you'd still be
>> limited to one directory. What most people are asking for is the
>> ability to use a dynamic directory from an incoming flow file.
>> 
>> I think we might be trying to fit two different use-cases into one
>> processor which might not make sense.
>> 
>> Scenario #1... There is a directory that is constantly receiving new
>> data and has a significant amount of files, and I want to periodically
>> find new files. This is what the current processors are optimized for.
>> 
>> Scenario #2... There is a directory that is mostly static with a
>> moderate/small number of files, and at points in my flow I want to
>> dynamically perform a listing of this directory and retrieve the
>> files. This is more geared towards the mentality of running a
>> job/workflow.
>> 
>> 
>> 
>> 
>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <[email protected]> wrote:
>> 
>> What if the changes were ‘on top of’ some base set of properties, like
>> directory? Like a filter that, if present on the incoming flow file, has
>> the LIST* processor list only things that match a name or attribute?
>> 
>> 
>> 
>> On March 27, 2018 at 00:08:41, Joe Witt ([email protected]) wrote:
>> 
>> Scott
>> 
>> This idea has come up a couple of times and there is definitely
>> something intriguing to it. Where I think this idea stalls out though
>> is in implementation.
>> 
>> While I agree that the other List* processors might similarly benefit,
>> let's focus on ListFile. Today you tell ListFile what directory to
>> start looking for files in. It goes off scanning that directory for
>> hits and stores state about what it has already searched/seen. And it
>> is important to keep track of how much it has already scanned because
>> at times the search directory can be massive (hundreds of thousands or
>> more files and directories to scan, for example).
>> 
>> In the proposed model the directory to be scanned could be provided
>> dynamically by looking at an attribute of an incoming flowfile (or
>> other criteria can be provided - not just the directory to scan). In
>> this case the ListFile processor goes on scanning against that now.
>> What about the previous directory (or directories) it was told to
>> scan? Does it still track those too? What if it starts scanning the
>> newly provided directory, hasn't finished pulling all the data or new
>> data is continually arriving, and it is told to switch to another
>> directory?
>> 
>> I think if those questions can get solid answers and someone invests
>> time in creating a PR then this could be pretty powerful. Would be
>> good to see a written description of the use case(s) for this too.
>> 
>> Thanks
>> Joe
>> 
>> On Mon, Mar 26, 2018 at 11:58 PM, scott <[email protected]> wrote:
>> 
>> Hello Devs,
>> 
>> I would like to request a feature for a major processor, ListSFTP. But
>> before I go down the official road, I wanted to ask if anyone thought it
>> was a terrible idea or impossible, etc. The request is to add support for
>> an incoming relationship to the ListSFTP processor specifically, but I
>> could see it added to many of the commonly used head processors, such as
>> ListFile. I would envision functionality more like InvokeHTTP or
>> ExecuteSQL, where an incoming flow file could initiate the action, and the
>> attributes in the incoming flow file could be used to configure the
>> processor actions. It's the configuration aspect that most appeals to me,
>> because it opens it up to being centrally or dynamically configured.
>> 
>> Thanks,
>> 
>> Scott
>> 
>> 
>> 
>>
