Re: ListSFTP incoming relationship

scott Sun, 01 Apr 2018 09:41:40 -0700

Okay. I guess I didn't realize how the NiFi devs felt about risk tolerance. I
think I can work around it by adding duplicate filtering or implementing some
other state management solution.
So, what's the next step?
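
A minimal sketch of the duplicate-filtering workaround mentioned above,
assuming a stateless listing step that hands over every remote file on each
run. The names RemoteFile and DuplicateFilter are hypothetical and are not
part of NiFi; the cap on remembered entries is an assumption added to keep
the memory of seen files from growing without bound.

```python
from collections import OrderedDict
from typing import Iterable, List, NamedTuple


class RemoteFile(NamedTuple):
    """One entry of a remote directory listing (hypothetical shape)."""
    path: str
    last_modified: int  # epoch milliseconds
    size: int


class DuplicateFilter:
    """Remembers recently seen listing entries and drops repeats."""

    def __init__(self, max_entries: int = 10000) -> None:
        # Insertion-ordered so the least recently seen key can be evicted.
        self._seen: "OrderedDict[tuple, None]" = OrderedDict()
        self._max_entries = max_entries

    def new_files(self, listing: Iterable[RemoteFile]) -> List[RemoteFile]:
        fresh = []
        for entry in listing:
            # A changed timestamp or size counts as a new version of the file.
            key = (entry.path, entry.last_modified, entry.size)
            if key in self._seen:
                self._seen.move_to_end(key)  # refresh recency, skip duplicate
                continue
            self._seen[key] = None
            if len(self._seen) > self._max_entries:
                self._seen.popitem(last=False)  # evict least recently seen
            fresh.append(entry)
        return fresh
```

Because the key does not depend on listing order, files that land in the
source folder out of order are still picked up once. The trade-off is that
an entry evicted from the bounded memory would be emitted again if it is
still present in a later listing.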


Scott

On Thu, Mar 29, 2018, 10:46 AM Bryan Bende <[email protected]> wrote:

> Scott,
>
> You are correct that the overall discussion is about allowing incoming
> flow files to ListSFTP.
>
> However, the previous discussion on this thread highlighted that the
> main reason ListSFTP currently doesn't allow incoming flow files is
> because of challenges when storing state.
>
> This led to the proposal of a new processor that allowed incoming flow
> files, but did not store state, thus avoiding the challenges mentioned
> above. If we were going to store state in this new processor, then
> we'd be back to the exact same challenges.
>
> Providing an option to turn on state also doesn't really help, because
> if there is an option provided to users, then the option will be used,
> and it needs to work when it is used.
>
> If we can come up with something that stores state and works well for
> all scenarios, then we aren't against it, we just need to handle the
> challenges highlighted by Joe's original email.
>
> Regarding some of the other ideas...
>
> The current output of ListSFTP already includes flow file attributes
> for each listing that include the full path, filename, last update
> time, owner, group, permissions, and file size. Were you thinking
> of something different than that?
>
> See the "Writes Attributes" section here:
>
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
>
> Thanks,
>
> Bryan
>
>
>
> On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <[email protected]>
> wrote:
> > Scott,
> >
> > I think there are two conversations going on here. You are finding the
> > requirements for your specific use case, and that’s great. But I echo
> > Bryan’s point that a community processor for this scenario should not
> > store state at all. Sivaprasanna’s point that given dynamic directory
> > input, storing state based on that can cause massive data ingestion
> > problems still stands.
> >
> > For your specific use case, you can prototype (or possibly even get to a
> > stable and robust-enough point) using ExecuteScript to model the behavior
> > you need.
> >
> > In regards to the desired output format, I would suggest a few items:
> >
> > * Avro requires a schema to be defined, and this raises the barrier to
> > use of the processor. Also, unless being sent to a processor that
> > understands Avro, the result will need to be converted anyway using
> > Record* processors.
> > * If the output is individual flowfiles on a 1:1 basis, each should have
> > as many attributes populated with the parsed information as possible
> > (i.e. file.name, file.path, file.size, file.owner, file.permissions,
> > etc.). This allows for easily-consumable and routable flowfiles.
> > * If the output is a full directory listing, I would suggest `ls -al`
> > type raw text output, or JSON (arbitrary human-readable and
> > machine-readable format with many consuming/transforming processors).
> >
> >
> > Andy LoPresto
> > [email protected]
> > [email protected]
> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> >
> > On Mar 29, 2018, at 9:34 AM, scott <[email protected]> wrote:
> >
> > Sorry Bryan, but I disagree with you. Not storing state is NOT the main
> > point of this new processor. The main point is to allow an incoming
> > relationship flowfile to trigger the action, and allow variables to be
> > used from the attributes therein.
> >
> > I agree that if the NiFi community deems it too risky to distribute this
> > processor with state keeping optionally available, even if the default
> > is to disable it, then so be it. If state is not included optionally,
> > then how about making the output flowfile content include more than just
> > the file names? Have it include last updated time along with the
> > filename. If it searches recursively, you'll want to include the path to
> > the file also. Maybe it would be best to output the results into a
> > structured format, such as AVRO? Or, maybe it would just be best to
> > output one flowfile per remote file found, and include updated time and
> > fully qualified path as attributes?
> >
> > Scott
> >
> >
> > On 03/29/2018 04:32 AM, Bryan Bende wrote:
> >
> > The main point of the new processor is to NOT store state so that it
> > becomes more reasonable to allow incoming flow files.
> >
> > You could probably implement your own custom processor that does both
> > because you can make assumptions about how you are going to use it, but
> > if the NiFi community provides one then it needs to work well for all
> > situations, such as dynamically listing hundreds of directories, which
> > is problematic when state is involved.
> >
> > On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <[email protected]>
> > wrote:
> >
> > Do we really have to have optional state-saving functionality? If the
> > user is unaware of the implications and proceeds to store the state,
> > then what Andrew Grande mentioned will happen: the possibility of a
> > never-ending stream of state information being stored. If we still go
> > with the optional state management approach, the documentation has to be
> > clear in explaining the implications.
> >
> > Sivaprasanna
> >
> > On Thu, 29 Mar 2018 at 9:28 AM, scott <[email protected]> wrote:
> >
> > Okay. So, a new processor called "ScanSFTP", allow incoming relationship
> > where the content of the flow file is replaced with the list of matching
> > files from the remote directory, then the list is filtered by the usual
> > regex parameters like today. Optional state information is kept to
> > additionally filter the list of files older than the newest file
> > observed during the last run. Does that sound okay to everyone? If so,
> > what's the next step?
> >
> > Scott
> >
> >
> > On 03/27/2018 06:21 PM, scott wrote:
> >
> > This is a great discussion, and I appreciate the interest in my problem.
> > I think there are workarounds if you decide not to store state, but
> > I'd recommend keeping it. I think state should be kept optionally,
> > even turned off by default. Several times I've had issues where the
> > state has caused me to miss files, because files get moved into the
> > source folder out of order, and I've wished I could turn the state
> > feature off.
> >
> > In my current use-case, I would not be frequently, dynamically
> > changing the source directory, though I can see the use-cases where it
> > would be. In my current use-case, I want to use an external database
> > table to control the configuration of all my flows. I do this by first
> > reading the content of the table for this particular flow ID, then
> > assign the result as attributes to the flowfile, essentially creating
> > variables I can use throughout the flow to control its behavior. This
> > works great with flows that initiate with HTTP or SQL, but not
> > ListSFTP or ListFile.
> >
> > Scott
> >
> >
> > On 03/27/2018 02:05 PM, Andy LoPresto wrote:
> >
> > I think Bryan’s point is a good one and when I first saw this
> > question (and thought of the previous times it’s been asked), my
> > initial response was to propose a second processor.
> >
> > Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
> > differently from ListSFTP — it does not maintain state, and performs
> > a one-time tabulation/chronicling of the state of that directory at
> > the given point in time.
> >
> > The responsibility to maintain and compare state across time is no
> > longer a requirement. There could even be a setting in the processor
> > to allow for “individual flowfile output” (i.e. act the same as
> > ListSFTP and output one flowfile per item listed) or “summary
> > flowfile output” where a single flowfile is generated containing the
> > directory listing information for all the items there. (Another
> > option is to output both on two different relationships).
> >
> > I think this would enable the types of workflows that users have
> > asked about in the past without compromising the mechanism by which
> > List* processors work and adding undue complexity to those processors.
> >
> > Absolutely crystal clear documentation (and a standard verb for the
> > new processor family) would be necessary (not only because these
> > processors solve different problems, but to avoid a million variants
> > of “I used ScanSFTP processor and it’s not tracking state”/“How do I
> > provide a directory in an attribute to ListSFTP” mailing list
> > questions).
> >
> >
> > Andy LoPresto
> > [email protected]
> > [email protected]
> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> >
> > On Mar 27, 2018, at 8:33 AM, Andrew Grande <[email protected]> wrote:
> >
> > The key here is that the ListXXX processor maintains state. A directory
> > is part of such state. Allowing arbitrary directories via an expression
> > would create a never-ending stream of new entries in the state storage,
> > effectively engineering a distributed DoS attack on the NiFi node or
> > shared ZK quorum (for when state is stored in there).
> >
> > Maybe if we focus on thinking about assumptions and restrictions the
> > processor should make to contain that risk...
> >
> > Andrew
> >
> > On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <[email protected]> wrote:
> >
> > I'm not sure that would solve the problem because you'd still be
> > limited to one directory. What most people are asking for is the
> > ability to use a dynamic directory from an incoming flow file.
> >
> > I think we might be trying to fit two different use-cases into one
> > processor which might not make sense.
> >
> > Scenario #1... There is a directory that is constantly receiving new
> > data and has a significant amount of files, and I want to periodically
> > find new files. This is what the current processors are optimized for.
> >
> > Scenario #2... There is a directory that is mostly static with a
> > moderate/small number of files, and at points in my flow I want to
> > dynamically perform a listing of this directory and retrieve the
> > files. This is more geared towards the mentality of running a
> > job/workflow.
> >
> >
> >
> >
> > On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <[email protected]> wrote:
> >
> > What if the changes were ‘on top of’ some base set of properties, like
> > directory? Like a filter which, if present on the incoming flow file,
> > makes the LIST* processor list only things that match a name or
> > attribute?
> >
> >
> >
> > On March 27, 2018 at 00:08:41, Joe Witt ([email protected]) wrote:
> >
> > Scott
> >
> > This idea has come up a couple of times and there is definitely
> > something intriguing to it. Where I think this idea stalls out though
> > is in implementation.
> >
> > While I agree that the other List* processors might similarly benefit,
> > let's focus on ListFile. Today you tell ListFile what directory to
> > start looking for files in. It goes off scanning that directory for
> > hits and stores state about what it has already searched/seen. And it
> > is important to keep track of how much it has already scanned because
> > at times the search directory can be massive (hundreds of thousands or
> > more files and directories to scan, for example).
> >
> > In the proposed model the directory to be scanned could be provided
> > dynamically by looking at an attribute of an incoming flowfile (or
> > other criteria can be provided - not just the directory to scan). In
> > this case the ListFile processor goes on scanning against that now.
> > What about the previous directory (or directories) it was told to
> > scan? Does it still track those too? What if it starts scanning the
> > newly provided directory, hasn't finished pulling all the data or new
> > data is continually arriving, and it is told to switch to another
> > directory?
> >
> > I think if those questions can get solid answers and someone invests
> > time in creating a PR then this could be pretty powerful. Would be
> > good to see a written description of the use case(s) for this too.
> >
> > Thanks
> > Joe
> >
> > On Mon, Mar 26, 2018 at 11:58 PM, scott <[email protected]> wrote:
> >
> > Hello Devs,
> >
> > I would like to request a feature for a major processor, ListSFTP. But
> > before I go down the official road, I wanted to ask if anyone thought it
> > was a terrible idea or impossible, etc. The request is to add support
> > for an incoming relationship to the ListSFTP processor specifically, but
> > I could see it added to many of the commonly used head processors, such
> > as ListFile. I would envision functionality more like InvokeHTTP or
> > ExecuteSQL, where an incoming flow file could initiate the action, and
> > the attributes in the incoming flow file could be used to configure the
> > processor actions. It's the configuration aspect that most appeals to
> > me, because it opens it up to being centrally or dynamically configured.
> >
> > Thanks,
> >
> > Scott
> >
> >
> >
> >
>
