Re: ListSFTP incoming relationship

scott Wed, 28 Mar 2018 20:59:36 -0700

Okay. So, a new processor called "ScanSFTP" that allows an incoming relationship, where the content of the flow file is replaced with the list of matching files from the remote directory, and the list is then filtered by the usual regex parameters like today. Optional state information is kept to additionally filter out files older than the newest file observed during the last run. Does that sound okay to everyone? If so, what's the next step?
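To make the proposal concrete, here is a rough sketch of the filtering logic in plain Python. It is not NiFi code: `RemoteFile`, `scan`, `file_filter` and `last_seen_mtime` are hypothetical names used only for illustration, and the listing is passed in rather than fetched over SFTP.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RemoteFile:
    # Hypothetical record for one entry in a remote directory listing.
    name: str
    mtime: int  # last-modified time, epoch seconds


def scan(listing: Iterable[RemoteFile],
         file_filter: str,
         last_seen_mtime: Optional[int] = None
         ) -> Tuple[List[RemoteFile], Optional[int]]:
    """Filter a listing by regex, then optionally by stored state.

    Returns the matching files and the newest timestamp observed, which
    the caller may persist as state for the next run (or discard).
    """
    pattern = re.compile(file_filter)
    matched = [f for f in listing if pattern.fullmatch(f.name)]
    if last_seen_mtime is not None:
        # Optional state: keep only files newer than the last run's newest.
        matched = [f for f in matched if f.mtime > last_seen_mtime]
    newest = max((f.mtime for f in matched), default=last_seen_mtime)
    return matched, newest


listing = [RemoteFile("a.csv", 100),
           RemoteFile("b.csv", 200),
           RemoteFile("notes.txt", 300)]

# First run without state: every file matching the regex is returned.
first, state = scan(listing, r".*\.csv")

# Second run with state: nothing is newer than the stored timestamp.
second, state_after = scan(listing, r".*\.csv", last_seen_mtime=state)
```

With the strict comparison used here, a file that shows up later with an older or equal timestamp is skipped, which is one reason to keep the state filter optional.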


Scott



On 03/27/2018 06:21 PM, scott wrote:

This is a great discussion, and I appreciate the interest in my problem. I think there are workarounds if you decide not to store state, but I'd recommend keeping it. I think state should be kept optionally, even turned off by default. Several times I've had issues where the state has caused me to miss files, because files get moved into the source folder out of order, and I've wished I could turn the state feature off.
In my current use-case, I would not be frequently, dynamically changing the source directory, though I can see the use-cases where it would be. In my current use-case, I want to use an external database table to control the configuration of all my flows. I do this by first reading the content of the table for this particular flow ID, then assigning the result as attributes to the flowfile, essentially creating variables I can use throughout the flow to control its behavior. This works great with flows that initiate with HTTP or SQL, but not ListSFTP or ListFile.
Scott


On 03/27/2018 02:05 PM, Andy LoPresto wrote:
I think Bryan’s point is a good one and when I first saw this question (and thought of the previous times it’s been asked), my initial response was to propose a second processor.
Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates differently from ListSFTP: it does not maintain state, and performs a one-time tabulation/chronicling of the state of that directory at the given point in time.
The responsibility to maintain and compare state across time is no longer a requirement. There could even be a setting in the processor to allow for “individual flowfile output” (i.e. act the same as ListSFTP and output one flowfile per item listed) or “summary flowfile output” where a single flowfile is generated containing the directory listing information for all the items there. (Another option is to output both on two different relationships.)
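A minimal sketch of the two output modes, in plain Python rather than against the NiFi API. The `emit` function, the mode names and the dict-based "flowfile" shape are all assumptions made for illustration.

```python
from typing import Dict, List


def emit(listing: List[str], mode: str) -> List[Dict[str, str]]:
    """Turn a directory listing into hypothetical flowfiles (plain dicts)."""
    if mode == "individual":
        # One flowfile per listed item, as ListSFTP does today.
        return [{"filename": name, "content": ""} for name in listing]
    if mode == "summary":
        # A single flowfile whose content is the whole listing.
        return [{"filename": "listing.txt", "content": "\n".join(listing)}]
    raise ValueError(f"unknown output mode: {mode!r}")


files = ["a.csv", "b.csv", "c.csv"]
individual = emit(files, "individual")  # three flowfiles
summary = emit(files, "summary")        # one flowfile
```

Emitting both on separate relationships would amount to calling both branches for the same listing.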
I think this would enable the types of workflows that users have asked about in the past without compromising the mechanism by which List* processors work and adding undue complexity to those processors.
Absolutely crystal clear documentation (and a standard verb for the new processor family) would be necessary (not only because these processors solve different problems, but to avoid a million variants of “I used ScanSFTP processor and it’s not tracking state”/“How do I provide a directory in an attribute to ListSFTP” mailing list questions).
Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
On Mar 27, 2018, at 8:33 AM, Andrew Grande <apere...@gmail.com> wrote:
The key here is that a ListXXX processor maintains state. A directory is
part of such state. Allowing arbitrary directories via an expression would
create a never-ending stream of new entries in the state storage,
effectively engineering a distributed DoS attack on the NiFi node or
shared ZK quorum (for when state is stored in there).

Maybe if we focus on thinking about assumptions and restrictions the
processor should make to contain that risk...
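As one illustration of such a restriction (an assumption for discussion, not something NiFi provides), per-directory state could be capped so that arbitrary directories cannot grow the store without bound. `BoundedListingState` and `max_directories` are hypothetical names.

```python
from collections import OrderedDict
from typing import Optional


class BoundedListingState:
    """Per-directory listing state with a hard cap on tracked directories."""

    def __init__(self, max_directories: int) -> None:
        self._max = max_directories
        self._state: "OrderedDict[str, int]" = OrderedDict()

    def update(self, directory: str, newest_mtime: int) -> None:
        # Record the newest timestamp seen and mark the directory as recent.
        self._state[directory] = newest_mtime
        self._state.move_to_end(directory)
        # Evict the least recently updated directories beyond the cap.
        while len(self._state) > self._max:
            self._state.popitem(last=False)

    def get(self, directory: str) -> Optional[int]:
        return self._state.get(directory)

    def __len__(self) -> int:
        return len(self._state)


state = BoundedListingState(max_directories=2)
state.update("/in/a", 100)
state.update("/in/b", 200)
state.update("/in/c", 300)  # evicts "/in/a", the oldest entry
```

The trade-off is that an evicted directory loses its history, so its files would be listed again the next time it is scanned.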

Andrew
On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbe...@gmail.com> wrote:
I'm not sure that would solve the problem because you'd still be
limited to one directory. What most people are asking for is the
ability to use a dynamic directory from an incoming flow file.

I think we might be trying to fit two different use-cases into one
processor which might not make sense.

Scenario #1... There is a directory that is constantly receiving new
data and has a significant amount of files, and I want to periodically
find new files. This is what the current processors are optimized for.

Scenario #2... There is a directory that is mostly static with a
moderate/small number of files, and at points in my flow I want to
dynamically perform a listing of this directory and retrieve the
files. This is more geared towards the mentality of running a
job/workflow.
On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ottobackwa...@gmail.com> wrote:
What if the changes were ‘on top of’ some base set of properties, like
directory? Like a filter which, if present on the incoming flow file,
makes the List* processor list only things that match a name or attribute?
On March 27, 2018 at 00:08:41, Joe Witt (joe.w...@gmail.com) wrote:
Scott

This idea has come up a couple of times and there is definitely
something intriguing to it. Where I think this idea stalls out though
is in implementation.

While I agree that the other List* processors might similarly benefit,
let's focus on ListFile. Today you tell ListFile what directory to
start looking for files in. It goes off scanning that directory for
hits and stores state about what it has already searched/seen. And it
is important to keep track of how much it has already scanned because
at times the search directory can be massive (hundreds of thousands or
more files and directories to scan, for example).

In the proposed model the directory to be scanned could be provided
dynamically by looking at an attribute of an incoming flowfile (or
other criteria can be provided - not just the directory to scan). In
this case the ListFile processor goes on scanning against that now.
What about the previous directory (or directories) it was told to
scan? Does it still track those too? What if it starts scanning the
newly provided directory, hasn't finished pulling all the data or new
data is continually arriving, and it is told to switch to another
directory?

I think if those questions can get solid answers and someone invests
time in creating a PR then this could be pretty powerful. Would be
good to see a written description of the use case(s) for this too.

Thanks
Joe
On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8...@gmail.com> wrote:
Hello Devs,

I would like to request a feature for a major processor, ListSFTP. But
before I go down the official road, I wanted to ask if anyone thought it
was a terrible idea or impossible, etc. The request is to add support for
an incoming relationship to the ListSFTP processor specifically, but I
could see it added to many of the commonly used head processors, such as
ListFile. I would envision functionality more like InvokeHTTP or
ExecuteSQL, where an incoming flow file could initiate the action, and the
attributes in the incoming flow file could be used to configure the
processor actions. It's the configuration aspect that most appeals to me,
because it opens it up to being centrally or dynamically configured.

Thanks,

Scott
