There was no interest shown by the community so we moved on.
> On Nov 3, 2015, at 3:53 AM, Joe Witt <[email protected]> wrote:
>
> Rick,
>
> I am finally taking a moment to clear out some dangling threads. I
> just looked into this one and the link appears to be gone. Have you
> chosen to withdraw this proposal at this time?
>
> Thanks
> Joe
>
>> On Fri, Sep 25, 2015 at 4:25 AM, Rick Braddy <[email protected]> wrote:
>> Yes. Replication of directory tree via Nifi similar to rsync.
>>
>> -----Original Message-----
>> From: Joe Skora [mailto:[email protected]]
>> Sent: Thursday, September 24, 2015 10:16 PM
>> To: [email protected]
>> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>>
>> It may be an oversimplification, but for the purposes of understanding, is
>> the intent to mirror directory tree with NiFi similar to rsync?
>>
>>> On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <[email protected]> wrote:
>>>
>>> Joe,
>>>
>>> Thanks for the quick response.
>>>
>>> Yes, I can add to the Wiki once access has been granted. Further responses:
>>>
>>>>> GetFile and PutFile do support recursive walking/reconstruction based
>>>>> on relative paths
>>>
>>> Based on my recent testing of 0.3.0, GetFile does walk the configured
>>> directory tree, picking up the files it finds; however, only files are
>>> sent to PutFile, which places them all into a single target folder
>>> (not a directory tree - no directory information is sent by GetFile
>>> nor processed by PutFile from what I have seen, so I do not believe it
>>> reconstructs the directory tree at all today).
>>>
>>>>> I do think your proposal modified to consider the design pattern of
>>>>> ListFile/FetchFile would be super powerful.
>>>
>>> We have another processor GetFileList that uses "find" to traverse a
>>> target folder tree and feeds the resulting newline delimited
>>> file/directory stream as FlowFiles into GetFileData. Perhaps that
>>> processor could be evolved into a suitable ListFiles processor.
>>>
>>> I believe GetFileList/GetFileData correspond roughly to the
>>> ListFile/FetchFile concept, based on a cursory review of
>>> ListHDFS/FetchHDFS. If it's a matter of renaming that's obviously
>>> trivial at this point. I'm assuming there are other facets to that
>>> List/Fetch design pattern - is it documented anywhere I can review to learn
>>> more?
>>>
>>> So when we have a ListFile/FetchFile what is the corresponding "Put"
>>> side of the flow to be? Perhaps simply PutFile enhanced to handle
>>> FlowFiles from both basic GetFile and the richer FetchFile (modified
>>> GetFileData) types of FlowFiles and behaviors would suffice.
>>>
>>>>> Just need to make sure backpressure works through the flow so that you
>>>>> could literally handle the delivery of a file which is of itself
>>>>> larger than the repo by capturing and sending a chunk of it at a time for
>>>>> instance.
>>>
>>> Agreed. Are there any best practices documented for configuring
>>> backpressure properly?
>>>
>>> Thanks.
>>>
>>> Rick
>>>
>>> -----Original Message-----
>>> From: Joe Witt [mailto:[email protected]]
>>> Sent: Wednesday, September 23, 2015 6:25 PM
>>> To: [email protected]
>>> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>>>
>>> Rick
>>>
>>> This is a perfectly fine place to start the thread. If you'd like to
>>> create a wiki feature proposal for it too like we're doing with a lot
>>> of the other things at this level we can give you access to create one
>>> here [1].
>>>
>>> Not at all trying to take away from the points you were making but
>>> GetFile and PutFile do support recursive walking/reconstruction based
>>> on relative paths. By no means is that as comprehensive as you're
>>> going for here though - just an FYI.
>>>
>>> These sound like good things. In particular I find your concept for
>>> handling arbitrarily large data interesting.
>>> Just need to make sure
>>> backpressure works through the flow so that you could literally handle
>>> the delivery of a file which is of itself larger than the repo by
>>> capturing and sending a chunk of it at a time for instance. So from a
>>> brief historical perspective the GetFile / PutFile processors were
>>> literally the first two processors ever built for NiFi back when it
>>> had no GUI, no provenance, no nothin' that was cool. These are the
>>> OGs of NiFi. They have been improved a bit over the years but not much.
>>> Why? Because their utility was largely limited to trivial archiving
>>> cases. We have recently had discussions about making them more
>>> powerful through the concept of ListFile/FetchFile like Adam mentions
>>> and as we've started doing with things like HDFS. A much better model
>>> for sure. Still not as powerful as what you're cooking up though. I
>>> do think your proposal modified to consider the design pattern of
>>> ListFile/FetchFile would be super powerful. In your case ListFile for
>>> a single larger file for instance could produce N listings that point
>>> to the same file on disk but for different offset/ranges. This would
>>> be *very* interesting. I am a bit concerned about how to have this
>>> nicely handle competing consumer problems but...we can cross that bridge
>>> later.
>>>
>>> If you're willing to tackle this we can definitely work with you to
>>> bring it in. It is a non-trivial contribution for sure. Folks often
>>> do not consider all the nasty gotchas that can occur in something as
>>> seemingly simple as File IO.
>>>
>>> Thanks
>>> Joe
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>>>
>>>> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[email protected]> wrote:
>>>> This thread proposes community review/comments of modified versions of
>>>> GetFile and PutFile for potential future adoption by the Nifi community.
>>>> For those who want to jump straight to the code, here's the review
>>>> repository location for the current version:
>>>> https://github.com/rickbraddy/nifishare.
>>>>
>>>> As background, we needed a way to replicate entire directory trees of
>>>> files via Nifi, where multiple directory trees can be specified at
>>>> run-time as part of an overall Nifi graph. As Nifi is rooted in
>>>> file-based processing, it seems reasonable to continue advancing its
>>>> abilities to ingest, process, transform and replicate files in the
>>>> most flexible manner possible. While this proposal is not a be all
>>>> end all in that regard, it moves the needle in the right direction by
>>>> making file-processing in Nifi more dynamic, enabling flows to
>>>> determine how files (and directories) should be processed, which goes
>>>> well beyond today's basic file ingress/egress process capabilities
>>>> (which certainly have their place and uses). Whether it's via this
>>>> proposal and code or another, clearly Nifi can benefit from this type of
>>>> functionality.
>>>>
>>>> Here's a more detailed explanation of the rationale for developing these
>>>> Nifi file processor derivatives and their initial implementation:
>>>>
>>>> GetFileData
>>>> ----------------
>>>> The GetFile processor monitors a single directory tree for file changes
>>>> and creates FlowFiles for every changed file in that configured tree.
>>>> It does a good job of getting files from a configurable folder that
>>>> need to be injected into a graph.
>>>> GetFile falls short of other
>>>> requirements that arise for general-purpose file processing:
>>>>
>>>> - Operates from a single, pre-configured source directory (not
>>>>   dynamically configurable at run-time as part of a flow)
>>>>
>>>> - Scheduled on a periodic basis only, not event-triggered when
>>>>   there's something to do
>>>>
>>>> - Does not support sending an entire directory tree (only files
>>>>   are sent, not directories)
>>>>
>>>> - Is a "source" processor node only, cannot be used within
>>>>   other Nifi flow logic that dynamically determines which files or
>>>>   directories to get and send as FlowFiles
>>>>
>>>> - Assumes each file is smaller than the content repository,
>>>>   which causes large files (hundreds of MBs, GBs, TBs) to overrun or
>>>>   dominate the content repository
>>>>
>>>> A modified version of GetFile (currently) named GetFileData has been
>>>> developed and is proposed as the basis for a new Nifi processor that
>>>> will supplement file ingestion with these features:
>>>>
>>>> - Operates based upon inbound FlowFiles that contain the
>>>>   filesystem path to a file or directory
>>>>
>>>> - Scheduled by incoming FlowFiles containing a file or
>>>>   directory path, only runs when there's something to do
>>>>
>>>> - Supports sending directory tree as a series of directory and
>>>>   file paths; e.g., ExecuteProcess("find /mypath -print") =>
>>>>   SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath") =>
>>>>   GetFileData ...
>>>>
>>>> - Participates within simple or complex flows to fetch and send
>>>>   files and directories
>>>>
>>>> - (To be developed) Is designed to handle any size file, by
>>>>   breaking files larger than a "chunkingThreshold" into a series of
>>>>   multiple smaller files that can be reassembled on the other end (by
>>>>   PutFileData)
>>>>
>>>> PutFileData
>>>> ---------------
>>>> The PutFile processor accepts incoming FlowFiles and writes those files
>>>> to a single target directory.
>>>> It does a good job of handling and
>>>> resolving conflicts, but falls short of other requirements that arise
>>>> for general-purpose file processing:
>>>>
>>>> - Does not support directories, only files
>>>>
>>>> - Only supports a single, preconfigured target directory
>>>>
>>>> - Cannot reconstruct an entire directory tree based upon
>>>>   relative file paths (all files go into a single target directory)
>>>>
>>>> - Assumes each file is small enough to fit into the content
>>>>   repository
>>>>
>>>> A modified version of PutFile (currently) named PutFileData has been
>>>> developed and is proposed as the basis for a new Nifi processor that
>>>> will supplement file egress with these features:
>>>>
>>>> - Supports directories and files
>>>>
>>>> - Supports reconstruction of entire directory tree based upon
>>>>   relative file paths, enabling reconstruction of an entire directory
>>>>   tree originating from GetFileData
>>>>
>>>> - (To be developed) Is designed to handle any size file, by
>>>>   reassembling multi-part files into very large files (TBs) that do not
>>>>   fit within the content repository
>>>>
>>>> Should the community have an interest in these processors (we can name
>>>> them something different, if needed), these contributions are now
>>>> available. In the meantime, we shall continue developing these
>>>> processors to meet our specific use cases, adding the chunking
>>>> functionality and QA certifying them for production use at scale.
>>>>
>>>> Looking forward to comments, feedback and recommendations.
>>>>
>>>> Here's the Github repo link again:
>>>> https://github.com/rickbraddy/nifishare
>>>>
>>>> Best,
>>>> Rick
>>>>
>>>> P.S. If there's a better vehicle for communicating these types of
>>>> proposals, please advise.
>>>
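
The chunking idea raised in the quoted thread (a listing step emitting N entries that point at the same file on disk with different offset/ranges, driven by something like the proposed "chunkingThreshold") could be sketched roughly as follows. This is a standalone Java illustration only: ChunkRanges, Range and listChunks are hypothetical names, not part of NiFi or of the nifishare repository, and a real processor would additionally have to carry the ranges as FlowFile attributes and cope with backpressure and competing consumers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not NiFi API: computes the byte ranges a "list" step
// could emit for one large file so that a "fetch" step reads one range at a time.
public final class ChunkRanges {

    // One listing entry: a byte range within a single file on disk.
    public static final class Range {
        public final long offset;
        public final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Splits a file of fileSize bytes into consecutive ranges of at most
    // chunkSize bytes. An empty file yields no ranges.
    public static List<Range> listChunks(long fileSize, long chunkSize) {
        if (fileSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and chunkSize > 0");
        }
        List<Range> ranges = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += chunkSize) {
            // The last range may be shorter than chunkSize.
            ranges.add(new Range(offset, Math.min(chunkSize, fileSize - offset)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10-byte file with a 4-byte chunk size gives ranges 0+4, 4+4, 8+2.
        for (Range r : listChunks(10, 4)) {
            System.out.println(r.offset + "+" + r.length);
        }
    }
}
```

On the put side, the same offsets would tell the writer where each part belongs in the target file, so parts could arrive in any order.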
