Rick, I am finally taking a moment to clear out some dangling threads. I just looked into this one and the link appears to be gone. Have you chosen to withdraw this proposal at this time?
Thanks
Joe

On Fri, Sep 25, 2015 at 4:25 AM, Rick Braddy <[email protected]> wrote:

> Yes. Replication of directory tree via Nifi similar to rsync.
>
> -----Original Message-----
> From: Joe Skora [mailto:[email protected]]
> Sent: Thursday, September 24, 2015 10:16 PM
> To: [email protected]
> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>
> It may be an oversimplification, but for the purposes of understanding, is
> the intent to mirror directory tree with NiFi similar to rsync?
>
> On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <[email protected]> wrote:
>
>> Joe,
>>
>> Thanks for the quick response.
>>
>> Yes, I can add to the Wiki once access has been granted. Further
>> responses:
>>
>> >> GetFile and PutFile do support recursive walking/reconstruction
>> >> based on relative paths
>>
>> Based on my recent testing of 0.3.0, GetFile does walk the configured
>> directory tree, picking up the files it finds; however, only files are
>> sent to PutFile, which places them all into a single target folder
>> (not a directory tree - no directory information is sent by GetFile
>> nor processed by PutFile from what I have seen, so I do not believe it
>> reconstructs the directory tree at all today).
>>
>> >> I do think your proposal modified to consider the design pattern of
>> >> ListFile/FetchFile would be super powerful.
>>
>> We have another processor GetFileList that uses "find" to traverse a
>> target folder tree and feeds the resulting newline delimited
>> file/directory stream as FlowFiles into GetFileData. Perhaps that
>> processor could be evolved into a suitable ListFiles processor.
>>
>> I believe GetFileList/GetFileData correspond roughly to the
>> ListFile/FetchFile concept, based on a cursory review of
>> ListHDFS/FetchHDFS. If it's a matter of renaming that's obviously
>> trivial at this point. I'm assuming there are other facets to that
>> List/Fetch design pattern - is it documented anywhere I can review to
>> learn more?
>>
>> So when we have a ListFile/FetchFile what is the corresponding "Put"
>> side of the flow to be? Perhaps simply PutFile enhanced to handle
>> FlowFiles from both basic GetFile and the richer FetchFile (modified
>> GetFileData) types of FlowFiles and behaviors would suffice.
>>
>> >> Just need to make sure backpressure works through the flow so that
>> >> you could literally handle the delivery of a file which is of itself
>> >> larger than the repo by capturing and sending a chunk of it at a
>> >> time for instance.
>>
>> Agreed. Are there any best practices documented for configuring
>> backpressure properly?
>>
>> Thanks.
>>
>> Rick
>>
>> -----Original Message-----
>> From: Joe Witt [mailto:[email protected]]
>> Sent: Wednesday, September 23, 2015 6:25 PM
>> To: [email protected]
>> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>>
>> Rick
>>
>> This is a perfectly fine place to start the thread. If you'd like to
>> create a wiki feature proposal for it too like we're doing with a lot
>> of the other things at this level we can give you access to create one
>> here [1].
>>
>> Not at all trying to take away from the points you were making but
>> GetFile and PutFile do support recursive walking/reconstruction based
>> on relative paths. By no means is that as comprehensive as you're
>> going for here though - just an FYI.
>>
>> These sound like good things. In particular I find your concept for
>> handling arbitrarily large data interesting. Just need to make sure
>> backpressure works through the flow so that you could literally handle
>> the delivery of a file which is of itself larger than the repo by
>> capturing and sending a chunk of it at a time for instance. So from a
>> brief historical perspective the GetFile / PutFile processors were
>> literally the first two processors ever built for NiFi back when it
>> had no GUI, no provenance, no nothin' that was cool. These are the
>> OGs of NiFi.
>> They have been improved a bit over the years but not much.
>> Why? Because their utility was largely limited to trivial archiving
>> cases. We have recently had discussions about making them more
>> powerful through the concept of ListFile/FetchFile like Adam mentions
>> and as we've started doing with things like HDFS. A much better model
>> for sure. Still not as powerful as what you're cooking up though. I
>> do think your proposal modified to consider the design pattern of
>> ListFile/FetchFile would be super powerful. In your case ListFile for
>> a single larger file for instance could produce N listings that point
>> to the same file on disk but for different offset/ranges. This would
>> be *very* interesting. I am a bit concerned about how to have this
>> nicely handle competing consumer problems but...we can cross that
>> bridge later.
>>
>> If you're willing to tackle this we can definitely work with you to
>> bring it in. It is a non-trivial contribution for sure. Folks often
>> do not consider all the nasty gotchas that can occur in something as
>> seemingly simple as File IO.
>>
>> Thanks
>> Joe
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>>
>> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[email protected]> wrote:
>> > This thread proposes community review/comments of modified versions
>> > of GetFile and PutFile for potential future adoption by the Nifi
>> > community. For those who want to jump straight to the code, here's
>> > the review repository location for the current version:
>> > https://github.com/rickbraddy/nifishare.
>> >
>> > As background, we needed a way to replicate entire directory trees
>> > of files via Nifi, where multiple directory trees can be specified
>> > at run-time as part of an overall Nifi graph.
>> > As Nifi is rooted in file-based processing, it seems reasonable to
>> > continue advancing its abilities to ingest, process, transform and
>> > replicate files in the most flexible manner possible. While this
>> > proposal is not a be all end all in that regard, it moves the needle
>> > in the right direction by making file-processing in Nifi more
>> > dynamic, enabling flows to determine how files (and directories)
>> > should be processed, which goes well beyond today's basic file
>> > ingress/egress process capabilities (which certainly have their
>> > place and uses). Whether it's via this proposal and code or another,
>> > clearly Nifi can benefit from this type of functionality.
>> >
>> > Here's a more detailed explanation of the rationale for developing
>> > these Nifi file processor derivatives and their initial
>> > implementation:
>> >
>> > GetFileData
>> > ----------------
>> > The GetFile processor monitors a single directory tree for file
>> > changes and creates FlowFiles for every changed file in that
>> > configured tree. It does a good job of getting files from a
>> > configurable folder that need to be injected into a graph.
>> > GetFile falls short of other requirements that arise for
>> > general-purpose file processing:
>> >
>> >   - Operates from a single, pre-configured source directory (not
>> >     dynamically configurable at run-time as part of a flow)
>> >
>> >   - Scheduled on a periodic basis only, not event-triggered when
>> >     there's something to do
>> >
>> >   - Does not support sending an entire directory tree (only files
>> >     are sent, not directories)
>> >
>> >   - Is a "source" processor node only, cannot be used within other
>> >     Nifi flow logic that dynamically determines which files or
>> >     directories to get and send as FlowFiles
>> >
>> >   - Assumes each file is smaller than the content repository, which
>> >     causes large files (hundreds of MBs, GBs, TBs) to overrun or
>> >     dominate the content repository
>> >
>> > A modified version of GetFile (currently) named GetFileData has been
>> > developed and is proposed as the basis for a new Nifi processor that
>> > will supplement file ingestion with these features:
>> >
>> >   - Operates based upon inbound FlowFiles that contain the
>> >     filesystem path to a file or directory
>> >
>> >   - Scheduled by incoming FlowFiles containing a file or directory
>> >     path, only runs when there's something to do
>> >
>> >   - Supports sending directory tree as a series of directory and
>> >     file paths; e.g., ExecuteProcess("find /mypath -print") =>
>> >     SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath")
>> >     => GetFileData ...
>> >
>> >   - Participates within simple or complex flows to fetch and send
>> >     files and directories
>> >
>> >   - (To be developed) Is designed to handle any size file, by
>> >     breaking files larger than a "chunkingThreshold" into a series
>> >     of multiple smaller files that can be reassembled on the other
>> >     end (by PutFileData)
>> >
>> > PutFileData
>> > ---------------
>> > The PutFile processor accepts incoming FlowFiles and writes those
>> > files to a single target directory.
>> > It does a good job of handling and resolving conflicts, but falls
>> > short of other requirements that arise for general-purpose file
>> > processing:
>> >
>> >   - Does not support directories, only files
>> >
>> >   - Only supports a single, preconfigured target directory
>> >
>> >   - Cannot reconstruct an entire directory tree based upon relative
>> >     file paths (all files go into a single target directory)
>> >
>> >   - Assumes each file is small enough to fit into the content
>> >     repository
>> >
>> > A modified version of PutFile (currently) named PutFileData has been
>> > developed and is proposed as the basis for a new Nifi processor that
>> > will supplement file egress with these features:
>> >
>> >   - Supports directories and files
>> >
>> >   - Supports reconstruction of entire directory tree based upon
>> >     relative file paths, enabling reconstruction of an entire
>> >     directory tree originating from GetFileData
>> >
>> >   - (To be developed) Is designed to handle any size file, by
>> >     reassembling multi-part files into very large files (TBs) that
>> >     do not fit within the content repository
>> >
>> > Should the community have an interest in these processors (we can
>> > name them something different, if needed), these contributions are
>> > now available. In the meantime, we shall continue developing these
>> > processors to meet our specific use cases, adding the chunking
>> > functionality and QA certifying them for production use at scale.
>> >
>> > Looking forward to comments, feedback and recommendations.
>> >
>> > Here's the Github repo link again:
>> > https://github.com/rickbraddy/nifishare
>> >
>> > Best,
>> > Rick
>> >
>> > P.S. If there's a better vehicle for communicating these types of
>> > proposals, please advise.
>> >
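
For readers following the thread, the two ideas discussed above (rebuilding a directory tree from relative paths, and splitting a file larger than a "chunkingThreshold" into offset/length ranges that are reassembled on the other end) can be sketched roughly as below. This is only an illustrative Python sketch, not NiFi code and not the code in the nifishare repository. The names list_chunks, read_chunk, write_chunk and replicate_tree, and the (offset, length) tuple shape, are assumptions invented for this example.

```python
import os
import tempfile


def list_chunks(path, chunking_threshold):
    """Return (offset, length) ranges that together cover the file.

    Hypothetical helper: a file no larger than the threshold yields one
    range; a larger file yields several ranges of at most
    `chunking_threshold` bytes, all pointing at the same file on disk.
    """
    if chunking_threshold <= 0:
        raise ValueError("chunking_threshold must be positive")
    size = os.path.getsize(path)
    if size == 0:
        return [(0, 0)]  # still emit one listing so empty files are created
    return [
        (offset, min(chunking_threshold, size - offset))
        for offset in range(0, size, chunking_threshold)
    ]


def read_chunk(path, offset, length):
    """Read only the requested byte range, never the whole file."""
    with open(path, "rb") as src:
        src.seek(offset)
        return src.read(length)


def write_chunk(target_path, offset, data):
    """Write one chunk at its offset, creating the target if needed.

    Note: an existing target is not truncated, so stale trailing bytes
    from an older, longer file would survive. A real processor would
    need a conflict-resolution strategy here.
    """
    mode = "r+b" if os.path.exists(target_path) else "wb"
    with open(target_path, mode) as dst:
        dst.seek(offset)
        dst.write(data)


def replicate_tree(source_root, target_root, chunking_threshold):
    """Mirror directories (including empty ones) and files by relative path."""
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel_dir = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(target_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            for offset, length in list_chunks(src, chunking_threshold):
                write_chunk(dst, offset, read_chunk(src, offset, length))


if __name__ == "__main__":
    # Small self-contained demo in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "source")
        os.makedirs(os.path.join(source, "nested"))
        with open(os.path.join(source, "nested", "big.bin"), "wb") as f:
            f.write(b"x" * 10)
        replicate_tree(source, os.path.join(tmp, "target"), chunking_threshold=4)
        print(sorted(os.listdir(os.path.join(tmp, "target", "nested"))))
```

Because each chunk carries its own offset, chunks could in principle arrive out of order and still land in the right place. The sketch does not address the backpressure and competing-consumer concerns raised in the thread.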
