It may be an oversimplification, but for the purposes of understanding, is the intent to mirror a directory tree with NiFi, similar to rsync?
On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <[email protected]> wrote:
> Joe,
>
> Thanks for the quick response.
>
> Yes, I can add to the Wiki once access has been granted. Further responses:
>
> >> GetFile and PutFile do support recursive walking/reconstruction based
> >> on relative paths
>
> Based on my recent testing of 0.3.0, GetFile does walk the configured
> directory tree, picking up the files it finds; however, only files are sent
> to PutFile, which places them all into a single target folder (not a
> directory tree - no directory information is sent by GetFile nor processed
> by PutFile from what I have seen, so I do not believe it reconstructs the
> directory tree at all today).
>
> >> I do think your proposal modified to consider the design pattern of
> >> ListFile/FetchFile would be super powerful.
>
> We have another processor GetFileList that uses "find" to traverse a
> target folder tree and feeds the resulting newline delimited file/directory
> stream as FlowFiles into GetFileData. Perhaps that processor could be
> evolved into a suitable ListFiles processor.
>
> I believe GetFileList/GetFileData correspond roughly to the
> ListFile/FetchFile concept, based on a cursory review of
> ListHDFS/FetchHDFS. If it's a matter of renaming that's obviously trivial
> at this point. I'm assuming there are other facets to that List/Fetch
> design pattern - is it documented anywhere I can review to learn more?
>
> So when we have a ListFile/FetchFile what is the corresponding "Put" side
> of the flow to be? Perhaps simply PutFile enhanced to handle FlowFiles
> from both basic GetFile and the richer FetchFile (modified GetFileData)
> types of FlowFiles and behaviors would suffice.
>
> >> Just need to make sure backpressure works through the flow so that you
> >> could literally handle the delivery of a file which is of itself larger
> >> than the repo by capturing and sending a chunk of it at a time for
> >> instance.
>
> Agreed.
> Are there any best practices documented for configuring backpressure
> properly?
>
> Thanks.
>
> Rick
>
> -----Original Message-----
> From: Joe Witt [mailto:[email protected]]
> Sent: Wednesday, September 23, 2015 6:25 PM
> To: [email protected]
> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>
> Rick
>
> This is a perfectly fine place to start the thread. If you'd like to
> create a wiki feature proposal for it too like we're doing with a lot of
> the other things at this level we can give you access to create one here
> [1].
>
> Not at all trying to take away from the points you were making but GetFile
> and PutFile do support recursive walking/reconstruction based on relative
> paths. By no means is that as comprehensive as you're going for here
> though - just an FYI.
>
> These sound like good things. In particular I find your concept for
> handling arbitrarily large data interesting. Just need to make sure
> backpressure works through the flow so that you could literally handle the
> delivery of a file which is of itself larger than the repo by capturing and
> sending a chunk of it at a time for instance. So from a brief historical
> perspective the GetFile / PutFile processors were literally the first two
> processors ever built for NiFi back when it had no GUI, no provenance, no
> nothin' that was cool. These are the OGs of NiFi. They have been improved
> a bit over the years but not much.
> Why? Because their utility was largely limited to trivial archiving
> cases. We have recently had discussions about making them more powerful
> through the concept of ListFile/FetchFile like Adam mentions and as we've
> started doing with things like HDFS. A much better model for sure. Still
> not as powerful as what you're cooking up though. I do think your proposal
> modified to consider the design pattern of ListFile/FetchFile would be
> super powerful.
> In your case ListFile for a single larger file for instance could produce
> N listings that point to the same file on disk but for different
> offset/ranges. This would be *very* interesting. I am a bit concerned
> about how to have this nicely handle competing consumer problems
> but...we can cross that bridge later.
>
> If you're willing to tackle this we can definitely work with you to bring
> it in. It is a non-trivial contribution for sure. Folks often do not
> consider all the nasty gotchas that can occur in something as seemingly
> simple as File IO.
>
> Thanks
> Joe
>
> [1] https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>
> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[email protected]> wrote:
> > This thread proposes community review/comments of modified versions of
> > GetFile and PutFile for potential future adoption by the NiFi community.
> > For those who want to jump straight to the code, here's the review
> > repository location for the current version:
> > https://github.com/rickbraddy/nifishare.
> >
> > As background, we needed a way to replicate entire directory trees of
> > files via NiFi, where multiple directory trees can be specified at
> > run-time as part of an overall NiFi graph. As NiFi is rooted in
> > file-based processing, it seems reasonable to continue advancing its
> > abilities to ingest, process, transform and replicate files in the most
> > flexible manner possible. While this proposal is not a be-all and
> > end-all in that regard, it moves the needle in the right direction by
> > making file-processing in NiFi more dynamic, enabling flows to determine
> > how files (and directories) should be processed, which goes well beyond
> > today's basic file ingress/egress process capabilities (which certainly
> > have their place and uses). Whether it's via this proposal and code or
> > another, clearly NiFi can benefit from this type of functionality.
> >
> > Here's a more detailed explanation of the rationale for developing these
> > NiFi file processor derivatives and their initial implementation:
> >
> > GetFileData
> > ----------------
> > The GetFile processor monitors a single directory tree for file changes
> > and creates FlowFiles for every changed file in that configured tree. It
> > does a good job of getting files from a configurable folder that need to
> > be injected into a graph. GetFile falls short of other requirements that
> > arise for general-purpose file processing:
> >
> >   - Operates from a single, pre-configured source directory (not
> >     dynamically configurable at run-time as part of a flow)
> >
> >   - Scheduled on a periodic basis only, not event-triggered when
> >     there's something to do
> >
> >   - Does not support sending an entire directory tree (only files
> >     are sent, not directories)
> >
> >   - Is a "source" processor node only, cannot be used within
> >     other NiFi flow logic that dynamically determines which files or
> >     directories to get and send as FlowFiles
> >
> >   - Assumes each file is smaller than the content repository,
> >     which causes large files (hundreds of MBs, GBs, TBs) to overrun or
> >     dominate the content repository
> >
> > A modified version of GetFile (currently) named GetFileData has been
> > developed and is proposed as the basis for a new NiFi processor that will
> > supplement file ingestion with these features:
> >
> >   - Operates based upon inbound FlowFiles that contain the
> >     filesystem path to a file or directory
> >
> >   - Scheduled by incoming FlowFiles containing a file or
> >     directory path, only runs when there's something to do
> >
> >   - Supports sending a directory tree as a series of directory and
> >     file paths; e.g., ExecuteProcess("find /mypath -print") =>
> >     SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath") =>
> >     GetFileData ...
> >
> >   - Participates within simple or complex flows to fetch and send
> >     files and directories
> >
> >   - (To be developed) Is designed to handle any size file, by
> >     breaking files larger than a "chunkingThreshold" into a series of
> >     multiple smaller files that can be reassembled on the other end (by
> >     PutFileData)
> >
> > PutFileData
> > ---------------
> > The PutFile processor accepts incoming FlowFiles and writes those files
> > to a single target directory. It does a good job of handling and
> > resolving conflicts, but falls short of other requirements that arise
> > for general-purpose file processing:
> >
> >   - Does not support directories, only files
> >
> >   - Only supports a single, preconfigured target directory
> >
> >   - Cannot reconstruct an entire directory tree based upon
> >     relative file paths (all files go into a single target directory)
> >
> >   - Assumes each file is small enough to fit into the content
> >     repository
> >
> > A modified version of PutFile (currently) named PutFileData has been
> > developed and is proposed as the basis for a new NiFi processor that will
> > supplement file egress with these features:
> >
> >   - Supports directories and files
> >
> >   - Supports reconstruction of an entire directory tree based upon
> >     relative file paths, enabling reconstruction of an entire directory
> >     tree originating from GetFileData
> >
> >   - (To be developed) Is designed to handle any size file, by
> >     reassembling multi-part files into very large files (TBs) that do
> >     not fit within the content repository
> >
> > Should the community have an interest in these processors (we can name
> > them something different, if needed), these contributions are now
> > available. In the meantime, we shall continue developing these
> > processors to meet our specific use cases, adding the chunking
> > functionality and QA certifying them for production use at scale.
> >
> > Looking forward to comments, feedback and recommendations.
> >
> > Here's the GitHub repo link again:
> > https://github.com/rickbraddy/nifishare
> >
> > Best,
> > Rick
> >
> > P.S. If there's a better vehicle for communicating these types of
> > proposals, please advise.
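
For anyone following the thread who wants to see the two ideas above in concrete terms, here is a rough sketch in plain Python. It is not NiFi code and not taken from the nifishare repository. The function names chunk_ranges and target_path, and the chunk_size parameter, are made up for illustration. The first function shows how one large file could be described as N (offset, length) listings, as Joe suggests; the second shows how a "Put" side might map a relative path onto a target root while refusing paths that would escape it.

```python
from pathlib import PurePosixPath


def chunk_ranges(file_size, chunk_size):
    """Return (offset, length) pairs that together cover file_size bytes.

    Hypothetical helper: each pair could become one listing that points
    at the same file on disk but at a different byte range.
    An empty file yields no ranges and would need its own handling.
    """
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def target_path(target_root, relative_path):
    """Join relative_path onto target_root, rejecting unsafe paths.

    Hypothetical helper: absolute paths and any ".." component are
    refused so a listing cannot write outside the target tree.
    """
    rel = PurePosixPath(relative_path)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"unsafe relative path: {relative_path!r}")
    return str(PurePosixPath(target_root) / rel)


if __name__ == "__main__":
    # A 10-byte file split into chunks of at most 4 bytes.
    for offset, length in chunk_ranges(10, 4):
        print(offset, length)
    print(target_path("/backup", "a/b/c.txt"))
```

The last range is simply shorter than the others, so the receiving side can reassemble by writing each chunk at its offset regardless of arrival order. How competing consumers and backpressure would interact with this is left open, as it is in the thread.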
