Sounds good. For future reference, documentation on the processes that
work best (so far) can be found here:
https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide

Creating a JIRA and attaching a patch in patch reviewed state helps for
sure, as does the Github PR process. Both create a sort of 'pull' that
allows the community to then work on items as available. There are some
List/Fetch items being worked so perhaps some of your ideas will be
addressed there.

Thanks
Joe

On Tue, Nov 3, 2015 at 10:49 AM, Rick Braddy <[email protected]> wrote:
> There was no interest shown by the community so we moved on.
>
>> On Nov 3, 2015, at 3:53 AM, Joe Witt <[email protected]> wrote:
>>
>> Rick,
>>
>> I am finally taking a moment to clear out some dangling threads. I
>> just looked into this one and the link appears to be gone. Have you
>> chosen to withdraw this proposal at this time?
>>
>> Thanks
>> Joe
>>
>>> On Fri, Sep 25, 2015 at 4:25 AM, Rick Braddy <[email protected]> wrote:
>>> Yes. Replication of directory tree via Nifi similar to rsync.
>>>
>>> -----Original Message-----
>>> From: Joe Skora [mailto:[email protected]]
>>> Sent: Thursday, September 24, 2015 10:16 PM
>>> To: [email protected]
>>> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>>>
>>> It may be an oversimplification, but for the purposes of understanding, is
>>> the intent to mirror directory tree with NiFi similar to rsync?
>>>
>>>> On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <[email protected]> wrote:
>>>>
>>>> Joe,
>>>>
>>>> Thanks for the quick response.
>>>>
>>>> Yes, I can add to the Wiki once access has been granted.
>>>> Further responses:
>>>>
>>>>>> GetFile and PutFile do support recursive walking/reconstruction
>>>>>> based on relative paths
>>>>
>>>> Based on my recent testing of 0.3.0, GetFile does walk the configured
>>>> directory tree, picking up the files it finds; however, only files are
>>>> sent to PutFile, which places them all into a single target folder
>>>> (not a directory tree - no directory information is sent by GetFile
>>>> nor processed by PutFile from what I have seen, so I do not believe it
>>>> reconstructs the directory tree at all today).
>>>>
>>>>>> I do think your proposal modified to consider the design pattern of
>>>>>> ListFile/FetchFile would be super powerful.
>>>>
>>>> We have another processor GetFileList that uses "find" to traverse a
>>>> target folder tree and feeds the resulting newline delimited
>>>> file/directory stream as FlowFiles into GetFileData. Perhaps that
>>>> processor could be evolved into a suitable ListFiles processor.
>>>>
>>>> I believe GetFileList/GetFileData correspond roughly to the
>>>> ListFile/FetchFile concept, based on a cursory review of
>>>> ListHDFS/FetchHDFS. If it's a matter of renaming that's obviously
>>>> trivial at this point. I'm assuming there are other facets to that
>>>> List/Fetch design pattern - is it documented anywhere I can review to
>>>> learn more?
>>>>
>>>> So when we have a ListFile/FetchFile what is the corresponding "Put"
>>>> side of the flow to be? Perhaps simply PutFile enhanced to handle
>>>> FlowFiles from both basic GetFile and the richer FetchFile (modified
>>>> GetFileData) types of FlowFiles and behaviors would suffice.
>>>>
>>>>>> Just need to make sure backpressure works through the flow so that
>>>>>> you could literally handle the delivery of a file which is of itself
>>>>>> larger than the repo by capturing and sending a chunk of it at a
>>>>>> time for instance.
>>>>
>>>> Agreed. Are there any best practices documented for configuring
>>>> backpressure properly?
>>>>
>>>> Thanks.
>>>>
>>>> Rick
>>>>
>>>> -----Original Message-----
>>>> From: Joe Witt [mailto:[email protected]]
>>>> Sent: Wednesday, September 23, 2015 6:25 PM
>>>> To: [email protected]
>>>> Subject: Re: Proposal: New file processors: GetFIleData and
>>>> PutFileData
>>>>
>>>> Rick
>>>>
>>>> This is a perfectly fine place to start the thread. If you'd like to
>>>> create a wiki feature proposal for it too like we're doing with a lot
>>>> of the other things at this level we can give you access to create one
>>>> here [1].
>>>>
>>>> Not at all trying to take away from the points you were making but
>>>> GetFile and PutFile do support recursive walking/reconstruction based
>>>> on relative paths. By no means is that as comprehensive as you're
>>>> going for here though - just an FYI.
>>>>
>>>> These sound like good things. In particular I find your concept for
>>>> handling arbitrarily large data interesting. Just need to make sure
>>>> backpressure works through the flow so that you could literally handle
>>>> the delivery of a file which is of itself larger than the repo by
>>>> capturing and sending a chunk of it at a time for instance. So from a
>>>> brief historical perspective the GetFile / PutFile processors were
>>>> literally the first two processors ever built for NiFi back when it
>>>> had no GUI, no provenance, no nothin' that was cool. These are the
>>>> OGs of NiFi. They have been improved a bit over the years but not much.
>>>> Why? Because their utility was largely limited to trivial archiving
>>>> cases. We have recently had discussions about making them more
>>>> powerful through the concept of ListFile/FetchFile like Adam mentions
>>>> and as we've started doing with things like HDFS. A much better model
>>>> for sure. Still not as powerful as what you're cooking up though. I
>>>> do think your proposal modified to consider the design pattern of
>>>> ListFile/FetchFile would be super powerful.
>>>> In your case ListFile for a single larger file for instance could
>>>> produce N listings that point to the same file on disk but for
>>>> different offset/ranges. This would be *very* interesting. I am a bit
>>>> concerned about how to have this nicely handle competing consumer
>>>> problems but...we can cross that bridge later.
>>>>
>>>> If you're willing to tackle this we can definitely work with you to
>>>> bring it in. It is a non-trivial contribution for sure. Folks often
>>>> do not consider all the nasty gotchas that can occur in something as
>>>> seemingly simple as File IO.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> [1]
>>>> https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>>>>
>>>>> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[email protected]> wrote:
>>>>> This thread proposes community review/comments of modified versions
>>>>> of GetFile and PutFile for potential future adoption by the Nifi
>>>>> community. For those who want to jump straight to the code, here's
>>>>> the review repository location for the current version:
>>>>> https://github.com/rickbraddy/nifishare.
>>>>>
>>>>> As background, we needed a way to replicate entire directory trees
>>>>> of files via Nifi, where multiple directory trees can be specified
>>>>> at run-time as part of an overall Nifi graph. As Nifi is rooted in
>>>>> file-based processing, it seems reasonable to continue advancing its
>>>>> abilities to ingest, process, transform and replicate files in the
>>>>> most flexible manner possible. While this proposal is not a be all
>>>>> end all in that regard, it moves the needle in the right direction
>>>>> by making file-processing in Nifi more dynamic, enabling flows to
>>>>> determine how files (and directories) should be processed, which
>>>>> goes well beyond today's basic file ingress/egress process
>>>>> capabilities (which certainly have their place and uses).
>>>>> Whether it's via this proposal and code or another, clearly Nifi can
>>>>> benefit from this type of functionality.
>>>>>
>>>>> Here's a more detailed explanation of the rationale for developing
>>>>> these Nifi file processor derivatives and their initial
>>>>> implementation:
>>>>>
>>>>> GetFileData
>>>>> ----------------
>>>>> The GetFile processor monitors a single directory tree for file
>>>>> changes and creates FlowFiles for every changed file in that
>>>>> configured tree. It does a good job of getting files from a
>>>>> configurable folder that need to be injected into a graph. GetFile
>>>>> falls short of other requirements that arise for general-purpose
>>>>> file processing:
>>>>>
>>>>> - Operates from a single, pre-configured source directory (not
>>>>>   dynamically configurable at run-time as part of a flow)
>>>>>
>>>>> - Scheduled on a periodic basis only, not event-triggered when
>>>>>   there's something to do
>>>>>
>>>>> - Does not support sending an entire directory tree (only files
>>>>>   are sent, not directories)
>>>>>
>>>>> - Is a "source" processor node only, cannot be used within
>>>>>   other Nifi flow logic that dynamically determines which files or
>>>>>   directories to get and send as FlowFiles
>>>>>
>>>>> - Assumes each file is smaller than the content repository,
>>>>>   which causes large files (hundreds of MBs, GBs, TBs) to overrun or
>>>>>   dominate the content repository
>>>>>
>>>>> A modified version of GetFile (currently) named GetFileData has been
>>>>> developed and is proposed as the basis for a new Nifi processor that
>>>>> will supplement file ingestion with these features:
>>>>>
>>>>> - Operates based upon inbound FlowFiles that contain the
>>>>>   filesystem path to a file or directory
>>>>>
>>>>> - Scheduled by incoming FlowFiles containing a file or
>>>>>   directory path, only runs when there's something to do
>>>>>
>>>>> - Supports sending directory tree as a series of directory and
>>>>>   file paths; e.g., ExecuteProcess("find /mypath -print") =>
>>>>>   SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath") =>
>>>>>   GetFileData ...
>>>>>
>>>>> - Participates within simple or complex flows to fetch and send
>>>>>   files and directories
>>>>>
>>>>> - (To be developed) Is designed to handle any size file, by
>>>>>   breaking files larger than a "chunkingThreshold" into a series of
>>>>>   multiple smaller files that can be reassembled on the other end (by
>>>>>   PutFileData)
>>>>>
>>>>> PutFileData
>>>>> ---------------
>>>>> The PutFile processor accepts incoming FlowFiles and writes those
>>>>> files to a single target directory. It does a good job of handling
>>>>> and resolving conflicts, but falls short of other requirements that
>>>>> arise for general-purpose file processing:
>>>>>
>>>>> - Does not support directories, only files
>>>>>
>>>>> - Only supports a single, preconfigured target directory
>>>>>
>>>>> - Cannot reconstruct an entire directory tree based upon
>>>>>   relative file paths (all files go into a single target directory)
>>>>>
>>>>> - Assumes each file is small enough to fit into the content
>>>>>   repository
>>>>>
>>>>> A modified version of PutFile (currently) named PutFileData has been
>>>>> developed and is proposed as the basis for a new Nifi processor that
>>>>> will supplement file egress with these features:
>>>>>
>>>>> - Supports directories and files
>>>>>
>>>>> - Supports reconstruction of entire directory tree based upon
>>>>>   relative file paths, enabling reconstruction of an entire directory
>>>>>   tree originating from GetFileData
>>>>>
>>>>> - (To be developed) Is designed to handle any size file, by
>>>>>   reassembling multi-part files into very large files (TBs) that do
>>>>>   not fit within the content repository
>>>>>
>>>>> Should the community have an interest in these processors (we can
>>>>> name them something different, if needed), these contributions are
>>>>> now available.
>>>>> In the meantime, we shall continue developing these processors to
>>>>> meet our specific use cases, adding the chunking functionality and
>>>>> QA certifying them for production use at scale.
>>>>>
>>>>> Looking forward to comments, feedback and recommendations.
>>>>>
>>>>> Here's the Github repo link again:
>>>>> https://github.com/rickbraddy/nifishare
>>>>>
>>>>> Best,
>>>>> Rick
>>>>>
>>>>> P.S. If there's a better vehicle for communicating these types of
>>>>> proposals, please advise.
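
For readers following the offset/range idea raised earlier in this thread
(one large file listed as N entries that point at the same file on disk,
each covering a different byte range), here is a minimal, language-neutral
sketch in Python. It is not NiFi code and does not use any NiFi API. The
names ChunkListing, list_chunks, fetch_chunk and put_chunk are hypothetical
and chosen only for illustration, and chunk_size stands in for the
"chunkingThreshold" mentioned in the proposal.

```python
import os
from typing import Iterator, NamedTuple


class ChunkListing(NamedTuple):
    """Hypothetical listing entry: one byte range within a single file."""
    path: str
    offset: int
    length: int


def list_chunks(path: str, file_size: int, chunk_size: int) -> Iterator[ChunkListing]:
    """Yield listings that together cover file_size bytes of one file.

    An empty file yields no listings; a real design would have to decide
    how to represent that case.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        yield ChunkListing(path, offset, length)
        offset += length


def fetch_chunk(listing: ChunkListing) -> bytes:
    """Read only the byte range named by the listing."""
    with open(listing.path, "rb") as f:
        f.seek(listing.offset)
        return f.read(listing.length)


def put_chunk(target_path: str, listing: ChunkListing, data: bytes) -> None:
    """Write a chunk at its original offset, creating the target if needed.

    Because each chunk carries its own offset, chunks may arrive in any
    order and the target still ends up byte-identical to the source.
    """
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(target_path, flags, 0o644)
    with os.fdopen(fd, "r+b") as f:
        f.seek(listing.offset)
        f.write(data)
```

Only one chunk needs to be held at a time, which is the property that would
let a file larger than the content repository pass through a flow. The
sketch does not address the competing-consumer concern raised above, nor
backpressure, partial failure, or cleanup of incomplete targets.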
