Rick,

This is a perfectly fine place to start the thread. If you'd like to create a wiki feature proposal for it too, like we're doing with a lot of the other things at this level, we can give you access to create one here [1].

Not at all trying to take away from the points you were making, but GetFile and PutFile do support recursive walking/reconstruction based on relative paths. By no means is that as comprehensive as what you're going for here though - just an FYI.

These sound like good things. In particular I find your concept for handling arbitrarily large data interesting. We just need to make sure backpressure works through the flow so that you could literally handle the delivery of a file which is itself larger than the repo, by capturing and sending a chunk of it at a time for instance.

So from a brief historical perspective, the GetFile / PutFile processors were literally the first two processors ever built for NiFi, back when it had no GUI, no provenance, no nothin' that was cool. These are the OGs of NiFi. They've been improved a bit over the years but not much. Why? Because their utility was largely limited to trivial archiving cases. We have recently had discussions about making them more powerful through the concept of ListFile/FetchFile like Adam mentions and as we've started doing with things like HDFS. A much better model for sure. Still not as powerful as what you're cooking up though.

I do think your proposal, modified to consider the design pattern of ListFile/FetchFile, would be super powerful. In your case ListFile for a single large file, for instance, could produce N listings that point to the same file on disk but for different offsets/ranges. This would be *very* interesting. I am a bit concerned about how to have this nicely handle competing consumer problems but...we can cross that bridge later.

If you're willing to tackle this we can definitely work with you to bring it in. It is a non-trivial contribution for sure. Folks often do not consider all the nasty gotchas that can occur in something as seemingly simple as File IO.
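
To make the offset/range idea concrete, here is a minimal Python sketch. It is not NiFi code and does not use the NiFi API. The names `RangeListing`, `list_ranges` and `fetch_range` are hypothetical, and the fixed `chunk_size` is an assumption standing in for whatever threshold a real ListFile-style processor would expose. The "list" step emits N small records that all point at the same file on disk, and the "fetch" step reads only the bytes for one record, so no single unit of work has to hold the whole file.

```python
import os
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class RangeListing:
    """Hypothetical listing record: one byte range of one file on disk."""
    path: str
    offset: int
    length: int


def list_ranges(path: str, chunk_size: int) -> Iterator[RangeListing]:
    """Yield listings that together cover the file exactly once, in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    file_size = os.path.getsize(path)
    offset = 0
    while offset < file_size:
        # The final range may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        yield RangeListing(path, offset, length)
        offset += length


def fetch_range(listing: RangeListing) -> bytes:
    """Read only the bytes described by a single listing."""
    with open(listing.path, "rb") as f:
        f.seek(listing.offset)
        return f.read(listing.length)
```

Note that in this sketch an empty file produces no listings at all, and nothing here addresses the competing-consumer concern: two consumers handed the same listing would both read the same range.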

Thanks
Joe

[1] https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals

On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[email protected]> wrote:
> This thread proposes community review/comments of modified versions of
> GetFile and PutFile for potential future adoption by the Nifi community. For
> those who want to jump straight to the code, here's the review repository
> location for the current version: https://github.com/rickbraddy/nifishare.
>
> As background, we needed a way to replicate entire directory trees of files
> via Nifi, where multiple directory trees can be specified at run-time as part
> of an overall Nifi graph. As Nifi is rooted in file-based processing, it
> seems reasonable to continue advancing its abilities to ingest, process,
> transform and replicate files in the most flexible manner possible. While
> this proposal is not a be-all and end-all in that regard, it moves the needle
> in the right direction by making file-processing in Nifi more dynamic,
> enabling flows to determine how files (and directories) should be processed,
> which goes well beyond today's basic file ingress/egress process capabilities
> (which certainly have their place and uses). Whether it's via this proposal
> and code or another, clearly Nifi can benefit from this type of functionality.
>
> Here's a more detailed explanation of the rationale for developing these Nifi
> file processor derivatives and their initial implementation:
>
> GetFileData
> ----------------
> The GetFile processor monitors a single directory tree for file changes and
> creates FlowFiles for every changed file in that configured tree. It does a
> good job of getting files from a configurable folder that need to be injected
> into a graph.
> GetFile falls short of other requirements that arise for
> general-purpose file processing:
>
> - Operates from a single, pre-configured source directory (not
>   dynamically configurable at run-time as part of a flow)
>
> - Scheduled on a periodic basis only, not event-triggered when
>   there's something to do
>
> - Does not support sending an entire directory tree (only files are
>   sent, not directories)
>
> - Is a "source" processor node only, cannot be used within other
>   Nifi flow logic that dynamically determines which files or directories
>   to get and send as FlowFiles
>
> - Assumes each file is smaller than the content repository, which
>   causes large files (hundreds of MBs, GBs, TBs) to overrun or dominate
>   the content repository
>
> A modified version of GetFile (currently) named GetFileData has been
> developed and is proposed as the basis for a new Nifi processor that will
> supplement file ingestion with these features:
>
> - Operates based upon inbound FlowFiles that contain the filesystem
>   path to a file or directory
>
> - Scheduled by incoming FlowFiles containing a file or directory
>   path, only runs when there's something to do
>
> - Supports sending a directory tree as a series of directory and file
>   paths; e.g., ExecuteProcess("find /mypath -print") => SplitText(newline) =>
>   ModifyAttribute(add "file.roodir=/mypath") => GetFileData ...
>
> - Participates within simple or complex flows to fetch and send
>   files and directories
>
> - (To be developed) Is designed to handle any size file, by breaking
>   files larger than a "chunkingThreshold" into a series of multiple smaller
>   files that can be reassembled on the other end (by PutFileData)
>
> PutFileData
> ---------------
> The PutFile processor accepts incoming FlowFiles and writes those files to a
> single target directory.
> It does a good job of handling and resolving
> conflicts, but falls short of other requirements that arise for
> general-purpose file processing:
>
> - Does not support directories, only files
>
> - Only supports a single, preconfigured target directory
>
> - Cannot reconstruct an entire directory tree based upon relative
>   file paths (all files go into a single target directory)
>
> - Assumes each file is small enough to fit into the content
>   repository
>
> A modified version of PutFile (currently) named PutFileData has been
> developed and is proposed as the basis for a new Nifi processor that will
> supplement file egress with these features:
>
> - Supports directories and files
>
> - Supports reconstruction of an entire directory tree based upon
>   relative file paths, enabling reconstruction of an entire directory tree
>   originating from GetFileData
>
> - (To be developed) Is designed to handle any size file, by
>   reassembling multi-part files into very large files (TBs) that do not fit
>   within the content repository
>
> Should the community have an interest in these processors (we can name them
> something different, if needed), these contributions are now available. In
> the meantime, we shall continue developing these processors to meet our
> specific use cases, adding the chunking functionality and QA certifying them
> for production use at scale.
>
> Looking forward to comments, feedback and recommendations.
>
> Here's the Github repo link again: https://github.com/rickbraddy/nifishare
>
> Best,
> Rick
>
> P.S. If there's a better vehicle for communicating these types of proposals,
> please advise.
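
For readers who want a concrete picture of the receiving side described in the quoted proposal (rebuilding a directory tree from relative paths and reassembling multi-part files), here is a minimal Python sketch. It is not the PutFileData implementation and does not use the NiFi API. The function names `put_directory` and `put_chunk`, and the idea that each part carries a relative path plus a byte offset, are assumptions made for illustration only.

```python
import os


def _resolve(target_root: str, relative_path: str) -> str:
    """Map a relative path under target_root, rejecting escapes such as '../x'."""
    root = os.path.realpath(target_root)
    dest = os.path.realpath(os.path.join(root, relative_path))
    if dest == root or os.path.commonpath([root, dest]) != root:
        raise ValueError(f"path escapes target root: {relative_path!r}")
    return dest


def put_directory(target_root: str, relative_path: str) -> str:
    """Recreate a (possibly empty) directory from its relative path."""
    dest = _resolve(target_root, relative_path)
    os.makedirs(dest, exist_ok=True)
    return dest


def put_chunk(target_root: str, relative_path: str, offset: int, data: bytes) -> str:
    """Write one part of a file at its byte offset, creating parent directories."""
    dest = _resolve(target_root, relative_path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # Open without truncating so that parts may arrive in any order.
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
    fd = os.open(dest, flags, 0o644)
    with os.fdopen(fd, "r+b") as f:
        f.seek(offset)
        f.write(data)
    return dest
```

Because each part is written at its own offset, only one part needs to be in flight at a time, which is the property that matters when the whole file is larger than the content repository. The sketch does not track whether all parts have arrived, and it overwrites existing files silently, so conflict resolution of the kind PutFile already offers would still be needed.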
