It may be an oversimplification, but for the purposes of understanding, is
the intent to mirror a directory tree with NiFi, similar to rsync?
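
To make the question concrete, here is a minimal sketch of what I mean by
mirroring: every directory and file is recreated under a target root at the
same path relative to the source root. This is plain Python, not NiFi code,
and the name mirror_tree is made up for illustration. Unlike rsync, it does
no delta transfer and never deletes anything from the target.

```python
import shutil
from pathlib import Path


def mirror_tree(source_root: Path, target_root: Path) -> list:
    """Copy every directory and file under source_root into target_root,
    keeping each entry's path relative to source_root.

    Returns the relative paths that were processed (hypothetical helper,
    for illustration only).
    """
    processed = []
    # Sorting puts a parent directory before the entries it contains.
    for entry in sorted(source_root.rglob("*")):
        relative = entry.relative_to(source_root)
        destination = target_root / relative
        if entry.is_dir():
            # Directories are recreated explicitly, so empty ones survive.
            destination.mkdir(parents=True, exist_ok=True)
        else:
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(entry, destination)  # copies content and timestamps
        processed.append(relative)
    return processed
```

The key point is that the relative path travels with each entry, which is
the piece of information the thread below says is lost today.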

On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <[email protected]> wrote:

> Joe,
>
> Thanks for the quick response.
>
> Yes, I can add to the Wiki once access has been granted. Further responses:
>
> >> GetFile and PutFile do support recursive walking/reconstruction based
> on relative paths
>
> Based on my recent testing of 0.3.0, GetFile does walk the configured
> directory tree, picking up the files it finds; however, only files are sent
> to PutFile, which places them all into a single target folder (not a
> directory tree - no directory information is sent by GetFile nor processed
> by PutFile from what I have seen, so I do not believe it reconstructs the
> directory tree at all today).
>
> >> I do think your proposal modified to consider the design pattern of
> ListFile/FetchFile would be super powerful.
>
> We have another processor GetFileList that uses "find" to traverse a
> target folder tree and feeds the resulting newline delimited file/directory
> stream as FlowFiles into GetFileData.  Perhaps that processor could be
> evolved into a suitable ListFiles processor.
>
> I believe GetFileList/GetFileData correspond roughly to the
> ListFile/FetchFile concept, based on a cursory review of
> ListHDFS/FetchHDFS.  If it's a matter of renaming that's obviously trivial
> at this point.  I'm assuming there are other facets to that List/Fetch
> design pattern - is it documented anywhere I can review to learn more?
>
> So when we have a ListFile/FetchFile what is the corresponding "Put" side
> of the flow to be?  Perhaps simply PutFile enhanced to handle FlowFiles
> from both basic GetFile and the richer FetchFile (modified GetFileData)
> types of FlowFiles and behaviors would suffice.
>
> >> Just need to make sure backpressure works through the flow so that you
> could literally handle the delivery of a file which is of itself larger
> than the repo by capturing and sending a chunk of it at a time for instance.
>
> Agreed. Are there any best practices documented for configuring
> backpressure properly?
>
> Thanks.
>
> Rick
>
> -----Original Message-----
> From: Joe Witt [mailto:[email protected]]
> Sent: Wednesday, September 23, 2015 6:25 PM
> To: [email protected]
> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>
> Rick
>
> This is a perfectly fine place to start the thread.  If you'd like to
> create a wiki feature proposal for it too like we're doing with a lot of
> the other things at this level we can give you access to create one here
> [1].
>
> Not at all trying to take away from the points you were making but GetFile
> and PutFile do support recursive walking/reconstruction based on relative
> paths.  By no means is that as comprehensive as you're going for here
> though - just an FYI.
>
> These sound like good things.  In particular I find your concept for
> handling arbitrarily large data interesting.  Just need to make sure
> backpressure works through the flow so that you could literally handle the
> delivery of a file which is of itself larger than the repo by capturing and
> sending a chunk of it at a time for instance.  So from a brief historical
> perspective the GetFile / PutFile processors were literally the first two
> processors ever built for NiFi back when it had no GUI, no provenance, no
> nothin' that was cool.  These are the OGs of NiFi.  They have been improved a
> bit over the years but not much.
> Why?  Because their utility was largely limited to trivial archiving
> cases.  We have recently had discussions about making them more powerful
> through the concept of ListFile/FetchFile like Adam mentions and as we've
> started doing with things like HDFS.  A much better model for sure.  Still
> not as powerful as what you're cooking up though.  I do think your proposal
> modified to consider the design pattern of ListFile/FetchFile would be
> super powerful.  In your case ListFile for a single larger file for
> instance could produce N listings that point to the same file on disk but
> for different offset/ranges.  This would be *very* interesting.  I am a bit
> concerned about how to have this nicely handle competing consumer problems
> but...we can cross that bridge later.
>
> If you're willing to tackle this we can definitely work with you to bring
> it in.  It is a non-trivial contribution for sure.  Folks often do not
> consider all the nasty gotchas that can occur in something as seemingly
> simple as File IO.
>
> Thanks
> Joe
>
> [1]
> https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>
> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[email protected]> wrote:
> > This thread proposes community review/comments of modified versions of
> GetFile and PutFile for potential future adoption by the Nifi community.
> For those who want to jump straight to the code, here's the review
> repository location for the current version:
> https://github.com/rickbraddy/nifishare.
> >
> > As background, we needed a way to replicate entire directory trees of
> files via Nifi, where multiple directory trees can be specified at run-time
> as part of an overall Nifi graph. As Nifi is rooted in file-based
> processing, it seems reasonable to continue advancing its abilities to
> ingest, process, transform and replicate files in the most flexible manner
> possible.  While this proposal is not a be all end all in that regard, it
> moves the needle in the right direction by making file-processing in Nifi
> more dynamic, enabling flows to determine how files (and directories)
> should be processed, which goes well beyond today's basic file
> ingress/egress process capabilities (which certainly have their place and
> uses).  Whether it's via this proposal and code or another, clearly Nifi
> can benefit from this type of functionality.
> >
> > Here's a more detailed explanation of the rationale for developing these
> Nifi file processor derivatives and their initial implementation:
> >
> > GetFileData
> > ----------------
> > The GetFile processor monitors a single directory tree for file changes
> and creates FlowFiles for every changed file in that configured tree. It
> does a good job of getting files from a configurable folder that need to be
> injected into a graph. GetFile falls short of other requirements that arise
> for general-purpose file processing:
> >
> > -          Operates from a single, pre-configured source directory (not
> dynamically configurable at run-time as part of a flow)
> >
> > -          Scheduled on a periodic basis only, not event-triggered when
> there's something to do
> >
> > -          Does not support sending an entire directory tree (only files
> are sent, not directories)
> >
> > -          Is a "source" processor node only, cannot be used within
> other Nifi flow logic that dynamically determines which files or
> directories to get and send as FlowFiles
> >
> > -          Assumes each file is smaller than the content repository,
> which causes large files (hundreds of MBs, GBs, TBs) to overrun or
> dominate the content repository
> >
> > A modified version of GetFile (currently) named GetFileData has been
> developed and is proposed as the basis for a new Nifi processor that will
> supplement file ingestion with these features:
> >
> > -          Operates based upon inbound FlowFiles that contain the
> filesystem path to a file or directory
> >
> > -          Scheduled by incoming FlowFiles containing a file or
> directory path, only runs when there's something to do
> >
> > -          Supports sending directory tree as a series of directory and
> file paths; e.g., ExecuteProcess("find /mypath -print") =>
> SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath") =>
> GetFileData ...
> >
> > -          Participates within simple or complex flows to fetch and send
> files and directories
> >
> > -          (To be developed) Is designed to handle any size file, by
> breaking files larger than a "chunkingThreshold" into a series of multiple
> smaller files that can be reassembled on the other end (by PutFileData)
> >
> > PutFileData
> > ---------------
> > The PutFile processor accepts incoming FlowFiles and writes those files
> to a single target directory.  It does a good job of handling and resolving
> conflicts, but falls short of other requirements that arise for
> general-purpose file processing:
> >
> > -          Does not support directories, only files
> >
> > -          Only supports a single, preconfigured target directory
> >
> > -          Cannot reconstruct an entire directory tree based upon
> relative file paths (all files go into a single target directory)
> >
> > -          Assumes each file is small enough to fit into the content
> repository
> >
> > A modified version of PutFile (currently) named PutFileData has been
> developed and is proposed as the basis for a new Nifi processor that will
> supplement file egress with these features:
> >
> > -          Supports directories and files
> >
> > -          Supports reconstruction of entire directory tree based upon
> relative file paths, enabling reconstruction of an entire directory tree
> originating from GetFileData
> >
> > -          (To be developed) Is designed to handle any size file, by
> reassembling multi-part files into very large files (TB's) that do not fit
> within the content repository
> >
> > Should the community have an interest in these processors (we can name
> them something different, if needed), these contributions are now
> available.  In the meantime, we shall continue developing these processors
> to meet our specific use cases, adding the chunking functionality and QA
> certifying them for production use at scale.
> >
> > Looking forward to comments, feedback and recommendations.
> >
> > Here's the Github repo link again:
> > https://github.com/rickbraddy/nifishare
> >
> > Best,
> > Rick
> >
> > P.S. If there's a better vehicle for communicating these types of
> proposals, please advise.
> >
> >
>
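
For reference, the idea raised in the quoted thread (one large file listed
as N entries that point to the same file on disk but cover different
offset/ranges) might look roughly like the sketch below. This is plain
Python under stated assumptions, not NiFi processor code: ChunkListing,
list_chunks and read_chunk are hypothetical names, and chunk_size stands in
for the proposed "chunkingThreshold".

```python
from typing import Iterator, NamedTuple


class ChunkListing(NamedTuple):
    """One listing entry: a byte range within a file (hypothetical type)."""
    path: str
    offset: int
    length: int


def list_chunks(path: str, file_size: int, chunk_size: int) -> Iterator[ChunkListing]:
    """Yield consecutive, non-overlapping ranges that together cover the file."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < file_size:
        # The final chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        yield ChunkListing(path, offset, length)
        offset += length


def read_chunk(listing: ChunkListing) -> bytes:
    """Fetch only the bytes described by one listing entry."""
    with open(listing.path, "rb") as handle:
        handle.seek(listing.offset)
        return handle.read(listing.length)
```

Each listing is small and independent of the file's content, so only one
chunk's worth of data needs to be held at a time; reassembly on the "Put"
side would write each chunk back at its offset. In this sketch an empty file
produces no listings, which a real implementation would need to handle
separately.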
