This thread proposes community review/comments of modified versions of GetFile 
and PutFile for potential future adoption by the Nifi community.  For those who 
want to jump straight to the code, here's the review repository location for 
the current version:  https://github.com/rickbraddy/nifishare.

As background, we needed a way to replicate entire directory trees of files via 
Nifi, where multiple directory trees can be specified at run-time as part of an 
overall Nifi graph. As Nifi is rooted in file-based processing, it seems 
reasonable to continue advancing its abilities to ingest, process, transform 
and replicate files in the most flexible manner possible.  While this proposal 
is not the be-all and end-all in that regard, it moves the needle in the right 
direction by making file processing in Nifi more dynamic, enabling flows to 
determine how files (and directories) should be processed, which goes well 
beyond today's basic file ingress/egress process capabilities (which certainly 
have their place and uses).  Whether it's via this proposal and code or 
another, clearly Nifi can benefit from this type of functionality.

Here's a more detailed explanation of the rationale for developing these Nifi 
file processor derivatives and their initial implementation:

GetFileData
----------------
The GetFile processor monitors a single directory tree for file changes and 
creates FlowFiles for every changed file in that configured tree. It does a 
good job of getting files from a configurable folder that need to be injected 
into a graph. However, GetFile falls short of other requirements that arise 
for general-purpose file processing:

-          Operates from a single, pre-configured source directory (not 
dynamically configurable at run-time as part of a flow)

-          Scheduled on a periodic basis only, not event-triggered when there's 
something to do

-          Does not support sending an entire directory tree (only files are 
sent, not directories)

-          Is a "source" processor node only, cannot be used within other Nifi 
flow logic that dynamically determines which files or directories to get and 
send as FlowFiles

-          Assumes each file is smaller than the content repository, which 
causes large files (hundreds of MBs, GBs, or TBs) to overrun or dominate the 
content repository

A modified version of GetFile (currently) named GetFileData has been developed 
and is proposed as the basis for a new Nifi processor that will supplement file 
ingestion with these features:

-          Operates based upon inbound FlowFiles that contain the filesystem 
path to a file or directory

-          Scheduled by incoming FlowFiles containing a file or directory path, 
only runs when there's something to do

-          Supports sending a directory tree as a series of directory and file 
paths; e.g., ExecuteProcess("find /mypath -print") => SplitText(newline) => 
ModifyAttribute(add "file.roodir=/mypath") => GetFileData ...

-          Participates within simple or complex flows to fetch and send files 
and directories

-          (To be developed) Is designed to handle any size file, by breaking 
files larger than a "chunkingThreshold" into a series of multiple smaller files 
that can be reassembled on the other end (by PutFileData)
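
As a rough illustration of the path-driven approach in the list above, the 
following Python sketch (not taken from the nifishare repository) enumerates a 
directory tree as a series of root-relative directory and file paths, much as 
the find/SplitText flow would, and plans how a file larger than a threshold 
might be cut into chunks. The names PathRecord, list_tree and plan_chunks are 
hypothetical, and the chunk size is assumed to equal the chunking threshold, 
which the proposal does not specify.

```python
import os
from typing import Iterator, List, NamedTuple, Tuple


class PathRecord(NamedTuple):
    """One entry of a directory tree; all field names are hypothetical."""
    root_dir: str       # plays the role of the "file.roodir" attribute
    relative_path: str  # path of the entry relative to root_dir
    is_directory: bool


def list_tree(root_dir: str) -> Iterator[PathRecord]:
    """Yield every directory and file below root_dir, parents first.

    Symlinked directories are not followed (os.walk default).
    """
    for current, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # in-place sort keeps the traversal order stable
        rel_current = os.path.relpath(current, root_dir)
        if rel_current != ".":
            # Directories are emitted too, so empty ones can be recreated.
            yield PathRecord(root_dir, rel_current, True)
        for name in sorted(filenames):
            rel = os.path.normpath(os.path.join(rel_current, name))
            yield PathRecord(root_dir, rel, False)


def plan_chunks(file_size: int, chunking_threshold: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering a file of file_size bytes.

    Assumption: files at or below the threshold stay whole, and larger
    files are cut into pieces of at most chunking_threshold bytes.
    """
    if chunking_threshold <= 0:
        raise ValueError("chunking_threshold must be positive")
    if file_size <= chunking_threshold:
        return [(0, file_size)]
    return [(offset, min(chunking_threshold, file_size - offset))
            for offset in range(0, file_size, chunking_threshold)]
```

Each PathRecord corresponds to what one inbound FlowFile would carry: a root 
directory plus a relative path, so the receiving side can rebuild the same 
layout under a different root.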

PutFileData
---------------
The PutFile processor accepts incoming FlowFiles and writes those files to a 
single target directory.  It does a good job of handling and resolving 
conflicts, but falls short of other requirements that arise for general-purpose 
file processing:

-          Does not support directories, only files

-          Only supports a single, preconfigured target directory

-          Cannot reconstruct an entire directory tree based upon relative 
file paths (all files go into a single target directory)

-          Assumes each file is small enough to fit into the content repository

A modified version of PutFile (currently) named PutFileData has been developed 
and is proposed as the basis for a new Nifi processor that will supplement file 
egress with these features:

-          Supports directories and files

-          Supports reconstruction of an entire directory tree based upon 
relative file paths, so that a tree sent by GetFileData can be rebuilt at the 
destination

-          (To be developed) Is designed to handle any size file, by 
reassembling multi-part files into very large files (TBs) that do not fit 
within the content repository
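
To make the egress side concrete, here is a minimal Python sketch (again not 
taken from the nifishare repository) of rebuilding a tree from relative paths 
and reassembling a multi-part file. The function names are hypothetical; the 
sketch assumes the parts arrive as local files in the correct order, overwrites 
existing files rather than applying PutFile's conflict resolution, and adds a 
guard against relative paths that would escape the target root.

```python
import os
import shutil
from typing import Iterable


def resolve_target(target_root: str, relative_path: str) -> str:
    """Join relative_path onto target_root, refusing paths that escape it."""
    root = os.path.realpath(target_root)
    dest = os.path.realpath(os.path.join(root, relative_path))
    if os.path.commonpath([root, dest]) != root:
        raise ValueError(f"path escapes target root: {relative_path!r}")
    return dest


def put_directory(target_root: str, relative_path: str) -> str:
    """Recreate a directory (including an empty one) under target_root."""
    dest = resolve_target(target_root, relative_path)
    os.makedirs(dest, exist_ok=True)
    return dest


def put_file(target_root: str, relative_path: str, content: bytes) -> str:
    """Write content to target_root/relative_path, creating parent dirs."""
    dest = resolve_target(target_root, relative_path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as out:
        out.write(content)
    return dest


def reassemble(part_paths: Iterable[str], dest: str) -> None:
    """Concatenate the parts, in the order given, into dest.

    Data is streamed part by part, so the whole file never has to fit
    in memory at once.
    """
    with open(dest, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

A real processor would also need to decide how parts are identified and 
ordered (for example via FlowFile attributes) and what to do when a part is 
missing; those details are left open here.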

Should the community have an interest in these processors (we can name them 
something different, if needed), these contributions are now available.  In the 
meantime, we shall continue developing these processors to meet our specific 
use cases, adding the chunking functionality and QA-certifying them for 
production use at scale.

Looking forward to comments, feedback and recommendations.

Here's the Github repo link again:  https://github.com/rickbraddy/nifishare

Best,
Rick

P.S. If there's a better vehicle for communicating these types of proposals, 
please advise.

