This thread invites community review of, and comments on, modified versions of
GetFile and PutFile for potential future adoption by the Nifi community. For those who
want to jump straight to the code, here's the review repository location for
the current version: https://github.com/rickbraddy/nifishare.
As background, we needed a way to replicate entire directory trees of files via
Nifi, where multiple directory trees can be specified at run-time as part of an
overall Nifi graph. As Nifi is rooted in file-based processing, it seems
reasonable to continue advancing its abilities to ingest, process, transform
and replicate files in the most flexible manner possible. While this proposal
is not the be-all and end-all in that regard, it moves the needle in the right
direction by making file processing in Nifi more dynamic, enabling flows to
determine how files (and directories) should be processed. That goes well
beyond today's basic file ingress/egress processing capabilities (which certainly
have their place and uses). Whether it's via this proposal and code or
another, Nifi can clearly benefit from this type of functionality.
Here's a more detailed explanation of the rationale for developing these Nifi
file processor derivatives and their initial implementation:
GetFileData
----------------
The GetFile processor monitors a single directory tree for file changes and
creates FlowFiles for every changed file in that configured tree. It does a
good job of getting files from a configurable folder that need to be injected
into a graph. However, GetFile falls short of other requirements that arise for
general-purpose file processing:
- Operates from a single, pre-configured source directory (not
dynamically configurable at run-time as part of a flow)
- Scheduled on a periodic basis only, not event-triggered when there's
something to do
- Does not support sending an entire directory tree (only files are
sent, not directories)
- Is a "source" processor node only, cannot be used within other Nifi
flow logic that dynamically determines which files or directories to get and
send as FlowFiles
- Assumes each file is smaller than the content repository, which
causes large files (hundreds of MBs, GBs, TBs) to overrun or dominate the
content repository
A modified version of GetFile (currently) named GetFileData has been developed
and is proposed as the basis for a new Nifi processor that will supplement file
ingestion with these features:
- Operates based upon inbound FlowFiles that contain the filesystem
path to a file or directory
- Scheduled by incoming FlowFiles containing a file or directory path,
only runs when there's something to do
- Supports sending a directory tree as a series of directory and file
paths; e.g., ExecuteProcess("find /mypath -print") => SplitText(newline) =>
ModifyAttribute(add "file.roodir=/mypath") => GetFileData ...
- Participates within simple or complex flows to fetch and send files
and directories
- (To be developed) Is designed to handle any size file, by breaking
files larger than a "chunkingThreshold" into a series of multiple smaller files
that can be reassembled on the other end (by PutFileData)
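To make the directory-tree idea concrete, here is a minimal Java sketch of the listing step that the find-based flow above performs: it walks a root directory and emits every directory and file as a path relative to that root. This is an illustration only, not code from the repository; the class name TreeLister, the method listRelative, and the trailing "/" used to mark directories are all assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nifi or the nifishare repository.
public class TreeLister {

    /**
     * Lists every directory and file under root as a path relative to root.
     * Directories carry a trailing "/" so a receiver can tell them apart
     * from files (an assumption of this sketch, not a Nifi convention).
     */
    public static List<String> listRelative(Path root) throws IOException {
        List<String> entries = new ArrayList<>();
        try (Stream<Path> walk = Files.walk(root)) {
            // Sorting keeps output deterministic; a parent sorts before its children.
            walk.sorted().forEach(p -> {
                String rel = root.relativize(p).toString()
                        .replace(File.separatorChar, '/');
                if (!rel.isEmpty()) { // skip the root itself
                    entries.add(Files.isDirectory(p) ? rel + "/" : rel);
                }
            });
        }
        return entries;
    }
}
```

Because empty directories appear in the listing as well, a receiver can recreate them even when they contain no files.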
PutFileData
---------------
The PutFile processor accepts incoming FlowFiles and writes those files to a
single target directory. It does a good job of handling and resolving
conflicts, but falls short of other requirements that arise for general-purpose
file processing:
- Does not support directories, only files
- Only supports a single, preconfigured target directory
- Cannot reconstruct an entire directory tree based upon relative
file paths (all files go into a single target directory)
- Assumes each file is small enough to fit into the content repository
A modified version of PutFile (currently) named PutFileData has been developed
and is proposed as the basis for a new Nifi processor that will supplement file
egress with these features:
- Supports directories and files
- Supports reconstruction of an entire directory tree based upon relative
file paths, including a directory tree originating from GetFileData
- (To be developed) Is designed to handle any size file, by
reassembling multi-part files into very large files (TBs) that do not fit
within the content repository
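As a rough illustration of the tree-reconstruction feature, the Java sketch below resolves a relative path against a target root, creates any missing parent directories, and rejects paths that would land outside the root. It is not the PutFileData implementation; the class TreeWriter and its methods are hypothetical names chosen for this example, and the escape check is an assumption about what a receiver would want.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Nifi or the nifishare repository.
public class TreeWriter {

    /** Resolves relativePath under targetRoot, refusing paths that escape it. */
    public static Path resolveUnderRoot(Path targetRoot, String relativePath) {
        Path root = targetRoot.toAbsolutePath().normalize();
        Path resolved = root.resolve(relativePath).normalize();
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException(
                    "Path escapes target root: " + relativePath);
        }
        return resolved;
    }

    /** Recreates a (possibly empty) directory under the target root. */
    public static Path putDirectory(Path targetRoot, String relativePath)
            throws IOException {
        return Files.createDirectories(resolveUnderRoot(targetRoot, relativePath));
    }

    /** Writes file content under the target root, creating parent directories first. */
    public static Path putFile(Path targetRoot, String relativePath,
                               InputStream content) throws IOException {
        Path dest = resolveUnderRoot(targetRoot, relativePath);
        Files.createDirectories(dest.getParent());
        // Overwriting is only one possible conflict policy; PutFile offers others.
        Files.copy(content, dest, StandardCopyOption.REPLACE_EXISTING);
        return dest;
    }
}
```

Keeping the path check in one place means directory entries and file entries are validated the same way, whichever arrives first.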
Should the community have an interest in these processors (we can name them
something different, if needed), these contributions are now available. In the
meantime, we shall continue developing these processors to meet our specific use
cases, adding the chunking functionality and QA-certifying them for production
use at scale.
Looking forward to comments, feedback and recommendations.
Here's the Github repo link again: https://github.com/rickbraddy/nifishare
Best,
Rick
P.S. If there's a better vehicle for communicating these types of proposals,
please advise.