Peter Kimberley created NIFI-11951:
--------------------------------------
Summary: Support for FlowFile attributes in TAR and ZIP archives
Key: NIFI-11951
URL: https://issues.apache.org/jira/browse/NIFI-11951
Project: Apache NiFi
Issue Type: New Feature
Components: Core Framework
Affects Versions: 1.23.0
Environment: Docker
Reporter: Peter Kimberley
We have an environment with two separate NiFi clusters, with no direct
connectivity between them. We need to pass lots of small data packets between
these instances.
I've tested the _FlowFile Stream v3_ format with {{MergeContent}} at the
sending end, and the same format at the receiving end - works great. Automatic
packing of FlowFile attributes is exactly what we need. We have another
requirement though, which is to ultimately archive these merged bundles in a
format that's essentially platform agnostic - i.e. can be read in its original
form using standard tooling (think Bash/Python scripts) or third-party
applications. The _FlowFile Stream v3_ format isn't really suitable for this I
believe, as only NiFi can read it. I suppose technically one could invoke the
relevant Java class to read it, but that's not workable where certain
third-party tools are involved.
Avro format is an option; however, the FlowFile attributes are aggregated
(merged or combined, depending on configuration) into a single merged FlowFile.
We need the original attributes preserved for each individual FlowFile within
the merged archive.
The formats I have in mind are TAR and ZIP, both of which are already supported
by {{MergeContent}} and {{UnpackContent}}. The missing part is the storage
and retrieval of FlowFile attributes, which are currently discarded by the
relevant TAR/ZIP implementations of these processors.
My proposal is to extend the basic TAR and ZIP functionality, giving the user
the option of storing FlowFile attributes in files within the archive, using
the FlowFile archive entry name as the base and adding a user-configurable
extension. For instance, {{MergeContent}} would produce a file like:
{{merged.zip:}}
{{|_ abc.txt}}
{{|_ abc.txt.attributes}}
{{|_ xyz.txt}}
{{|_ xyz.txt.attributes}}
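To make the layout concrete, here is a minimal Python sketch that writes an archive of this shape using only the standard library. It is an illustration, not NiFi code: the function name {{write_archive_with_attributes}}, the JSON encoding of the attributes, and the sample attribute names are all assumptions.

```python
import io
import json
import zipfile


def write_archive_with_attributes(flowfiles, extension=".attributes"):
    """Build a ZIP in memory where each content entry is followed by a sidecar entry.

    flowfiles: iterable of (entry_name, content_bytes, attributes_dict).
    The sidecar name is the entry name plus the configurable extension.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content, attributes in flowfiles:
            archive.writestr(name, content)
            # JSON is an assumed encoding; the proposal leaves the format to the user.
            archive.writestr(name + extension, json.dumps(attributes, sort_keys=True))
    return buffer.getvalue()


# Hypothetical sample data mirroring the listing above.
merged = write_archive_with_attributes([
    ("abc.txt", b"first payload", {"filename": "abc.txt", "source.system": "cluster-a"}),
    ("xyz.txt", b"second payload", {"filename": "xyz.txt", "source.system": "cluster-a"}),
])
entry_names = zipfile.ZipFile(io.BytesIO(merged)).namelist()
```

Writing each attributes entry immediately after its content entry keeps the pairing recoverable from entry order alone, which matters for stream-oriented readers.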
Similarly, {{UnpackContent}} would read the archive and, if configured to parse
attributes from files, would treat the entry following each FlowFile entry as
its attributes file, merging any attributes defined there with the attributes
already written by the processor.
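The read side could look roughly like the following Python sketch, which also shows that such an archive stays readable with standard tooling. The function name {{read_archive_with_attributes}} and the JSON encoding are assumptions, and so is the merge precedence: here, values from the attributes file override attributes already set, which the proposal does not specify.

```python
import io
import json
import zipfile


def read_archive_with_attributes(data, extension=".attributes"):
    """Return a list of (entry_name, content_bytes, attributes_dict) tuples.

    An entry is paired with the next entry only when that entry's name is
    exactly the content name plus the extension; otherwise it has no sidecar.
    """
    results = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()  # entries in the order they were written
        index = 0
        while index < len(names):
            name = names[index]
            content = archive.read(name)
            # Stand-in for attributes the unpacking step would already have set.
            attributes = {"filename": name}
            index += 1
            if index < len(names) and names[index] == name + extension:
                # Assumed precedence: values from the file win over existing ones.
                attributes.update(json.loads(archive.read(names[index])))
                index += 1
            results.append((name, content, attributes))
    return results


# Hypothetical archive: one entry with an attributes file, one without.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("abc.txt", b"first payload")
    archive.writestr("abc.txt.attributes", json.dumps({"source.system": "cluster-a"}))
    archive.writestr("xyz.txt", b"second payload")
unpacked = read_archive_with_attributes(buffer.getvalue())
```

In this sketch an entry with no attributes file is still emitted with its existing attributes, so archives produced without the option remain readable.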
The user would be able to configure a {{RecordWriter}} in {{MergeContent}} for
writing out attributes, which would give them the flexibility to choose the
output format (for instance, CSV or JSON). They would likewise be able to
configure a {{RecordReader}} in {{UnpackContent}}.
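As a rough idea of what the two example formats might look like for a single FlowFile's attributes, here is a small Python sketch. The helper names and the two-column {{name,value}} CSV layout are assumptions for illustration; the actual output would depend on the configured writer and its schema.

```python
import csv
import io
import json


def attributes_to_json(attributes):
    """Serialize attributes as one JSON object with sorted keys."""
    return json.dumps(attributes, sort_keys=True)


def attributes_to_csv(attributes):
    """Serialize attributes as CSV rows of name,value (an assumed layout)."""
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "value"])
    for key in sorted(attributes):
        writer.writerow([key, attributes[key]])
    return out.getvalue()


def attributes_from_csv(text):
    """Parse the name,value CSV layout back into a dict of strings."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return {row["name"]: row["value"] for row in reader}


# Hypothetical attributes, including a value that needs CSV quoting.
sample = {"filename": "abc.txt", "note": "a,b"}
as_json = attributes_to_json(sample)
as_csv = attributes_to_csv(sample)
round_tripped = attributes_from_csv(as_csv)
```

Either form can be parsed by common scripting languages without NiFi, which is the point of making the format configurable.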
The extension of attribute storage and retrieval for these common archive
formats would enhance the ability of dataflow admins to store FlowFiles, along
with their attributes, in such a way that they are (and remain) readable by
current and future archiving systems - without being dependent on NiFi.