Peter Kimberley created NIFI-11951:
--------------------------------------

             Summary: Support for FlowFile attributes in TAR and ZIP archives
                 Key: NIFI-11951
                 URL: https://issues.apache.org/jira/browse/NIFI-11951
             Project: Apache NiFi
          Issue Type: New Feature
          Components: Core Framework
    Affects Versions: 1.23.0
         Environment: Docker
            Reporter: Peter Kimberley


We have an environment with two separate NiFi clusters, with no direct 
connectivity between them. We need to pass lots of small data packets between 
these instances.

I've tested the _FlowFile Stream v3_ format with {{MergeContent}} at the 
sending end and the same format at the receiving end, and it works great. Automatic 
packing of FlowFile attributes is exactly what we need. We have another 
requirement though, which is to ultimately archive these merged bundles in a 
format that's essentially platform agnostic - i.e. can be read in its original 
form using standard tooling (think Bash/Python scripts) or third-party 
applications. I don't believe the _FlowFile Stream v3_ format is suitable for 
this, as only NiFi can read it. I suppose technically one could invoke the 
relevant Java class to read it, but that's not workable where certain 
third-party tools are involved.

Avro format is an option; however, the FlowFile attributes are aggregated 
(merged or combined, depending on configuration) into a single merged FlowFile. 
We need the original attributes preserved for each individual FlowFile within 
the merged archive.

The formats I have in mind are TAR and ZIP, both of which are already supported 
by {{MergeContent}} and {{UnpackContent}}. The missing part is the storage 
and retrieval of FlowFile attributes, which are currently discarded by the 
relevant TAR/ZIP implementations of these processors.

My proposal is to extend the basic TAR and ZIP functionality, giving the user 
the option of storing FlowFile attributes in files within the archive, using 
the FlowFile archive entry name as the base and adding a user-configurable 
extension. For instance, {{MergeContent}} would produce a file like:

{{merged.zip:}}
{{|_ abc.txt}}
{{|_ abc.txt.attributes}}
{{|_ xyz.txt}}
{{|_ xyz.txt.attributes}}
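
To illustrate the layout above outside of NiFi, here is a minimal Python sketch 
that builds such an archive with the standard {{zipfile}} module. It is not how 
{{MergeContent}} would implement this; the function name, the JSON encoding of 
the attributes file and the sample attribute values are all assumptions made 
for the example.

```python
import io
import json
import zipfile


def pack_with_attributes(flowfiles, extension=".attributes"):
    """Build a ZIP in memory where each entry is followed by an attributes file.

    `flowfiles` is a list of (name, content_bytes, attributes_dict) tuples.
    JSON is an assumed encoding; the proposal leaves the format configurable.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content, attributes in flowfiles:
            # Content entry first, then its attributes file right after it.
            archive.writestr(name, content)
            archive.writestr(name + extension, json.dumps(attributes, sort_keys=True))
    return buffer.getvalue()


# Hypothetical sample data mirroring the listing above.
merged = pack_with_attributes([
    ("abc.txt", b"first packet", {"filename": "abc.txt", "source": "cluster-a"}),
    ("xyz.txt", b"second packet", {"filename": "xyz.txt", "source": "cluster-a"}),
])
names = zipfile.ZipFile(io.BytesIO(merged)).namelist()
print(names)
```

The entry order in the resulting archive matches the order of the 
{{writestr}} calls, so each attributes file directly follows its content entry.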

Similarly, {{UnpackContent}}, if configured to parse attributes from files, 
would treat the entry following each FlowFile entry as its attributes file and 
merge the attributes defined there with any existing attributes written by the 
processor.
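
The read side could be sketched along these lines in Python, which also shows 
that standard tooling can consume such an archive. This is an illustration 
only: the function name, the JSON encoding, the {{.attributes}} extension and 
the rule that values from the attributes file override existing ones are 
assumptions, not decided behaviour.

```python
import io
import json
import zipfile


def unpack_with_attributes(data, extension=".attributes"):
    """Return a list of (name, content, attributes) tuples from ZIP bytes.

    An entry is paired with the next entry in sequence only if that entry is
    named <name><extension>; otherwise it keeps just its existing attributes.
    """
    results = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
        i = 0
        while i < len(names):
            name = names[i]
            # Stand-in for attributes the unpacking processor sets itself.
            attributes = {"filename": name}
            if i + 1 < len(names) and names[i + 1] == name + extension:
                # Assumed merge rule: values from the attributes file win.
                attributes.update(json.loads(archive.read(names[i + 1])))
                i += 1  # skip the attributes file; it is not a FlowFile
            results.append((name, archive.read(name), attributes))
            i += 1
    return results
```

Note that this sketch treats an attributes file with no preceding content entry 
as an ordinary entry; a real implementation would need to decide how to handle 
that case.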

The user would be able to configure a {{RecordWriter}} in {{MergeContent}} for 
writing out attributes, which would give them the flexibility to choose the 
output format (for instance, CSV or JSON). They would likewise be able to 
configure a {{RecordReader}} in {{UnpackContent}}.

The extension of attribute storage and retrieval for these common archive 
formats would enhance the ability of dataflow admins to store FlowFiles, along 
with their attributes, in such a way that they are (and remain) readable by 
current and future archiving systems - without being dependent on NiFi.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
