[ https://issues.apache.org/jira/browse/NIFI-11951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762833#comment-17762833 ]
Michael W Moser commented on NIFI-11951:
----------------------------------------
[~p-kimberley] Have a look at the MergeContent "Merge Format" choice "FlowFile
Tar, v1". I don't think it supports storing more than one FlowFile per TAR
(which may be why it was replaced), but you could build a generic TAR out of
individual "FlowFile Tar, v1" files.
> Support for FlowFile attributes in TAR and ZIP archives
> -------------------------------------------------------
>
> Key: NIFI-11951
> URL: https://issues.apache.org/jira/browse/NIFI-11951
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Affects Versions: 1.23.0
> Environment: Docker
> Reporter: Peter Kimberley
> Priority: Major
>
> We have an environment with two separate NiFi clusters, with no direct
> connectivity between them. We need to pass lots of small data packets between
> these instances.
> I've tested the _FlowFile Stream v3_ format with MergeContent at the sending
> end, and the same format at the receiving end - works great. Automatic
> packing of FlowFile attributes is exactly what we need. We have another
> requirement though, which is to ultimately archive these merged bundles in a
> format that's essentially platform agnostic - i.e. can be read in its
> original form using standard tooling (think Bash/Python scripts) or
> third-party applications. The _FlowFile Stream v3_ format isn't really
> suitable for this I believe, as only NiFi can read it. I suppose technically
> one could invoke the relevant Java class to read it, but that's not workable
> where certain third-party tools are involved.
> Avro format is an option, however the FlowFile attributes are aggregated
> (merged or combined, depending on configuration) into a single merged
> FlowFile. We need the original attributes preserved for each individual
> FlowFile within the merged archive.
> The formats I have in mind are TAR and ZIP, both of which are already
> supported by {{MergeContent}} and {{UnpackContent}}. The missing part is the
> storage and retrieval of FlowFile attributes, which are currently discarded
> by the relevant TAR/ZIP implementations of these processors.
> My proposal is to extend the basic TAR and ZIP functionality, giving the user
> the option of storing FlowFile attributes in files within the archive, using
> the FlowFile archive entry name as the base and adding a user-configurable
> extension. For instance, MergeContent would produce a file like:
> {{merged.zip:}}
> {{|_ abc.txt}}
> {{|_ abc.txt.attributes}}
> {{|_ xyz.txt}}
> {{|_ xyz.txt.attributes}}
> Similarly, {{UnpackContent}} would read the archive and, if configured to
> parse attributes from files, would treat the file following each FlowFile
> entry as its attributes file, merging any attributes defined there with the
> attributes already written by that processor.
> The user would be able to configure a {{RecordWriter}} in {{MergeContent}}
> for writing out attributes, giving them the flexibility to choose the
> output format (for instance, CSV or JSON). They would likewise be able to
> configure a {{RecordReader}} in {{UnpackContent}}.
> The extension of attribute storage and retrieval for these common archive
> formats would enhance the ability of dataflow admins to store FlowFiles,
> along with their attributes, in such a way that they are (and remain)
> readable by current and future archiving systems - without being dependent on
> NiFi.
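
To illustrate the layout proposed above from the point of view of a standard-tooling consumer, here is a minimal Python sketch using only the standard library. It is not NiFi code and does not reflect how MergeContent or UnpackContent are implemented. The function names `pack` and `unpack`, the `.attributes` default extension, and the choice of JSON for the sidecar files are all assumptions made for illustration. Note also that this sketch pairs each entry with its sidecar by name, rather than by "next file in sequence" as the description suggests.

```python
import io
import json
import zipfile

# Hypothetical default for the user-configurable extension in the proposal.
ATTR_EXT = ".attributes"


def pack(flowfiles, attr_ext=ATTR_EXT):
    """Build a ZIP in memory from (name, content_bytes, attributes_dict) tuples."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content, attributes in flowfiles:
            zf.writestr(name, content)
            # Sidecar entry written immediately after the content entry.
            # JSON is an assumption; the proposal leaves the format configurable.
            zf.writestr(name + attr_ext, json.dumps(attributes, sort_keys=True))
    return buf.getvalue()


def unpack(archive_bytes, attr_ext=ATTR_EXT):
    """Return a list of (name, content_bytes, attributes_dict) tuples."""
    result = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        names = zf.namelist()
        name_set = set(names)
        for name in names:
            # Skip entries that are the sidecar of another entry in the archive.
            if name.endswith(attr_ext) and name[: -len(attr_ext)] in name_set:
                continue
            sidecar = name + attr_ext
            # Entries without a sidecar simply get no attributes.
            attributes = json.loads(zf.read(sidecar)) if sidecar in name_set else {}
            result.append((name, zf.read(name), attributes))
    return result
```

For example, `unpack(pack([("abc.txt", b"hello", {"filename": "abc.txt"})]))` round-trips the content and its attributes, and the resulting archive can also be opened with any ordinary ZIP tool.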