[ 
https://issues.apache.org/jira/browse/NIFI-11951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762833#comment-17762833
 ] 

Michael W Moser commented on NIFI-11951:
----------------------------------------

[~p-kimberley] Have a look at the "FlowFile Tar, v1" choice for the 
MergeContent "Merge Format". I don't think it supports storing more than one 
FlowFile per TAR (which may be why it was replaced), but you could bundle 
several "FlowFile Tar, v1" files into a generic TAR.

> Support for FlowFile attributes in TAR and ZIP archives
> -------------------------------------------------------
>
>                 Key: NIFI-11951
>                 URL: https://issues.apache.org/jira/browse/NIFI-11951
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>    Affects Versions: 1.23.0
>         Environment: Docker
>            Reporter: Peter Kimberley
>            Priority: Major
>
> We have an environment with two separate NiFi clusters, with no direct 
> connectivity between them. We need to pass lots of small data packets between 
> these instances.
> I've tested the _FlowFile Stream v3_ format with MergeContent at the sending 
> end, and the same format at the receiving end - works great. Automatic 
> packing of FlowFile attributes is exactly what we need. We have another 
> requirement though, which is to ultimately archive these merged bundles in a 
> format that's essentially platform agnostic - i.e. can be read in its 
> original form using standard tooling (think Bash/Python scripts) or 
> third-party applications. The _FlowFile Stream v3_ format isn't really 
> suitable for this I believe, as only NiFi can read it. I suppose technically 
> one could invoke the relevant Java class to read it, but that's not workable 
> where certain third-party tools are involved.
> Avro format is an option; however, the FlowFile attributes are aggregated 
> (merged or combined, depending on configuration) into a single merged 
> FlowFile. We need the original attributes preserved for each individual 
> FlowFile within the merged archive.
> The formats I have in mind are TAR and ZIP, both of which are already 
> supported by {{MergeContent}} and {{UnpackContent}}. The missing part is the 
> storage and retrieval of FlowFile attributes, which are currently discarded 
> by the relevant TAR/ZIP implementations of these processors.
> My proposal is to extend the basic TAR and ZIP functionality, giving the user 
> the option of storing FlowFile attributes in files within the archive, using 
> the FlowFile archive entry name as the base and adding a user-configurable 
> extension. For instance, MergeContent would produce a file like:
> {{merged.zip:}}
> {{|_ abc.txt}}
> {{|_ abc.txt.attributes}}
> {{|_ xyz.txt}}
> {{|_ xyz.txt.attributes}}
> Similarly, {{UnpackContent}}, if configured to parse attributes from files, 
> would treat the file that follows each FlowFile entry as its attributes 
> file and merge any attributes defined in that file with any existing 
> attributes written by that processor.
> The user would be able to configure a {{RecordWriter}} in {{MergeContent}} 
> for writing out attributes, which would provide the flexibility of choosing 
> their output format (for instance, CSV or JSON). They would likewise be able 
> to configure a {{RecordReader}} in {{UnpackContent}}.
> The extension of attribute storage and retrieval for these common archive 
> formats would enhance the ability of dataflow admins to store FlowFiles, 
> along with their attributes, in such a way that they are (and remain) 
> readable by current and future archiving systems - without being dependent on 
> NiFi.
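
For illustration only, the archive layout proposed above could be produced and 
read back with standard tooling roughly as follows. This is a minimal sketch, 
not NiFi code: the function names pack and unpack, the ".attributes" default 
and the choice of JSON for the attributes files are all assumptions made for 
the example. It also pairs each entry with its attributes file by name rather 
than by reading the next file in sequence, which is a deliberate 
simplification of the behaviour described in the issue.

```python
import io
import json
import zipfile

# Hypothetical default for the user-configurable extension from the proposal.
ATTR_SUFFIX = ".attributes"


def pack(flowfiles, suffix=ATTR_SUFFIX):
    """Build an in-memory ZIP from (name, content, attributes) tuples.

    Each content entry is immediately followed by a sidecar entry named
    name + suffix that holds the attributes as JSON (an assumed format).
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content, attributes in flowfiles:
            zf.writestr(name, content)
            zf.writestr(name + suffix, json.dumps(attributes))
    return buf.getvalue()


def unpack(data, suffix=ATTR_SUFFIX):
    """Return {name: (content, attributes)} for every content entry.

    Entries without a sidecar get an empty attribute dict. Pairing is done
    by name, so it does not depend on entry order inside the archive.
    """
    result = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        for name in zf.namelist():
            # Skip sidecars; they are read together with their content entry.
            if name.endswith(suffix) and name[: -len(suffix)] in names:
                continue
            sidecar = name + suffix
            attributes = json.loads(zf.read(sidecar)) if sidecar in names else {}
            result[name] = (zf.read(name), attributes)
    return result
```

Calling pack([("abc.txt", b"hello", {"filename": "abc.txt"})]) yields a ZIP 
containing abc.txt and abc.txt.attributes, and unpack() on those bytes returns 
the content together with its original attributes. Because the attributes 
files are ordinary archive entries, any ZIP tool can list or extract them 
without NiFi.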



--
This message was sent by Atlassian Jira
(v8.20.10#820010)