[ https://issues.apache.org/jira/browse/NIFI-11951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Kimberley updated NIFI-11951:
-----------------------------------
    Description: 
We have an environment with two separate NiFi clusters, with no direct 
connectivity between them. We need to pass lots of small data packets between 
these instances.

I've tested the _FlowFile Stream v3_ format with {{MergeContent}} at the sending 
end, and the same format at the receiving end - it works great. Automatic packing 
of FlowFile attributes is exactly what we need. We have another requirement, 
though, which is to ultimately archive these merged bundles in a format that's 
essentially platform agnostic - i.e. one that can be read in its original form 
using standard tooling (think Bash/Python scripts) or third-party applications. 
The _FlowFile Stream v3_ format isn't really suitable for this, I believe, as 
only NiFi can read it. Technically one could invoke the relevant Java class to 
read it, but that's not workable where certain third-party tools are involved.

Avro format is an option; however, the FlowFile attributes are aggregated 
(merged or combined, depending on configuration) into a single merged FlowFile. 
We need the original attributes preserved for each individual FlowFile within 
the merged archive.

The formats I have in mind are TAR and ZIP, both of which are already supported 
by {{MergeContent}} and {{UnpackContent}}. The missing part is the storage and 
retrieval of FlowFile attributes, which are currently discarded by the relevant 
TAR/ZIP implementations of these processors.

My proposal is to extend the basic TAR and ZIP functionality, giving the user 
the option of storing FlowFile attributes in files within the archive, using 
the FlowFile archive entry name as the base and adding a user-configurable 
extension. For instance, {{MergeContent}} would produce an archive like:

{{merged.zip:}}
{{|_ abc.txt}}
{{|_ abc.txt.attributes}}
{{|_ xyz.txt}}
{{|_ xyz.txt.attributes}}
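As a sketch of the "readable by standard tooling" goal, an archive with this layout could be consumed outside NiFi with nothing but the Python standard library. This assumes a JSON-producing RecordWriter and the illustrative {{.attributes}} extension; both are proposal details, not existing NiFi behaviour:

```python
import io
import json
import zipfile

ATTR_EXT = ".attributes"  # hypothetical user-configured extension from the proposal

def read_flowfile_archive(zip_bytes):
    """Pair each content entry with its optional '<name>.attributes'
    sidecar and return (name, content, attributes) tuples."""
    results = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = set(zf.namelist())
        for name in sorted(names):
            if name.endswith(ATTR_EXT):
                continue  # sidecars are consumed alongside their content entry
            attrs = {}
            sidecar = name + ATTR_EXT
            if sidecar in names:
                # assumes a JSON RecordWriter was configured in MergeContent
                attrs = json.loads(zf.read(sidecar))
            results.append((name, zf.read(name), attrs))
    return results

# Build a small archive matching the proposed layout, then read it back.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("abc.txt", "hello")
    zf.writestr("abc.txt" + ATTR_EXT,
                json.dumps({"filename": "abc.txt", "mime.type": "text/plain"}))
    zf.writestr("xyz.txt", "world")
    zf.writestr("xyz.txt" + ATTR_EXT, json.dumps({"filename": "xyz.txt"}))

for name, content, attrs in read_flowfile_archive(buf.getvalue()):
    print(name, attrs)
```

The key point is that no NiFi classes are involved: any scripting language with ZIP and JSON support can recover both content and per-FlowFile attributes.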

Similarly, {{UnpackContent}} would read the archive and, if configured to parse 
attributes from files, would treat the file following each FlowFile entry as its 
attributes file, merging any attributes defined in that file with the attributes 
the processor already writes.

The user would be able to configure a {{RecordWriter}} in {{MergeContent}} for 
writing out attributes, which would provide the flexibility of choosing an 
output format (for instance, CSV or JSON). They would likewise be able to 
configure a {{RecordReader}} in {{UnpackContent}}.
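The same interoperability would hold in the other direction: a third party could produce a TAR archive that a suitably configured {{UnpackContent}} could consume. A minimal sketch, again assuming JSON sidecars and the hypothetical {{.attributes}} extension:

```python
import io
import json
import tarfile

ATTR_EXT = ".attributes"  # hypothetical extension matching the MergeContent config

def add_flowfile(tar, name, content, attributes):
    """Write a content entry followed by its JSON attributes sidecar,
    mimicking what a JSON-based RecordWriter might emit."""
    for entry_name, data in ((name, content),
                             (name + ATTR_EXT, json.dumps(attributes).encode())):
        info = tarfile.TarInfo(entry_name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    # 'filename' and 'path' are standard NiFi core attributes
    add_flowfile(tar, "abc.txt", b"hello", {"filename": "abc.txt", "path": "./"})

buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    print(tar.getnames())
```

Writing the sidecar immediately after its content entry matches the "next file in sequence" pairing described above.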

The extension of attribute storage and retrieval for these common archive 
formats would enhance the ability of dataflow admins to store FlowFiles, along 
with their attributes, in such a way that they are (and remain) readable by 
current and future archiving systems - without being dependent on NiFi.

> Support for FlowFile attributes in TAR and ZIP archives
> -------------------------------------------------------
>
>                 Key: NIFI-11951
>                 URL: https://issues.apache.org/jira/browse/NIFI-11951
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>    Affects Versions: 1.23.0
>         Environment: Docker
>            Reporter: Peter Kimberley
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)