[
https://issues.apache.org/jira/browse/NIFI-11951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Kimberley updated NIFI-11951:
-----------------------------------
Description:
We have an environment with two separate NiFi clusters, with no direct
connectivity between them. We need to pass lots of small data packets between
these instances.
I've tested the _FlowFile Stream v3_ format with {{MergeContent}} at the sending
end and the same format at the receiving end - it works well, and the automatic
packing of FlowFile attributes is exactly what we need. We have a further
requirement, though: we ultimately need to archive these merged bundles in a
format that is essentially platform agnostic - i.e. one that can be read in its
original form using standard tooling (think Bash/Python scripts) or third-party
applications. The _FlowFile Stream v3_ format isn't really suitable for this, I
believe, as only NiFi can read it. Technically one could invoke the relevant
Java class to read it, but that's not workable where certain third-party tools
are involved.
Avro format is an option; however, the FlowFile attributes are aggregated
(merged or combined, depending on configuration) into a single merged FlowFile.
We need the original attributes preserved for each individual FlowFile within
the merged archive.
The formats I have in mind are TAR and ZIP, both of which are already supported
by {{MergeContent}} and {{UnpackContent}}. The missing part is the storage and
retrieval of FlowFile attributes, which are currently discarded by the relevant
TAR/ZIP implementations of these processors.
My proposal is to extend the basic TAR and ZIP functionality, giving the user
the option of storing FlowFile attributes in files within the archive, using
the FlowFile archive entry name as the base and adding a user-configurable
extension. For instance, MergeContent would produce a file like:
{{merged.zip:}}
{{|_ abc.txt}}
{{|_ abc.txt.attributes}}
{{|_ xyz.txt}}
{{|_ xyz.txt.attributes}}
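A layout like this could be produced (or consumed) entirely with standard tooling, which is the point of the proposal. As a rough sketch in stdlib Python - the {{.attributes}} sidecar naming and the JSON payload format are assumptions from this proposal, not an existing NiFi format:

```python
import json
import zipfile

# Hypothetical content and FlowFile attributes for two entries.
entries = {
    "abc.txt": (b"hello", {"filename": "abc.txt", "mime.type": "text/plain"}),
    "xyz.txt": (b"world", {"filename": "xyz.txt", "mime.type": "text/plain"}),
}

with zipfile.ZipFile("merged.zip", "w") as zf:
    for name, (content, attributes) in entries.items():
        zf.writestr(name, content)
        # Each content entry is followed by a "<name>.attributes" sidecar.
        # JSON is one possible RecordWriter output; CSV would work equally well.
        zf.writestr(name + ".attributes", json.dumps(attributes))
```

Any ZIP-capable tool can then list and read both the content and the sidecar attributes without NiFi being involved.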
Similarly, {{UnpackContent}} would read the archive and, if configured to parse
attributes from files, would for each FlowFile entry read the next file in
sequence as an attributes file, merging any attributes defined in that file
with the attributes the processor already writes.
The user would be able to configure a {{RecordWriter}} in {{MergeContent}} for
writing out attributes, which would provide the flexibility of choosing an
output format (for instance, CSV or JSON). They would likewise be able to
configure a {{RecordReader}} in {{UnpackContent}}.
The extension of attribute storage and retrieval for these common archive
formats would enhance the ability of dataflow admins to store FlowFiles, along
with their attributes, in such a way that they are (and remain) readable by
current and future archiving systems - without being dependent on NiFi.
> Support for FlowFile attributes in TAR and ZIP archives
> -------------------------------------------------------
>
> Key: NIFI-11951
> URL: https://issues.apache.org/jira/browse/NIFI-11951
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Affects Versions: 1.23.0
> Environment: Docker
> Reporter: Peter Kimberley
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)