William Dyson created NIFI-12130:
------------------------------------
Summary: PutIceberg: Ability to configure snapshot properties via
dynamic attributes
Key: NIFI-12130
URL: https://issues.apache.org/jira/browse/NIFI-12130
Project: Apache NiFi
Issue Type: New Feature
Components: Extensions
Reporter: William Dyson
*Motivation*
Iceberg's Spark integration allows users to add snapshot properties when
writing data to an Iceberg table, by passing write options prefixed with
"snapshot-property.", like so:
{{df.write}}
{{ .option("write-format", "avro")}}
{{ .option("snapshot-property.key", "value")}}
{{ .insertInto("catalog.db.table") }}
[https://iceberg.apache.org/docs/latest/spark-configuration/#write-options]
These properties can be used to add context to Iceberg snapshots and help users
locate snapshots in recovery scenarios.
In fact, Spark writes automatically record the application ID as
{_}spark.app.id{_}.
Examples of when these properties might be useful include:
* Recording the data source used to produce the new records
* Recording the UUID of the FlowFile used to update the table, so the snapshot
can be matched to NiFi provenance
They can be queried from the table's snapshots metadata table (an Iceberg
feature), where they appear in the summary column.
*Feature request*
It would be great if we could configure PutIceberg to add these properties in a
similar fashion (e.g. using dynamic properties of the form
snapshot-property.*). Continuing with the comparison to Spark, it may also be
worth automatically adding the flowfile UUID as something like
{_}nifi.flowfile.id{_}.
*Further details*
I'm not entirely clued up on the Iceberg API, but it looks like these are set
via the set(property, value) method on SnapshotUpdate (AppendFiles extends this
interface):
[https://iceberg.apache.org/javadoc/master/org/apache/iceberg/SnapshotUpdate.html]
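As a rough illustration of the request, here is a minimal sketch of how the prefixed dynamic properties might be collected and handed to the snapshot update. The class name SnapshotProperties, its methods, and the way the FlowFile UUID is passed in are all hypothetical and not part of NiFi or Iceberg; the setter is abstracted as a BiConsumer so the sketch compiles without the Iceberg dependency.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical helper, not existing NiFi or Iceberg code.
final class SnapshotProperties {
    // Prefix borrowed from the Spark write option convention.
    static final String PREFIX = "snapshot-property.";
    // Key suggested in this ticket for the FlowFile UUID (an assumption).
    static final String FLOWFILE_ID_KEY = "nifi.flowfile.id";

    // Collects dynamic properties that carry the prefix, stripping the prefix
    // from each name. Properties without the prefix, or with an empty name
    // after it, are ignored. A non-null FlowFile UUID is added under
    // FLOWFILE_ID_KEY.
    static Map<String, String> extract(Map<String, String> dynamicProperties,
                                       String flowFileUuid) {
        Map<String, String> summary = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : dynamicProperties.entrySet()) {
            String name = entry.getKey();
            if (name.startsWith(PREFIX) && name.length() > PREFIX.length()) {
                summary.put(name.substring(PREFIX.length()), entry.getValue());
            }
        }
        if (flowFileUuid != null) {
            summary.put(FLOWFILE_ID_KEY, flowFileUuid);
        }
        return summary;
    }

    // Passes every collected property to the given setter.
    static void apply(Map<String, String> summary,
                      BiConsumer<String, String> setter) {
        summary.forEach(setter);
    }
}
```

Inside the processor, assuming an AppendFiles instance named appendFiles, the call would presumably look like SnapshotProperties.apply(summary, appendFiles::set) before appendFiles.commit(), since SnapshotUpdate exposes set(String property, String value).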
--
This message was sent by Atlassian Jira
(v8.20.10#820010)