David Handermann created NIFI-15062:
---------------------------------------
Summary: Add PutIcebergRecord Processor and Controller Services
Key: NIFI-15062
URL: https://issues.apache.org/jira/browse/NIFI-15062
Project: Apache NiFi
Issue Type: New Feature
Components: Extensions
Reporter: David Handermann
Assignee: David Handermann
A new PutIcebergRecord Processor should be added that supports a core set of
Apache Iceberg features and aligns with the extensibility of the Apache Iceberg
ecosystem.
Earlier support for writing Iceberg Records had tight coupling to Apache Hadoop
and Apache Hive Catalogs, along with associated dependencies. The
PutIcebergRecord Processor should build on the foundation of the iceberg-api
library without direct linking to any particular Catalog, FileIO, or
serialization format implementation.
The Java libraries for Apache Iceberg support a decoupled and extensible
structure, with Catalog, FileIO, and Record abstractions in the iceberg-api
library.
Various implementations of these interfaces require specific libraries, which
should be packaged in separate NAR bundles for isolated class loading.
Supporting all possible Apache Iceberg integrations should not be the goal of
the Apache NiFi project, but Controller Service interfaces and bundling should
allow third party implementations for various use cases.
Base modules supporting Apache Iceberg should include the following:
* nifi-iceberg-shared-nar
* nifi-iceberg-service-api
* nifi-iceberg-service-api-nar
* nifi-iceberg-processors
* nifi-iceberg-processors-nar
The shared-nar bundle should include the iceberg-api and iceberg-core
libraries, along with associated dependencies. This provides general
compatibility and avoids unnecessary duplication of common libraries. The
service-api-nar should depend on the shared-nar for proper hierarchical NAR
class loading.
The processors-nar should depend on the service-api-nar, avoiding any
dependency on particular Iceberg implementation classes, which will be provided
through Controller Service implementations.
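The bundle hierarchy described above can be pictured as a chain of class
loaders, each delegating to its parent. The following is only a rough sketch
using plain java.net.URLClassLoader instances (Java 9 or later for named
loaders); it does not use NiFi's actual NAR class loading code, and the class
and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: plain class loaders stand in for NAR bundles.
public class NarHierarchySketch {

    // Walk from a loader up through its parents, collecting loader names,
    // stopping before the loader that represents the framework.
    static List<String> chain(ClassLoader loader, ClassLoader stopAt) {
        List<String> names = new ArrayList<>();
        for (ClassLoader current = loader;
             current != null && current != stopAt;
             current = current.getParent()) {
            names.add(current.getName());
        }
        return names;
    }

    // Build the proposed hierarchy: processors -> service-api -> shared.
    static List<String> describeHierarchy() {
        ClassLoader framework = NarHierarchySketch.class.getClassLoader();
        // Empty URL arrays: the sketch shows only the parent links, not real jars.
        try (URLClassLoader shared = new URLClassLoader(
                     "nifi-iceberg-shared-nar", new URL[0], framework);
             URLClassLoader serviceApi = new URLClassLoader(
                     "nifi-iceberg-service-api-nar", new URL[0], shared);
             URLClassLoader processors = new URLClassLoader(
                     "nifi-iceberg-processors-nar", new URL[0], serviceApi)) {
            return chain(processors, framework);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the three bundle names, child first.
        describeHierarchy().forEach(System.out::println);
    }
}
```

With parent-first delegation, classes from iceberg-api and iceberg-core would
resolve once in the shared loader, so the service API and processors see the
same class definitions.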
For initial support, Controller Service interfaces should provide abstractions
for the following:
* Iceberg Catalog
* Iceberg Writer
* Iceberg FileIO Provider
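As a rough illustration of the three abstractions listed above, the sketch
below shows one possible shape for the interfaces and how a Processor could
depend on them alone. All names and method signatures are assumptions made for
illustration, not the proposed API. SketchTable and SketchFileIO are
simplified stand-ins for the iceberg-api Table and FileIO types so that the
example compiles without Iceberg libraries.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these signatures come from the issue itself.
public class IcebergServiceSketch {

    // Simplified stand-ins for iceberg-api types (assumptions, not the real API).
    interface SketchTable { String location(); }
    interface SketchFileIO { void write(String path, List<String> rows); }

    // Hypothetical Controller Service abstractions.
    interface IcebergCatalog { SketchTable loadTable(String namespace, String name); }
    interface IcebergFileIOProvider { SketchFileIO getFileIO(Map<String, String> properties); }
    interface IcebergWriter { String write(SketchTable table, SketchFileIO fileIO, List<String> records); }

    // Processor-side logic: refers only to the interfaces, never to an implementation.
    static String putRecords(IcebergCatalog catalog, IcebergFileIOProvider provider,
                             IcebergWriter writer, String namespace, String name,
                             List<String> records) {
        SketchTable table = catalog.loadTable(namespace, name);
        SketchFileIO fileIO = provider.getFileIO(Map.of());
        return writer.write(table, fileIO, records);
    }

    // Wire in-memory stand-ins where REST, S3, and Parquet services would go.
    static String demo() {
        Map<String, List<String>> storage = new HashMap<>();
        // Catalog stand-in: returns a table whose location is derived from its name.
        IcebergCatalog catalog = (namespace, name) -> () -> "mem://" + namespace + "/" + name;
        // FileIO stand-in: stores rows in a map keyed by path.
        IcebergFileIOProvider provider = properties -> storage::put;
        // Writer stand-in: writes all records to a single file under the table location.
        IcebergWriter writer = (table, fileIO, records) -> {
            String path = table.location() + "/data-0.txt";
            fileIO.write(path, records);
            return path;
        };
        String path = putRecords(catalog, provider, writer, "db", "events", List.of("a", "b"));
        return path + " -> " + storage.get(path);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // mem://db/events/data-0.txt -> [a, b]
    }
}
```

Because putRecords refers only to the interfaces, swapping the in-memory
stand-ins for other implementations would not require changes to the
Processor code.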
Initial Controller Service implementations should include the following:
* REST Iceberg Catalog
* Parquet Iceberg Writer
* S3 Iceberg FileIO Provider
Supporting REST Iceberg Catalogs should cover a variety of use cases.
Supporting Apache Parquet as the initial File Format addresses the most common
use case, and support for other formats could be considered separately.
As many examples show, S3 is a common FileIO storage solution. Other FileIO
implementations should be considered to cover major service providers, but
focusing on S3 first narrows the scope of the initial implementation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)