David Handermann created NIFI-15062:
---------------------------------------

             Summary: Add PutIcebergRecord Processor and Controller Services
                 Key: NIFI-15062
                 URL: https://issues.apache.org/jira/browse/NIFI-15062
             Project: Apache NiFi
          Issue Type: New Feature
          Components: Extensions
            Reporter: David Handermann
            Assignee: David Handermann


A new PutIcebergRecord Processor should be added that supports a core set of 
Apache Iceberg features and aligns with the extensibility of the Apache Iceberg 
ecosystem.

Earlier support for writing Iceberg Records had tight coupling to Apache Hadoop 
and Apache Hive Catalogs, along with associated dependencies. The 
PutIcebergRecord Processor should build on the foundation of the iceberg-api 
library without linking directly to any particular Catalog implementation, 
FileIO implementation, or serialization format.

The Java libraries for Apache Iceberg support a decoupled and extensible 
structure, with Catalog, FileIO, and Record abstractions in the iceberg-api. 
Various implementations of these interfaces require specific libraries, which 
should be packaged in separate NAR bundles for isolated class loading.

Supporting all possible Apache Iceberg integrations should not be the goal of 
the Apache NiFi project, but Controller Service interfaces and bundling should 
allow third-party implementations for various use cases.

Base modules supporting Apache Iceberg should include the following:
 * nifi-iceberg-shared-nar
 * nifi-iceberg-service-api
 * nifi-iceberg-service-api-nar
 * nifi-iceberg-processors
 * nifi-iceberg-processors-nar

The shared-nar bundle should include the iceberg-api and iceberg-core 
libraries, along with associated dependencies. This provides general 
compatibility and avoids unnecessary duplication of common libraries. The 
service-api-nar should depend on the shared-nar so that classes load correctly 
through the NAR hierarchy.

The processors-nar should depend on the service-api-nar, avoiding any 
dependency on particular Iceberg implementation classes, which will be provided 
through Controller Service implementations.
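
The dependency chain described above can be sketched as a small model of 
parent delegation. This is only an illustration: the Bundle record and its 
resolve method are hypothetical and are not NiFi's NAR class loader. The 
parent-first lookup order follows the default delegation of Java class 
loaders and is an assumption about how the bundles would behave.

```java
import java.util.Set;

/** Hypothetical model of NAR parent delegation; not NiFi's actual class loader. */
public class NarChainSketch {

    /** A bundle with the libraries it packages and an optional parent bundle. */
    record Bundle(String name, Set<String> libraries, Bundle parent) {

        /** Returns the name of the bundle that supplies the library, checking parents first, or null. */
        String resolve(String library) {
            if (parent != null) {
                String found = parent.resolve(library);
                if (found != null) {
                    return found;
                }
            }
            return libraries.contains(library) ? name : null;
        }
    }

    /** Builds the chain processors-nar -> service-api-nar -> shared-nar from the module list. */
    static Bundle processorsNar() {
        Bundle shared = new Bundle("nifi-iceberg-shared-nar",
                Set.of("iceberg-api", "iceberg-core"), null);
        Bundle serviceApi = new Bundle("nifi-iceberg-service-api-nar",
                Set.of("nifi-iceberg-service-api"), shared);
        return new Bundle("nifi-iceberg-processors-nar",
                Set.of("nifi-iceberg-processors"), serviceApi);
    }

    public static void main(String[] args) {
        Bundle processors = processorsNar();
        // iceberg-api is supplied by the shared bundle at the top of the chain.
        System.out.println(processors.resolve("iceberg-api"));
        // A format-specific library is not in this chain, so the lookup yields null.
        System.out.println(processors.resolve("iceberg-parquet"));
    }
}
```

In this model the processors bundle sees iceberg-api through its ancestors, 
while format-specific libraries such as iceberg-parquet are absent from the 
chain and would have to come from a separate implementation bundle.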

For initial support, Controller Service interfaces should provide abstractions 
for the following:
 * Iceberg Catalog
 * Iceberg Writer
 * Iceberg FileIO Provider
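
As a rough sketch of how these three abstractions might fit together, the 
following uses plain Java stand-ins. All interface names, method names, and 
the example bucket path are hypothetical; actual Controller Service 
interfaces would extend the NiFi ControllerService interface and would 
presumably expose iceberg-api types rather than strings and maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; all names here are illustrative and not taken from NiFi or Iceberg. */
public class IcebergServicesSketch {

    /** Stand-in for a FileIO Provider service: stores rows at a path. */
    interface FileIoProvider {
        void store(String path, List<Map<String, Object>> rows);
    }

    /** Stand-in for a Catalog service: resolves a table name to its storage location. */
    interface CatalogService {
        String tableLocation(String tableName);
    }

    /** Stand-in for a Writer service: picks the file format and writes rows through the FileIO. */
    interface WriterService {
        String write(String tableLocation, List<Map<String, Object>> rows, FileIoProvider fileIo);
    }

    /** Processor-side logic that depends only on the three interfaces. */
    static String putRecords(String tableName, List<Map<String, Object>> rows,
                             CatalogService catalog, WriterService writer, FileIoProvider fileIo) {
        String location = catalog.tableLocation(tableName);
        return writer.write(location, rows, fileIo);
    }

    /** Wires in-memory stand-ins and returns the path of the written data file. */
    static String demo(List<String> storedPaths) {
        CatalogService catalog = name -> "s3://example-bucket/warehouse/" + name;
        FileIoProvider fileIo = (path, rows) -> storedPaths.add(path);
        WriterService writer = (location, rows, io) -> {
            String path = location + "/data/part-0.parquet";
            io.store(path, rows);
            return path;
        };
        Map<String, Object> row = Map.of("id", 1);
        List<Map<String, Object>> rows = List.of(row);
        return putRecords("events", rows, catalog, writer, fileIo);
    }

    public static void main(String[] args) {
        List<String> stored = new ArrayList<>();
        System.out.println(demo(stored));
    }
}
```

Because putRecords refers only to the interfaces, swapping the catalog, file 
format, or storage backend means supplying a different implementation without 
changing the processor-side code.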

Initial Controller Service implementations should include the following:
 * REST Iceberg Catalog
 * Parquet Iceberg Writer
 * S3 Iceberg FileIO Provider

Supporting REST Iceberg Catalogs should cover a variety of use cases.

Supporting Apache Parquet as the initial File Format addresses the most common 
use case, and support for other formats could be considered separately.

S3 is a common FileIO storage solution with many examples available. Other 
FileIO implementations should be considered to cover major service providers, 
but focusing on S3 first narrows the scope of the initial implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
