[
https://issues.apache.org/jira/browse/NIFI-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Handermann updated NIFI-15568:
------------------------------------
Status: Patch Available (was: Open)
> Fix Partition by Timestamp in Iceberg Parquet Writer
> ----------------------------------------------------
>
> Key: NIFI-15568
> URL: https://issues.apache.org/jira/browse/NIFI-15568
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 2.7.2, 2.7.1, 2.8.0, 2.7.0
> Reporter: Nir Yanay
> Assignee: David Handermann
> Priority: Minor
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> While working with PutIcebergRecord in NiFi 2.7.2, I encountered two separate
> issues when writing to Apache Iceberg tables using an on-prem S3-compatible
> object store and an Iceberg REST catalog.
> h3. *Issue 1: On-Prem S3 Configuration Not Supported by S3FileIOProvider*
> NiFi's default S3IcebergFileIOProvider does not expose the necessary
> configuration options required to connect to an on-prem S3-compatible storage
> (e.g., MinIO).
> Specifically, it does not allow configuring:
> * Custom S3 endpoint
> * Path-style access
> * Storage class
> As a result, PutIcebergRecord cannot be used with an on-prem S3 backend out
> of the box. To resolve this, I extended S3IcebergFileIOProvider to support
> the missing properties, enabling connectivity to on-prem S3-compatible
> storage systems.
> UPDATE:
> Apologies for the confusion: I later noticed that parts of the on-prem S3
> support were already addressed in an earlier change.
> The only missing piece for my use case was support for configuring the S3
> storage class, which I added in this PR.
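> For reference, connecting Iceberg's S3 FileIO to an on-prem store typically means supplying properties along these lines. This is an illustrative sketch, not the exact NiFi configuration: the endpoint and storage-class values are placeholders, and the property names are my assumption based on Iceberg's S3FileIOProperties.

```properties
# Illustrative Iceberg S3FileIO settings for an on-prem S3-compatible store
# (values are placeholders; property names assumed from Iceberg's S3FileIOProperties)
s3.endpoint=http://minio.internal:9000
s3.path-style-access=true
s3.write.storage-class=STANDARD
```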
> h3. *Issue 2: Timestamp Type Mismatch Between NiFi and Iceberg*
> After enabling on-prem S3 support, I encountered a timestamp compatibility
> issue when writing records containing timestamp fields: NiFi represents
> timestamps as java.sql.Timestamp, while Iceberg represents timestamps as
> java.time.LocalDateTime (see
> [GenericParquetWriter|https://github.com/apache/iceberg/blob/730ce29d5cd722b1751a1984d9eabb68542eba39/parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java#L122]).
> h4. Unpartitioned Tables
> Initially, I added a converter to handle the type conversion, which resolved
> the issue for unpartitioned Iceberg tables.
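> A minimal sketch of the kind of conversion involved (not the exact NiFi code; note that toLocalDateTime() interprets the Timestamp in the JVM's default time zone):

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;

public class TimestampConverter {
    // NiFi hands the writer a java.sql.Timestamp; Iceberg's
    // GenericParquetWriter expects a java.time.LocalDateTime.
    static LocalDateTime toIcebergTimestamp(Timestamp ts) {
        return ts.toLocalDateTime();
    }

    public static void main(String[] args) {
        Timestamp ts = Timestamp.valueOf("2024-01-01 12:30:00");
        System.out.println(toIcebergTimestamp(ts)); // 2024-01-01T12:30
    }
}
```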
> h4. Partitioned Tables
> However, when the timestamp column was used as a partition key, writes failed
> again. Further investigation showed that Iceberg internally expects timestamp
> partition key values to be represented as both Long and LocalDateTime at
> different points in the write path.
> To resolve this, I leveraged Iceberg's InternalRecordWrapper, which
> correctly handles this dual representation and allows partitioned writes to
> succeed.
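> To illustrate the dual representation (a sketch of the equivalent conversion, not the Iceberg internals): partition transforms operate on the timestamp as a Long count of microseconds since the epoch, while the record value itself is a LocalDateTime.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class TimestampRepresentations {
    // Iceberg partition transforms (e.g. day, hour) operate on timestamps
    // as microseconds since 1970-01-01T00:00:00, while the record value
    // is a LocalDateTime; InternalRecordWrapper bridges the two.
    static long toMicros(LocalDateTime value) {
        return ChronoUnit.MICROS.between(LocalDateTime.of(1970, 1, 1, 0, 0), value);
    }

    static LocalDateTime fromMicros(long micros) {
        return LocalDateTime.of(1970, 1, 1, 0, 0).plus(micros, ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        LocalDateTime value = LocalDateTime.of(2024, 1, 1, 12, 30);
        long micros = toMicros(value);
        System.out.println(micros);             // microseconds since the epoch
        System.out.println(fromMicros(micros)); // round-trips to the original value
    }
}
```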
> *PR*
> I have created a PR with the necessary changes
> [here|https://github.com/apache/nifi/pull/10877].
--
This message was sent by Atlassian Jira
(v8.20.10#820010)