[ 
https://issues.apache.org/jira/browse/NIFI-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Villard updated NIFI-15568:
----------------------------------
    Fix Version/s: 2.9.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> Fix Partition by Timestamp in Iceberg Parquet Writer
> ----------------------------------------------------
>
>                 Key: NIFI-15568
>                 URL: https://issues.apache.org/jira/browse/NIFI-15568
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 2.7.0, 2.8.0, 2.7.1, 2.7.2
>            Reporter: Nir Yanay
>            Assignee: David Handermann
>            Priority: Minor
>             Fix For: 2.9.0
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> While working with PutIcebergRecord in NiFi 2.7.2, I encountered two separate 
> issues when writing to Apache Iceberg tables using an on-prem S3-compatible 
> object store and an Iceberg REST catalog.
> h3. *Issue 1: On-Prem S3 Configuration Not Supported by S3FileIOProvider*
> NiFi's default S3IcebergFileIOProvider does not expose the configuration 
> options required to connect to on-prem S3-compatible storage (e.g., MinIO).
> Specifically, it does not allow configuring:
>  * Custom S3 endpoint
>  * Path-style access
>  * Storage class
> As a result, PutIcebergRecord cannot be used with an on-prem S3 backend out 
> of the box. To resolve this, I extended S3IcebergFileIOProvider to support 
> the missing properties, enabling connectivity to on-prem S3-compatible 
> storage systems.
> UPDATE:
> Apologies for the confusion. I later noticed that parts of the on-prem S3 
> support were already addressed in an earlier change.
> The only missing piece for my use case was support for configuring the S3 
> storage class, which I added in this PR.
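
For illustration, the three options listed under Issue 1 map onto catalog properties read by Iceberg's S3FileIO. The sketch below is a minimal example under assumptions, not the code from the PR: the property keys s3.endpoint, s3.path-style-access, and s3.write.storage-class are taken from Iceberg's AWS module as I understand it, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the property map an S3-backed Iceberg FileIO
// could be initialized with for an on-prem, S3-compatible store.
final class S3FileIoPropertiesSketch {

    static Map<String, String> onPremProperties(String endpoint,
                                                boolean pathStyleAccess,
                                                String storageClass) {
        Map<String, String> properties = new LinkedHashMap<>();
        // Assumed Iceberg property keys; verify against the Iceberg version in use.
        properties.put("s3.endpoint", endpoint);
        properties.put("s3.path-style-access", Boolean.toString(pathStyleAccess));
        // Storage class is optional: only set it when a value was configured.
        if (storageClass != null && !storageClass.isBlank()) {
            properties.put("s3.write.storage-class", storageClass);
        }
        return properties;
    }
}
```

Leaving the storage class unset keeps the object store's default, which is why the key is only added when a value is supplied.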
> h3. *Issue 2: Timestamp Type Mismatch Between NiFi and Iceberg*
> After enabling on-prem S3 support, I encountered a timestamp compatibility 
> issue when writing records containing timestamp fields: NiFi represents 
> timestamps as java.sql.Timestamp, while Iceberg represents timestamps as 
> java.time.LocalDateTime ([Find 
> Here|https://github.com/apache/iceberg/blob/730ce29d5cd722b1751a1984d9eabb68542eba39/parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java#L122]).
> h4. Unpartitioned Tables
> Initially, I added a converter to handle the type conversion, which resolved 
> the issue for unpartitioned Iceberg tables.
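
The kind of conversion described above can be sketched with the JDK alone. This is an assumption about the general shape of such a converter, not the one in the PR: the class and method names are hypothetical, and it uses Timestamp.toLocalDateTime(), which keeps the timestamp's local date-time fields rather than normalizing to UTC.

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;

// Hypothetical converter: maps the java.sql.Timestamp values found in a
// record to the java.time.LocalDateTime values the Parquet writer expects.
final class TimestampConverterSketch {

    static Object convert(Object value) {
        if (value instanceof Timestamp) {
            // Preserves nanosecond precision and the local date-time fields.
            return ((Timestamp) value).toLocalDateTime();
        }
        // Non-timestamp values pass through unchanged.
        return value;
    }
}
```

Whether local fields or a UTC-normalized instant is the right choice depends on how the table's timestamp column is defined, so treat the zone handling here as a placeholder.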
> h4. Partitioned Tables
> However, when the timestamp column was used as a partition key, writes 
> failed again. Further investigation showed that Iceberg internally expects 
> timestamp partition key values to be represented as Long in some places and 
> as LocalDateTime in others during the write flow.
> To resolve this, I leveraged Iceberg's InternalRecordWrapper, which 
> correctly handles this dual representation and allows partitioned writes to 
> succeed.
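
To make the dual representation concrete: Iceberg's internal form of a timestamp is a long counting microseconds since 1970-01-01T00:00:00, while the generic data writers work with LocalDateTime. The sketch below shows that mapping using only the JDK. It does not call InternalRecordWrapper or any other Iceberg class, and the class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical illustration of the two timestamp forms used while writing:
// LocalDateTime for data values, long microseconds for partition values.
final class TimestampMicrosSketch {

    private static final LocalDateTime EPOCH =
            LocalDateTime.ofEpochSecond(0, 0, ZoneOffset.UTC);

    // LocalDateTime -> microseconds since the epoch.
    static long toMicros(LocalDateTime value) {
        return ChronoUnit.MICROS.between(EPOCH, value);
    }

    // Microseconds since the epoch -> LocalDateTime.
    static LocalDateTime fromMicros(long micros) {
        return EPOCH.plus(micros, ChronoUnit.MICROS);
    }
}
```

A wrapper that presents the long form to the partitioning code while the original record keeps LocalDateTime for the file writer avoids converting the record twice by hand.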
> *PR*
> I have created a PR with the necessary changes 
> [here|https://github.com/apache/nifi/pull/10877].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)