xx789633 commented on code in PR #1640:
URL: https://github.com/apache/fluss/pull/1640#discussion_r2320679590
########## website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md:
##########
@@ -0,0 +1,343 @@
+---
+title: Iceberg
+sidebar_position: 2
+---
+
+# Iceberg
+
+[Apache Iceberg](https://iceberg.apache.org/) is an open table format for huge analytic datasets. It provides ACID transactions, schema evolution, and efficient data organization for data lakes.
+To integrate Fluss with Iceberg, you must enable lakehouse storage and configure Iceberg as the lakehouse storage. For more details, see [Enable Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md#enable-lakehouse-storage).
+
+## Introduction
+
+When a table is created or altered with the option `'table.datalake.enabled' = 'true'` and configured with Iceberg as the datalake format, Fluss automatically creates a corresponding Iceberg table with the same table path.
+The schema of the Iceberg table matches that of the Fluss table, except for the addition of three system columns at the end: `__bucket`, `__offset`, and `__timestamp`.
+These system columns help Fluss clients consume data from Iceberg in a streaming fashion, such as seeking by a specific bucket using an offset or timestamp.
+
+```sql title="Flink SQL"
+USE CATALOG fluss_catalog;
+
+CREATE TABLE fluss_order_with_lake (
+    `order_key` BIGINT,
+    `cust_key` INT NOT NULL,
+    `total_price` DECIMAL(15, 2),
+    `order_date` DATE,
+    `order_priority` STRING,
+    `clerk` STRING,
+    `ptime` AS PROCTIME(),
+    PRIMARY KEY (`order_key`) NOT ENFORCED
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '30s'
+);
+```
+
+The datalake tiering service then continuously tiers data from Fluss to Iceberg. The `table.datalake.freshness` option controls how frequently Fluss writes data to the Iceberg table; by default, the data freshness is 3 minutes.
+For primary key tables, changelogs are also generated in the Iceberg format, enabling stream-based consumption via Iceberg APIs.
+Primary key tables use a merge-on-read (MOR) strategy for efficient updates and deletes.
+
+Since Fluss version 0.8, you can also specify Iceberg table properties when creating a datalake-enabled Fluss table by using the `iceberg.` prefix within the Fluss table properties clause.
+
+```sql title="Flink SQL"
+CREATE TABLE fluss_order_with_lake (
+    `order_key` BIGINT,
+    `cust_key` INT NOT NULL,
+    `total_price` DECIMAL(15, 2),
+    `order_date` DATE,
+    `order_priority` STRING,
+    `clerk` STRING,
+    `ptime` AS PROCTIME(),
+    PRIMARY KEY (`order_key`) NOT ENFORCED
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '30s',
+    'table.datalake.auto-maintenance' = 'true',
+    'iceberg.write.format.default' = 'parquet',
+    'iceberg.commit.retry.num-retries' = '5'
+);
+```
+
+For example, you can set the Iceberg property `write.format.default` to change the file format of the Iceberg table, or `commit.retry.num-retries` to configure retry behavior for commits. The `table.datalake.auto-maintenance` option (`true` by default) enables automatic maintenance tasks such as file compaction and snapshot expiration.
+
+## Table Types and Bucketing Strategy
+
+Fluss uses a special bucketing strategy when integrating with Iceberg to ensure consistent data distribution between the Fluss and Iceberg layers. This enables efficient data access and future union-read capabilities.
+
+### Bucket Strategy
+
+When Iceberg is configured as the datalake format, Fluss uses `IcebergBucketingFunction` to bucket data following Iceberg's bucketing strategy. This ensures:
+- **Data distribution consistency**: The same record goes to the same bucket in both Fluss and Iceberg
+- **Efficient data access**: You can quickly locate data for a specific Fluss bucket within Iceberg
+- **Dynamic enablement**: Tables can be enabled for the datalake without rewriting existing data

Review Comment:
   Does "dynamic enablement" have anything to do with the bucketing strategy?
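As background for the bucketing discussion above: the Iceberg table spec defines its `bucket(N, v)` transform as a 32-bit Murmur3 hash of the value's bytes (integers are hashed as 8-byte little-endian longs), masked to a non-negative value and taken modulo N. The sketch below is an illustrative pure-Python rendering of that spec definition; it is not Fluss's actual `IcebergBucketingFunction` (which is Java), and the function names are hypothetical.

```python
def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 hash, the hash function Iceberg's bucket transform uses."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data) // 4 * 4
    # Process the input in 4-byte little-endian blocks.
    for i in range(0, n, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Mix in the 1-3 remaining tail bytes, if any.
    k = 0
    tail = data[n:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def iceberg_bucket_long(value: int, num_buckets: int) -> int:
    """Iceberg bucket transform for int/long: hash 8 little-endian bytes,
    drop the sign bit, then take the result modulo the bucket count."""
    h = murmur3_x86_32(value.to_bytes(8, "little", signed=True))
    return (h & 0x7FFFFFFF) % num_buckets


# The Iceberg spec lists hash(34) == 2017239379 as a test vector for
# int/long values; the bucket is then that hash modulo the bucket count.
print(iceberg_bucket_long(34, 16))
```

Because both layers apply the same transform to the same key bytes, a record's Fluss bucket and its Iceberg bucket agree, which is what makes per-bucket seeks into the tiered Iceberg data possible.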