xx789633 commented on code in PR #1640:
URL: https://github.com/apache/fluss/pull/1640#discussion_r2320679590
########## website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md:
##########
@@ -0,0 +1,343 @@
+---
+title: Iceberg
+sidebar_position: 2
+---
+
+# Iceberg
+
+[Apache Iceberg](https://iceberg.apache.org/) is an open table format for huge analytic datasets. It provides ACID transactions, schema evolution, and efficient data organization for data lakes.
+To integrate Fluss with Iceberg, you must enable lakehouse storage and configure Iceberg as the lakehouse storage. For more details, see [Enable Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md#enable-lakehouse-storage).
+
+## Introduction
+
+When a table is created or altered with the option `'table.datalake.enabled' = 'true'` and configured with Iceberg as the datalake format, Fluss automatically creates a corresponding Iceberg table with the same table path.
+The schema of the Iceberg table matches that of the Fluss table, except for the addition of three system columns at the end: `__bucket`, `__offset`, and `__timestamp`.
+These system columns help Fluss clients consume data from Iceberg in a streaming fashion, such as seeking by a specific bucket using an offset or timestamp.
+
+```sql title="Flink SQL"
+USE CATALOG fluss_catalog;
+
+CREATE TABLE fluss_order_with_lake (
+    `order_key` BIGINT,
+    `cust_key` INT NOT NULL,
+    `total_price` DECIMAL(15, 2),
+    `order_date` DATE,
+    `order_priority` STRING,
+    `clerk` STRING,
+    `ptime` AS PROCTIME(),
+    PRIMARY KEY (`order_key`) NOT ENFORCED
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '30s'
+);
+```
+
+The datalake tiering service then continuously tiers data from Fluss to Iceberg. The `table.datalake.freshness` option controls how frequently Fluss writes data to the Iceberg table; by default, the data freshness is 3 minutes.
+For primary key tables, changelogs are also generated in the Iceberg format, enabling stream-based consumption via Iceberg APIs.
+Primary key tables use a merge-on-read (MOR) strategy for efficient updates and deletes.
+
+Since Fluss version 0.8, you can also specify Iceberg table properties when creating a datalake-enabled Fluss table by using the `iceberg.` prefix within the Fluss table properties clause.
+
+```sql title="Flink SQL"
+CREATE TABLE fluss_order_with_lake (
+    `order_key` BIGINT,
+    `cust_key` INT NOT NULL,
+    `total_price` DECIMAL(15, 2),
+    `order_date` DATE,
+    `order_priority` STRING,
+    `clerk` STRING,
+    `ptime` AS PROCTIME(),
+    PRIMARY KEY (`order_key`) NOT ENFORCED
+) WITH (
+    'table.datalake.enabled' = 'true',
+    'table.datalake.freshness' = '30s',
+    'table.datalake.auto-maintenance' = 'true',
+    'iceberg.write.format.default' = 'parquet',
+    'iceberg.commit.retry.num-retries' = '5'
+);
+```
+
+For example, you can set the Iceberg property `write.format.default` to change the file format of the Iceberg table, or `commit.retry.num-retries` to configure retry behavior for commits. The `table.datalake.auto-maintenance` option (`true` by default) enables automatic maintenance tasks such as file compaction and snapshot expiration.
+
+## Table Types and Bucketing Strategy
+
+Fluss uses a special bucketing strategy when integrating with Iceberg to ensure consistent data distribution between the Fluss and Iceberg layers. This enables efficient data access and future union-read capabilities.
+
+### Bucket Strategy
+
+When Iceberg is configured as the datalake format, Fluss uses `IcebergBucketingFunction` to bucket data following Iceberg's bucketing strategy. This ensures:
+- **Data distribution consistency**: The same record goes to the same bucket in both Fluss and Iceberg
+- **Efficient data access**: You can quickly locate data for a specific Fluss bucket within Iceberg
+- **Dynamic enablement**: Tables can be enabled for the datalake without rewriting existing data

Review Comment:
   Does "dynamic enablement" have anything to do with the bucketing strategy?
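As background for the bucketing discussion above: the Iceberg table spec defines its `bucket(N, v)` transform as a 32-bit Murmur3 hash of the value's bytes (integers are hashed as 8-byte little-endian longs), masked to a non-negative value and taken modulo N. The sketch below is an illustrative pure-Python rendering of that spec definition; it is not Fluss's actual `IcebergBucketingFunction` (which is Java), and the function names are hypothetical.

```python
def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 hash, the hash function Iceberg's bucket transform uses."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data) // 4 * 4
    # Process the input in 4-byte little-endian blocks.
    for i in range(0, n, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Mix in the 1-3 remaining tail bytes, if any.
    k = 0
    tail = data[n:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def iceberg_bucket_long(value: int, num_buckets: int) -> int:
    """Iceberg bucket transform for int/long: hash 8 little-endian bytes,
    drop the sign bit, then take the result modulo the bucket count."""
    h = murmur3_x86_32(value.to_bytes(8, "little", signed=True))
    return (h & 0x7FFFFFFF) % num_buckets


# The Iceberg spec lists hash(34) == 2017239379 as a test vector for
# int/long values; the bucket is then that hash modulo the bucket count.
print(iceberg_bucket_long(34, 16))
```

Because both layers apply the same transform to the same key bytes, a record's Fluss bucket and its Iceberg bucket agree, which is what makes per-bucket seeks into the tiered Iceberg data possible.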