This is an automated email from the ASF dual-hosted git repository.
yuxia pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git
The following commit(s) were added to refs/heads/main by this push:
new 7ab3ca511 [docs][lake/iceberg] Add a part about streaming union read
in icebeg doc (#1774)
7ab3ca511 is described below
commit 7ab3ca51130c11e48e4cc38c59180c574aaf4687
Author: Junbo Wang <[email protected]>
AuthorDate: Tue Sep 30 18:01:08 2025 +0800
[docs][lake/iceberg] Add a part about streaming union read in icebeg doc
(#1774)
---------
Co-authored-by: luoyuxia <[email protected]>
---
.../integrate-data-lakes/iceberg.md | 38 ++++++++++++++++++++++
.../integrate-data-lakes/paimon.md | 6 ++--
2 files changed, 41 insertions(+), 3 deletions(-)
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
b/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
index b3b11c87c..a51c967c0 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
@@ -406,6 +406,44 @@ All Iceberg tables created by Fluss include three system columns:
## Read Tables
+### Reading with Apache Flink
+
+When a table has the configuration `table.datalake.enabled = 'true'`, its data exists in two layers:
+
+- Fresh data is retained in Fluss
+- Historical data is tiered to Iceberg
+
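+As a minimal sketch of how this two-layer setup is enabled (the table name `fluss_access_log` matches the query example later in this section; the column names are illustrative assumptions, not from the Fluss docs), lakehouse storage is turned on per table via the `table.datalake.enabled` option:
+
+```sql
+-- Illustrative table definition; data is retained in Fluss and tiered to Iceberg
+CREATE TABLE fluss_access_log (
+    page_id BIGINT,
+    visit_count BIGINT
+) WITH (
+    'table.datalake.enabled' = 'true'
+);
+```
+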
+#### Union Read of Data in Fluss and Iceberg
+You can query a combined view of both layers with second-level latency; this is called union read.
+
+##### Prerequisites
+
+You need to place the JARs required by Iceberg to read data into `${FLINK_HOME}/lib`. For detailed dependencies and JAR preparation instructions, refer to [Start Tiering Service to Iceberg](#-start-tiering-service-to-iceberg).
+
+##### Union Read
+
+To read the full dataset, which includes both Fluss (fresh) and Iceberg (historical) data, simply query the table without any suffix. The following example illustrates this:
+
+```sql
+-- Set execution mode to streaming or batch; batch is used here as an example
+SET 'execution.runtime-mode' = 'batch';
+
+-- The query unions data from Fluss and Iceberg
+SELECT SUM(visit_count) FROM fluss_access_log;
+```
+
+Union read supports both batch and streaming modes, using Iceberg for historical data and Fluss for fresh data:
+
+- **Batch mode** (only log table)
+
+- **Streaming mode** (primary key table and log table)
+
+  Flink first reads the latest Iceberg snapshot (tiered via the tiering service), then switches to Fluss starting from the log offset matching that snapshot. This design minimizes Fluss storage requirements (reducing costs) while using Iceberg as the complete historical archive.
+
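+The batch example above has a streaming counterpart; a minimal sketch, using the same table as the earlier example:
+
+```sql
+-- Streaming union read: consumes the latest Iceberg snapshot first,
+-- then continues from the matching Fluss log offset
+SET 'execution.runtime-mode' = 'streaming';
+
+SELECT * FROM fluss_access_log;
+```
+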
+Key behavior for data retention:
+- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Iceberg if previously tiered
+- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Iceberg if previously tiered
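+The retention knobs above are ordinary table options. As a hedged sketch (assuming Flink SQL `ALTER TABLE ... SET` syntax applies to these options and an illustrative retention value):
+
+```sql
+-- Keep Fluss log data for 3 days; older data is served from Iceberg via union read
+ALTER TABLE fluss_access_log SET ('table.log.ttl' = '3d');
+```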
+
### Reading with Other Engines
Since data tiered to Iceberg from Fluss is stored as standard Iceberg tables, you can use any Iceberg-compatible engine. Below is an example using [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/):
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
index a4a642a40..532470ea6 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
@@ -119,9 +119,9 @@ It supports both batch and streaming modes, using Paimon for historical data and
Flink first reads the latest Paimon snapshot (tiered via tiering service), then switches to Fluss starting from the log offset aligned with that snapshot, ensuring exactly-once semantics.
This design enables Fluss to store only a small portion of the dataset in the Fluss cluster, reducing costs, while Paimon serves as the source of complete historical data when needed.
- More precisely, if Fluss log data is removed due to TTL expiration (controlled by the `table.log.ttl` configuration) it can still be read by Flink through its Union Read capability, as long as the data has already been tiered to Paimon.
- For partitioned tables, if a partition is cleaned up (controlled by the `table.auto-partition.num-retention` configuration) the data in that partition remains accessible from Paimon, provided it has been tiered there beforehand.
-
+Key behavior for data retention:
+- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Paimon if previously tiered
+- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Paimon if previously tiered
### Reading with other Engines