This is an automated email from the ASF dual-hosted git repository.
yuxia pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git
The following commit(s) were added to refs/heads/main by this push:
new 7ab3ca511 [docs][lake/iceberg] Add a part about streaming union read
in icebeg doc (#1774)
7ab3ca511 is described below
commit 7ab3ca51130c11e48e4cc38c59180c574aaf4687
Author: Junbo Wang <[email protected]>
AuthorDate: Tue Sep 30 18:01:08 2025 +0800
[docs][lake/iceberg] Add a part about streaming union read in icebeg doc
(#1774)
---------
Co-authored-by: luoyuxia <[email protected]>
---
.../integrate-data-lakes/iceberg.md | 38 ++++++++++++++++++++++
.../integrate-data-lakes/paimon.md | 6 ++--
2 files changed, 41 insertions(+), 3 deletions(-)
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
b/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
index b3b11c87c..a51c967c0 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
@@ -406,6 +406,44 @@ All Iceberg tables created by Fluss include three system columns:
## Read Tables
+### Reading with Apache Flink
+
+When a table has the configuration `table.datalake.enabled = 'true'`, its data exists in two layers:
+
+- Fresh data is retained in Fluss
+- Historical data is tiered to Iceberg
+
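+As a minimal sketch of how this two-layer setup is enabled (the table name `fluss_access_log` matches the query example later in this section; the column names are illustrative assumptions, not from the Fluss docs), lakehouse storage is turned on per table via the `table.datalake.enabled` option:
+
+```sql
+-- Illustrative table definition; data is retained in Fluss and tiered to Iceberg
+CREATE TABLE fluss_access_log (
+    page_id BIGINT,
+    visit_count BIGINT
+) WITH (
+    'table.datalake.enabled' = 'true'
+);
+```
+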
+#### Union Read of Data in Fluss and Iceberg
+You can query a combined view of both layers with second-level latency; this is called union read.
+
+##### Prerequisites
+
+You need to place the JARs required by Iceberg to read data into `${FLINK_HOME}/lib`. For detailed dependencies and JAR preparation instructions, refer to [Start Tiering Service to Iceberg](#-start-tiering-service-to-iceberg).
+
+##### Union Read
+
+To read the full dataset, which includes both Fluss (fresh) and Iceberg (historical) data, simply query the table without any suffix. The following example illustrates this:
+
+```sql
+-- Set execution mode to streaming or batch; batch is used here as an example
+SET 'execution.runtime-mode' = 'batch';
+
+-- The query unions data from Fluss and Iceberg
+SELECT SUM(visit_count) FROM fluss_access_log;
+```
+
+Union read supports both batch and streaming modes, using Iceberg for historical data and Fluss for fresh data:
+
+- **Batch mode** (only log table)
+
+- **Streaming mode** (primary key table and log table)
+
+  Flink first reads the latest Iceberg snapshot (tiered via the tiering service), then switches to Fluss starting from the log offset matching that snapshot. This design minimizes Fluss storage requirements (reducing costs) while using Iceberg as the complete historical archive.
+
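+The batch example above has a streaming counterpart; a minimal sketch, using the same table as the earlier example:
+
+```sql
+-- Streaming union read: consumes the latest Iceberg snapshot first,
+-- then continues from the matching Fluss log offset
+SET 'execution.runtime-mode' = 'streaming';
+
+SELECT * FROM fluss_access_log;
+```
+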
+Key behavior for data retention:
+- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Iceberg if previously tiered
+- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Iceberg if previously tiered
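+The retention knobs above are ordinary table options. As a hedged sketch (assuming Flink SQL `ALTER TABLE ... SET` syntax applies to these options and an illustrative retention value):
+
+```sql
+-- Keep Fluss log data for 3 days; older data is served from Iceberg via union read
+ALTER TABLE fluss_access_log SET ('table.log.ttl' = '3d');
+```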
+
### Reading with Other Engines
Since data tiered to Iceberg from Fluss is stored as standard Iceberg tables, you can use any Iceberg-compatible engine. Below is an example using [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/):
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
index a4a642a40..532470ea6 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
@@ -119,9 +119,9 @@ It supports both batch and streaming modes, using Paimon for historical data and
Flink first reads the latest Paimon snapshot (tiered via tiering service), then switches to Fluss starting from the log offset aligned with that snapshot, ensuring exactly-once semantics.
This design enables Fluss to store only a small portion of the dataset in the Fluss cluster, reducing costs, while Paimon serves as the source of complete historical data when needed.
- More precisely, if Fluss log data is removed due to TTL expiration (controlled by the `table.log.ttl` configuration) it can still be read by Flink through its Union Read capability, as long as the data has already been tiered to Paimon.
- For partitioned tables, if a partition is cleaned up (controlled by the `table.auto-partition.num-retention` configuration) the data in that partition remains accessible from Paimon, provided it has been tiered there beforehand.
-
+Key behavior for data retention:
+- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Paimon if previously tiered
+- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Paimon if previously tiered
### Reading with other Engines