wuchong commented on code in PR #1924:
URL: https://github.com/apache/fluss/pull/1924#discussion_r2510131111
##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -1,15 +1,16 @@
 ---
-title: Real-Time Analytics with Flink (Iceberg)
+title: Build Streaming Lakehouse
 sidebar_position: 2
 ---
 
-# Real-Time Analytics With Flink (Iceberg)
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
 
-This guide will get you up and running with Apache Flink to do real-time analytics, covering some powerful features of Fluss,
-including integrating with Apache Iceberg.
-The guide is derived from [TPC-H](https://www.tpc.org/tpch/) **Q5**.
+import CleanUp from './_shared-cleanup.md';
+import LakeAnalytics from './_shared-lake-analytics.md';
+import CreateTable from './_shared-create-table.md';
 
-For more information on working with Flink, refer to the [Apache Flink Engine](engine-flink/getting-started.md) section.
+This guide will help you set up a basic streaming Lakehouse using Fluss with Paimon or Iceberg.
````

Review Comment:
```suggestion
This guide will help you set up a basic Streaming Lakehouse using Fluss with Paimon or Iceberg, and help you better understand the powerful Union Read feature.
```

##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -23,6 +24,132 @@ We encourage you to use a recent version of Docker and [Compose v2](https://docs
 
 ### Starting required components
 
+<Tabs groupId="lake-tabs">
+  <TabItem value="paimon" label="Paimon" default>
+
+We will use `docker compose` to spin up the required components for this tutorial.
+
+1. Create a working directory for this guide.
+
+```shell
+mkdir fluss-quickstart-flink-paimon
+cd fluss-quickstart-flink-paimon
````

Review Comment:
Can we just name this directory `fluss-quickstart-paimon`? In the future, we want to introduce Trino and other query engines into this quickstart, so binding the name to a specific engine is not feasible. The same applies to the iceberg directory.
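As a sketch of what the suggestion above would look like in the quickstart, the engine-neutral directory names `fluss-quickstart-paimon` and `fluss-quickstart-iceberg` below are the reviewer's proposal, not what the PR currently uses:

```shell
# Hypothetical engine-neutral layout per the review suggestion;
# the directory names are the reviewer's proposal, not the PR's.
mkdir -p fluss-quickstart-paimon fluss-quickstart-iceberg

# Work inside the Paimon variant of the quickstart.
cd fluss-quickstart-paimon
```

This keeps the directory tied to the lake format rather than to Flink, so adding Trino or Spark tabs later would not require a rename.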
##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -285,104 +384,15 @@ SELECT o.order_key, c.mktsegment, n.name FROM fluss_order o
````

Review Comment:
`fluss_order` is not defined on this page. Besides, there are no records in `fluss_customer`, `fluss_order`, and `fluss_nation`, as there is no INSERT INTO job writing to these tables. Also, this query is different from the one in the Iceberg tab; I think we need to keep them consistent.

##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -437,71 +447,48 @@ LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF o.ptime AS n
     ON c.nation_key = n.nation_key;
 ```
 
-### Real-Time Analytics on Fluss datalake-enabled Tables
-
-The data for the `datalake_enriched_orders` table is stored in Fluss (for real-time data) and Iceberg (for historical data).
-
-When querying the `datalake_enriched_orders` table, Fluss uses a union operation that combines data from both Fluss and Iceberg to provide a complete result set -- combines **real-time** and **historical** data.
-
-If you wish to query only the data stored in Iceberg—offering high-performance access without the overhead of unioning data—you can use the `datalake_enriched_orders$lake` table by appending the `$lake` suffix.
-This approach also enables all the optimizations and features of a Flink Iceberg table source, including [system table](https://iceberg.apache.org/docs/latest/flink-queries/#inspecting-tables) such as `datalake_enriched_orders$lake$snapshots`.
-
+  </TabItem>
+</Tabs>
 
-```sql title="Flink SQL"
--- switch to batch mode
-SET 'execution.runtime-mode' = 'batch';
-```
+### Real-Time Analytics on Fluss datalake-enabled Tables
 
+<Tabs groupId="lake-tabs">
+  <TabItem value="paimon" label="Paimon" default>
-```sql title="Flink SQL"
--- query snapshots in iceberg
-SELECT snapshot_id, operation FROM datalake_enriched_orders$lake$snapshots;
-```
-
-**Sample Output:**
-```shell
-+---------------------+-----------+
-|         snapshot_id | operation |
-+---------------------+-----------+
-| 7792523713868625335 |    append |
-| 7960217942125627573 |    append |
-+---------------------+-----------+
-```
-**Note:** Make sure to wait for the configured `datalake.freshness` (~30s) to complete before querying the snapshots, otherwise the result will be empty.
+<LakeAnalytics name="Paimon"/>
 
-Run the following SQL to do analytics on Iceberg data:
-```sql title="Flink SQL"
--- to sum prices of all orders in iceberg
-SELECT sum(total_price) as sum_price FROM datalake_enriched_orders$lake;
-```
-**Sample Output:**
 ```shell
-+-----------+
-| sum_price |
-+-----------+
-| 432880.93 |
-+-----------+
-```
-
-To achieve results with sub-second data freshness, you can query the table directly, which seamlessly unifies data from both Fluss and Iceberg:
-
-```sql title="Flink SQL"
--- to sum prices of all orders (combining fluss and iceberg data)
-SELECT sum(total_price) as sum_price FROM datalake_enriched_orders;
+docker compose exec taskmanager tree /tmp/paimon/fluss.db
 ```
 **Sample Output:**
 ```shell
-+-----------+
-| sum_price |
-+-----------+
-| 558660.03 |
-+-----------+
-```
-
-You can execute the real-time analytics query multiple times, and the results will vary with each run as new data is continuously written to Fluss in real-time.
+/tmp/paimon/fluss.db
+└── datalake_enriched_orders
+    ├── bucket-0
+    │   ├── changelog-aef1810f-85b2-4eba-8eb8-9b136dec5bdb-0.orc
+    │   └── data-aef1810f-85b2-4eba-8eb8-9b136dec5bdb-1.orc
+    ├── manifest
+    │   ├── manifest-aaa007e1-81a2-40b3-ba1f-9df4528bc402-0
+    │   ├── manifest-aaa007e1-81a2-40b3-ba1f-9df4528bc402-1
+    │   ├── manifest-list-ceb77e1f-7d17-4160-9e1f-f334918c6e0d-0
+    │   ├── manifest-list-ceb77e1f-7d17-4160-9e1f-f334918c6e0d-1
+    │   └── manifest-list-ceb77e1f-7d17-4160-9e1f-f334918c6e0d-2
+    ├── schema
+    │   └── schema-0
+    └── snapshot
+        ├── EARLIEST
+        ├── LATEST
+        └── snapshot-1
+```
+The files adhere to Paimon's standard format, enabling seamless querying with other engines such as [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/paimon_catalog/).
````

Review Comment:
Keep consistent with the Iceberg tab? Use "Trino" and "Spark" here?

##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -1,15 +1,16 @@
 ---
-title: Real-Time Analytics with Flink (Iceberg)
+title: Build Streaming Lakehouse
````

Review Comment:
1. **Rename the title** from `Build Streaming Lakehouse` to `Building a Streaming Lakehouse`; this matches standard quickstart naming conventions.
2. **Consider renaming the file** from `flink-lake.md` to `lakehouse.md` for better clarity and consistency, since the guide now focuses on the broader lakehouse architecture, not just Flink integration.
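For reference, the union-read pattern that the review asks to keep consistent across tabs can be sketched as follows. This is a sketch based on the removed section in the diff above, assuming the same `datalake_enriched_orders` table and a Flink SQL client in batch mode; actual results depend on the running ingestion job:

```sql
-- Switch to batch mode for analytics queries.
SET 'execution.runtime-mode' = 'batch';

-- Lake-only read: append the $lake suffix to query only the data already
-- tiered to Paimon/Iceberg, avoiding the overhead of unioning with Fluss.
SELECT sum(total_price) AS sum_price FROM datalake_enriched_orders$lake;

-- Union read: querying the table directly combines the lake snapshot with
-- the real-time data still in Fluss, giving sub-second freshness.
SELECT sum(total_price) AS sum_price FROM datalake_enriched_orders;
```

The lake-only sum lags the union-read sum by whatever has been written to Fluss since the last `datalake.freshness` tiering cycle, which is what makes running the two queries side by side a useful demonstration in both tabs.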
