wuchong commented on code in PR #1924:
URL: https://github.com/apache/fluss/pull/1924#discussion_r2510131111
##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -1,15 +1,16 @@
 ---
-title: Real-Time Analytics with Flink (Iceberg)
+title: Build Streaming Lakehouse
 sidebar_position: 2
 ---
 
-# Real-Time Analytics With Flink (Iceberg)
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
 
-This guide will get you up and running with Apache Flink to do real-time analytics, covering some powerful features of Fluss,
-including integrating with Apache Iceberg.
-The guide is derived from [TPC-H](https://www.tpc.org/tpch/) **Q5**.
+import CleanUp from './_shared-cleanup.md';
+import LakeAnalytics from './_shared-lake-analytics.md';
+import CreateTable from './_shared-create-table.md';
 
-For more information on working with Flink, refer to the [Apache Flink Engine](engine-flink/getting-started.md) section.
+This guide will help you set up a basic streaming Lakehouse using Fluss with Paimon or Iceberg.
````

Review Comment:
```suggestion
This guide will help you set up a basic Streaming Lakehouse using Fluss with Paimon or Iceberg, and help you better understand the powerful Union Read feature.
```

##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -23,6 +24,132 @@ We encourage you to use a recent version of Docker and [Compose v2](https://docs
 
 ### Starting required components
 
+<Tabs groupId="lake-tabs">
+  <TabItem value="paimon" label="Paimon" default>
+
+We will use `docker compose` to spin up the required components for this tutorial.
+
+1. Create a working directory for this guide.
+
+```shell
+mkdir fluss-quickstart-flink-paimon
+cd fluss-quickstart-flink-paimon
````

Review Comment:
Can we just name this directory `fluss-quickstart-paimon`? In the future, we want to introduce Trino and other query engines into this quickstart, so binding the name to a specific engine is not feasible. The same applies to the iceberg directory.
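As a sketch of what the suggestion above would look like in the quickstart, the engine-neutral directory names `fluss-quickstart-paimon` and `fluss-quickstart-iceberg` below are the reviewer's proposal, not what the PR currently uses:

```shell
# Hypothetical engine-neutral layout per the review suggestion;
# the directory names are the reviewer's proposal, not the PR's.
mkdir -p fluss-quickstart-paimon fluss-quickstart-iceberg

# Work inside the Paimon variant of the quickstart.
cd fluss-quickstart-paimon
```

This keeps the directory tied to the lake format rather than to Flink, so adding Trino or Spark tabs later would not require a rename.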
##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -285,104 +384,15 @@ SELECT o.order_key, c.mktsegment, n.name FROM fluss_order o
````

Review Comment:
`fluss_order` is not defined on this page. Besides, there are no records in `fluss_customer`, `fluss_order`, and `fluss_nation`, as there is no INSERT INTO job writing to these tables. Also, this query is different from the one in the Iceberg tab; I think we need to keep them consistent.

##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -437,71 +447,48 @@ LEFT JOIN fluss_nation FOR SYSTEM_TIME AS OF o.ptime AS n
     ON c.nation_key = n.nation_key;
 ```
 
-### Real-Time Analytics on Fluss datalake-enabled Tables
-
-The data for the `datalake_enriched_orders` table is stored in Fluss (for real-time data) and Iceberg (for historical data).
-
-When querying the `datalake_enriched_orders` table, Fluss uses a union operation that combines data from both Fluss and Iceberg to provide a complete result set -- combines **real-time** and **historical** data.
-
-If you wish to query only the data stored in Iceberg—offering high-performance access without the overhead of unioning data—you can use the `datalake_enriched_orders$lake` table by appending the `$lake` suffix.
-This approach also enables all the optimizations and features of a Flink Iceberg table source, including [system table](https://iceberg.apache.org/docs/latest/flink-queries/#inspecting-tables) such as `datalake_enriched_orders$lake$snapshots`.
-
+  </TabItem>
+</Tabs>
 
-```sql title="Flink SQL"
--- switch to batch mode
-SET 'execution.runtime-mode' = 'batch';
-```
+### Real-Time Analytics on Fluss datalake-enabled Tables
 
+<Tabs groupId="lake-tabs">
+  <TabItem value="paimon" label="Paimon" default>
-```sql title="Flink SQL"
--- query snapshots in iceberg
-SELECT snapshot_id, operation FROM datalake_enriched_orders$lake$snapshots;
-```
-
-**Sample Output:**
-```shell
-+---------------------+-----------+
-|         snapshot_id | operation |
-+---------------------+-----------+
-| 7792523713868625335 |    append |
-| 7960217942125627573 |    append |
-+---------------------+-----------+
-```
-**Note:** Make sure to wait for the configured `datalake.freshness` (~30s) to complete before querying the snapshots, otherwise the result will be empty.
+<LakeAnalytics name="Paimon"/>
 
-Run the following SQL to do analytics on Iceberg data:
-```sql title="Flink SQL"
--- to sum prices of all orders in iceberg
-SELECT sum(total_price) as sum_price FROM datalake_enriched_orders$lake;
-```
-**Sample Output:**
 ```shell
-+-----------+
-| sum_price |
-+-----------+
-| 432880.93 |
-+-----------+
-```
-
-To achieve results with sub-second data freshness, you can query the table directly, which seamlessly unifies data from both Fluss and Iceberg:
-
-```sql title="Flink SQL"
--- to sum prices of all orders (combining fluss and iceberg data)
-SELECT sum(total_price) as sum_price FROM datalake_enriched_orders;
+docker compose exec taskmanager tree /tmp/paimon/fluss.db
 ```
 **Sample Output:**
 ```shell
-+-----------+
-| sum_price |
-+-----------+
-| 558660.03 |
-+-----------+
-```
-
-You can execute the real-time analytics query multiple times, and the results will vary with each run as new data is continuously written to Fluss in real-time.
+/tmp/paimon/fluss.db
+└── datalake_enriched_orders
+    ├── bucket-0
+    │   ├── changelog-aef1810f-85b2-4eba-8eb8-9b136dec5bdb-0.orc
+    │   └── data-aef1810f-85b2-4eba-8eb8-9b136dec5bdb-1.orc
+    ├── manifest
+    │   ├── manifest-aaa007e1-81a2-40b3-ba1f-9df4528bc402-0
+    │   ├── manifest-aaa007e1-81a2-40b3-ba1f-9df4528bc402-1
+    │   ├── manifest-list-ceb77e1f-7d17-4160-9e1f-f334918c6e0d-0
+    │   ├── manifest-list-ceb77e1f-7d17-4160-9e1f-f334918c6e0d-1
+    │   └── manifest-list-ceb77e1f-7d17-4160-9e1f-f334918c6e0d-2
+    ├── schema
+    │   └── schema-0
+    └── snapshot
+        ├── EARLIEST
+        ├── LATEST
+        └── snapshot-1
+```
+The files adhere to Paimon's standard format, enabling seamless querying with other engines such as [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/paimon_catalog/).
````

Review Comment:
Keep consistent with the Iceberg tab? Use "Trino" and "Spark" here?

##########
website/docs/quickstart/flink-lake.md:
##########

````diff
@@ -1,15 +1,16 @@
 ---
-title: Real-Time Analytics with Flink (Iceberg)
+title: Build Streaming Lakehouse
````

Review Comment:
1. **Rename the title** from `Build Streaming Lakehouse` to `Building a Streaming Lakehouse`; this matches standard quickstart naming conventions.
2. **Consider renaming the file** from `flink-lake.md` to `lakehouse.md` for better clarity and consistency, since the guide now focuses on the broader lakehouse architecture, not just Flink integration.
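For reference, the union-read pattern that the review asks to keep consistent across tabs can be sketched as follows. This is a sketch based on the removed section in the diff above, assuming the same `datalake_enriched_orders` table and a Flink SQL client in batch mode; actual results depend on the running ingestion job:

```sql
-- Switch to batch mode for analytics queries.
SET 'execution.runtime-mode' = 'batch';

-- Lake-only read: append the $lake suffix to query only the data already
-- tiered to Paimon/Iceberg, avoiding the overhead of unioning with Fluss.
SELECT sum(total_price) AS sum_price FROM datalake_enriched_orders$lake;

-- Union read: querying the table directly combines the lake snapshot with
-- the real-time data still in Fluss, giving sub-second freshness.
SELECT sum(total_price) AS sum_price FROM datalake_enriched_orders;
```

The lake-only sum lags the union-read sum by whatever has been written to Fluss since the last `datalake.freshness` tiering cycle, which is what makes running the two queries side by side a useful demonstration in both tabs.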
