Copilot commented on code in PR #1924:
URL: https://github.com/apache/fluss/pull/1924#discussion_r2485482982
##########
website/docs/quickstart/_flink-iceberg.mdx:
##########
@@ -0,0 +1,260 @@
+import SharedFlinkSQL from '@site/docs/quickstart/_shared-flink-sql.md';
+import SharedPrerequisites from '@site/docs/quickstart/_shared-prerequisites.md';
+import SharedCleanup from '@site/docs/quickstart/_shared-cleanup.md';
+import SharedStreamingIntoFluss from '@site/docs/quickstart/_shared-streaming-into-fluss.md';
+
+# Real-Time Analytics With Flink (Iceberg)
+
+This guide will get you up and running with Apache Flink to do real-time analytics, covering some powerful features of Fluss,
+including integrating with Apache Iceberg.
+The guide is derived from [TPC-H](https://www.tpc.org/tpch/) **Q5**.
+
+For more information on working with Flink, refer to the [Apache Flink Engine](engine-flink/getting-started.md) section.
+
+## Environment Setup
+
+<SharedPrerequisites/>
+
+### Starting required components
+
+We will use `docker compose` to spin up the required components for this tutorial.
+
+1. Create a working directory for this guide.
+
+```shell
+mkdir fluss-quickstart-flink-iceberg
+cd fluss-quickstart-flink-iceberg
+```
+
+2. Create a `lib` directory and download the required Hadoop jar file:
+
+```shell
+mkdir lib
+wget -O lib/hadoop-apache-3.3.5-2.jar https://repo1.maven.org/maven2/io/trino/hadoop/hadoop-apache/3.3.5-2/hadoop-apache-3.3.5-2.jar
+```
+
+This jar file provides Hadoop 3.3.5 dependencies required for Iceberg's Hadoop catalog integration.
+
+:::info
+The `lib` directory serves as a staging area for additional jars needed by the Fluss coordinator server. The docker-compose configuration (see step 3) mounts this directory and copies all jars to `/opt/fluss/plugins/iceberg/` inside the coordinator container at startup.
+
+You can add more jars to this `lib` directory based on your requirements:
+- **Cloud storage support**: For AWS S3 integration with Iceberg, add the corresponding Iceberg bundle jars (e.g., `iceberg-aws-bundle`)
+- **Custom Hadoop configurations**: Add jars for specific HDFS distributions or custom authentication mechanisms
+- **Other catalog backends**: Add jars needed for alternative Iceberg catalog implementations (e.g., Rest, Hive, Glue)
+
+Any jar placed in the `lib` directory will be automatically loaded by the Fluss coordinator server, making it available for Iceberg integration.
+:::
+
+3. Create a `docker-compose.yml` file with the following content:
+
+
+```yaml
+services:
+  zookeeper:
+    restart: always
+    image: zookeeper:3.9.2
+
+  coordinator-server:
+    image: apache/fluss:$FLUSS_DOCKER_VERSION$
+    depends_on:
+      - zookeeper
+    environment:
+      - |
+        FLUSS_PROPERTIES=
+        zookeeper.address: zookeeper:2181
+        bind.listeners: FLUSS://coordinator-server:9123
+        remote.data.dir: /tmp/fluss/remote-data
+        datalake.format: iceberg
+        datalake.iceberg.type: hadoop
+        datalake.iceberg.warehouse: /tmp/iceberg
+    volumes:
+      - shared-tmpfs:/tmp/iceberg
+      - ./lib:/tmp/lib
+    entrypoint: ["sh", "-c", "cp -v /tmp/lib/*.jar /opt/fluss/plugins/iceberg/ && exec /docker-entrypoint.sh coordinatorServer"]
+
+  tablet-server:
+    image: apache/fluss:$FLUSS_DOCKER_VERSION$
+    command: tabletServer
+    depends_on:
+      - coordinator-server
+    environment:
+      - |
+        FLUSS_PROPERTIES=
+        zookeeper.address: zookeeper:2181
+        bind.listeners: FLUSS://tablet-server:9123
+        data.dir: /tmp/fluss/data
+        remote.data.dir: /tmp/fluss/remote-data
+        kv.snapshot.interval: 0s
+        datalake.format: iceberg
+        datalake.iceberg.type: hadoop
+        datalake.iceberg.warehouse: /tmp/iceberg
+    volumes:
+      - shared-tmpfs:/tmp/iceberg
+
+  jobmanager:
+    image: apache/fluss-quickstart-flink:1.20-$FLUSS_DOCKER_VERSION$
+    ports:
+      - "8083:8081"
+    command: jobmanager
+    environment:
+      - |
+        FLINK_PROPERTIES=
+        jobmanager.rpc.address: jobmanager
+    volumes:
+      - shared-tmpfs:/tmp/iceberg
+
+  taskmanager:
+    image: apache/fluss-quickstart-flink:1.20-$FLUSS_DOCKER_VERSION$
+    depends_on:
+      - jobmanager
+    command: taskmanager
+    environment:
+      - |
+        FLINK_PROPERTIES=
+        jobmanager.rpc.address: jobmanager
+        taskmanager.numberOfTaskSlots: 10
+        taskmanager.memory.process.size: 2048m
+        taskmanager.memory.framework.off-heap.size: 256m
+    volumes:
+      - shared-tmpfs:/tmp/iceberg
+
+volumes:
+  shared-tmpfs:
+    driver: local
+    driver_opts:
+      type: "tmpfs"
+      device: "tmpfs"
+```
+
+The Docker Compose environment consists of the following containers:
+- **Fluss Cluster:** a Fluss `CoordinatorServer`, a Fluss `TabletServer` and a `ZooKeeper` server.
+- **Flink Cluster**: a Flink `JobManager` and a Flink `TaskManager` container to execute queries.
+
+**Note:** The `apache/fluss-quickstart-flink` image is based on [flink:1.20.1-java17](https://hub.docker.com/layers/library/flink/1.20-java17/images/sha256:bf1af6406c4f4ad8faa46efe2b3d0a0bf811d1034849c42c1e3484712bc83505) and
+includes the [fluss-flink](engine-flink/getting-started.md), [iceberg-flink](https://iceberg.apache.org/docs/latest/flink/) and
+[flink-connector-faker](https://flink-packages.org/packages/flink-faker) to simplify this guide.
+
+3. To start all containers, run:

Review Comment:
   The step numbering is incorrect. Step 2 creates the `lib` directory and downloads the Hadoop jar (lines 29-34), but then step 3 appears twice: once for creating the docker-compose.yml file (line 49) and again for starting containers (line 138). The docker-compose.yml creation should be labeled as step 3, and starting containers should be step 4.
   ```suggestion
   4. To start all containers, run:
   ```
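Whichever numbers the steps end up with, the compose file's `entrypoint` override is what stages the Iceberg jars: it copies everything from the mounted `./lib` directory into the coordinator's plugin path before exec'ing the stock entrypoint. A minimal sanity check that the copy ran (not part of the PR; service name and plugin path are taken from the compose file quoted above):

```shell
# Bring the stack up in the background.
docker compose up -d

# List the coordinator's Iceberg plugin directory; the jar(s) staged in
# ./lib should appear here, e.g. hadoop-apache-3.3.5-2.jar.
docker compose exec coordinator-server ls /opt/fluss/plugins/iceberg/

# The `cp -v` in the entrypoint also logs each copied jar at startup.
docker compose logs coordinator-server | grep -i '\.jar'
```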
##########
website/docs/quickstart/_shared-streaming-into-fluss.md:
##########
@@ -0,0 +1,57 @@
+### Streaming into Fluss datalake-enabled tables
+
+By default, tables are created with data lake integration disabled, meaning the Lakehouse Tiering Service will not tier the table's data to the data lake.
+
+To enable lakehouse functionality as a tiered storage solution for a table, you must create the table with the configuration option `table.datalake.enabled = true`.
+Return to the `SQL client` and execute the following SQL statement to create a table with data lake integration enabled:
+```sql title="Flink SQL"
+CREATE TABLE datalake_enriched_orders (
+    `order_key` BIGINT,
+    `cust_key` INT NOT NULL,
+    `total_price` DECIMAL(15, 2),
+    `order_date` DATE,
+    `order_priority` STRING,
+    `clerk` STRING,
+    `cust_name` STRING,
+    `cust_phone` STRING,
+    `cust_acctbal` DECIMAL(15, 2),
+    `cust_mktsegment` STRING,
+    `nation_name` STRING,
+    PRIMARY KEY (`order_key`) NOT ENFORCED

Review Comment:
   The `PRIMARY KEY` definition on the `datalake_enriched_orders` table is new for the Iceberg variant of this quickstart. Looking at the original flink.md (Paimon version), the table had `PRIMARY KEY (order_key) NOT ENFORCED` on line 20, but the original flink-iceberg.md did not have this constraint. The shared file now includes the PRIMARY KEY, which changes the behavior for Iceberg users who previously created tables without primary keys. This could break existing workflows or cause unexpected behavior.
   ```suggestion
       `nation_name` STRING
   ```
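To make the behavioral difference concrete: in Fluss, a `CREATE TABLE` with a `PRIMARY KEY` produces a primary-key table, where rows are upserted by key, while a table without one is an append-only log table. A minimal sketch of the two variants (hypothetical table names, not part of the PR; the `table.datalake.enabled` option comes from the shared file quoted above):

```sql
-- With a primary key: a Fluss primary-key table; writes with the same
-- order_key update the existing row (upsert/changelog semantics).
CREATE TABLE orders_pk (
    `order_key` BIGINT,
    `total_price` DECIMAL(15, 2),
    PRIMARY KEY (`order_key`) NOT ENFORCED
) WITH ('table.datalake.enabled' = 'true');

-- Without a primary key: an append-only log table, which is what the
-- original flink-iceberg.md created before the shared file was introduced.
CREATE TABLE orders_log (
    `order_key` BIGINT,
    `total_price` DECIMAL(15, 2)
) WITH ('table.datalake.enabled' = 'true');
```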
