Copilot commented on code in PR #1924:
URL: https://github.com/apache/fluss/pull/1924#discussion_r2485482982
##########
website/docs/quickstart/_flink-iceberg.mdx:
##########
@@ -0,0 +1,260 @@
+import SharedFlinkSQL from '@site/docs/quickstart/_shared-flink-sql.md';
+import SharedPrerequisites from '@site/docs/quickstart/_shared-prerequisites.md';
+import SharedCleanup from '@site/docs/quickstart/_shared-cleanup.md';
+import SharedStreamingIntoFluss from '@site/docs/quickstart/_shared-streaming-into-fluss.md';
+
+# Real-Time Analytics With Flink (Iceberg)
+
+This guide will get you up and running with Apache Flink to do real-time analytics, covering some powerful features of Fluss,
+including integrating with Apache Iceberg.
+The guide is derived from [TPC-H](https://www.tpc.org/tpch/) **Q5**.
+
+For more information on working with Flink, refer to the [Apache Flink Engine](engine-flink/getting-started.md) section.
+
+## Environment Setup
+
+<SharedPrerequisites/>
+
+### Starting required components
+
+We will use `docker compose` to spin up the required components for this tutorial.
+
+1. Create a working directory for this guide.
+
+```shell
+mkdir fluss-quickstart-flink-iceberg
+cd fluss-quickstart-flink-iceberg
+```
+
+2. Create a `lib` directory and download the required Hadoop jar file:
+
+```shell
+mkdir lib
+wget -O lib/hadoop-apache-3.3.5-2.jar https://repo1.maven.org/maven2/io/trino/hadoop/hadoop-apache/3.3.5-2/hadoop-apache-3.3.5-2.jar
+```
+
+This jar file provides Hadoop 3.3.5 dependencies required for Iceberg's Hadoop catalog integration.
+
+:::info
+The `lib` directory serves as a staging area for additional jars needed by the Fluss coordinator server. The docker-compose configuration (see step 3) mounts this directory and copies all jars to `/opt/fluss/plugins/iceberg/` inside the coordinator container at startup.
+
+You can add more jars to this `lib` directory based on your requirements:
+- **Cloud storage support**: For AWS S3 integration with Iceberg, add the corresponding Iceberg bundle jars (e.g., `iceberg-aws-bundle`)
+- **Custom Hadoop configurations**: Add jars for specific HDFS distributions or custom authentication mechanisms
+- **Other catalog backends**: Add jars needed for alternative Iceberg catalog implementations (e.g., Rest, Hive, Glue)
+
+Any jar placed in the `lib` directory will be automatically loaded by the Fluss coordinator server, making it available for Iceberg integration.
+:::
+
+3. Create a `docker-compose.yml` file with the following content:
+
+
+```yaml
+services:
+  zookeeper:
+    restart: always
+    image: zookeeper:3.9.2
+
+  coordinator-server:
+    image: apache/fluss:$FLUSS_DOCKER_VERSION$
+    depends_on:
+      - zookeeper
+    environment:
+      - |
+        FLUSS_PROPERTIES=
+        zookeeper.address: zookeeper:2181
+        bind.listeners: FLUSS://coordinator-server:9123
+        remote.data.dir: /tmp/fluss/remote-data
+        datalake.format: iceberg
+        datalake.iceberg.type: hadoop
+        datalake.iceberg.warehouse: /tmp/iceberg
+    volumes:
+      - shared-tmpfs:/tmp/iceberg
+      - ./lib:/tmp/lib
+    entrypoint: ["sh", "-c", "cp -v /tmp/lib/*.jar /opt/fluss/plugins/iceberg/ && exec /docker-entrypoint.sh coordinatorServer"]
+
+  tablet-server:
+    image: apache/fluss:$FLUSS_DOCKER_VERSION$
+    command: tabletServer
+    depends_on:
+      - coordinator-server
+    environment:
+      - |
+        FLUSS_PROPERTIES=
+        zookeeper.address: zookeeper:2181
+        bind.listeners: FLUSS://tablet-server:9123
+        data.dir: /tmp/fluss/data
+        remote.data.dir: /tmp/fluss/remote-data
+        kv.snapshot.interval: 0s
+        datalake.format: iceberg
+        datalake.iceberg.type: hadoop
+        datalake.iceberg.warehouse: /tmp/iceberg
+    volumes:
+      - shared-tmpfs:/tmp/iceberg
+
+  jobmanager:
+    image: apache/fluss-quickstart-flink:1.20-$FLUSS_DOCKER_VERSION$
+    ports:
+      - "8083:8081"
+    command: jobmanager
+    environment:
+      - |
+        FLINK_PROPERTIES=
+        jobmanager.rpc.address: jobmanager
+    volumes:
+      - shared-tmpfs:/tmp/iceberg
+
+  taskmanager:
+    image: apache/fluss-quickstart-flink:1.20-$FLUSS_DOCKER_VERSION$
+    depends_on:
+      - jobmanager
+    command: taskmanager
+    environment:
+      - |
+        FLINK_PROPERTIES=
+        jobmanager.rpc.address: jobmanager
+        taskmanager.numberOfTaskSlots: 10
+        taskmanager.memory.process.size: 2048m
+        taskmanager.memory.framework.off-heap.size: 256m
+    volumes:
+      - shared-tmpfs:/tmp/iceberg
+
+volumes:
+  shared-tmpfs:
+    driver: local
+    driver_opts:
+      type: "tmpfs"
+      device: "tmpfs"
+```
+
+The Docker Compose environment consists of the following containers:
+- **Fluss Cluster:** a Fluss `CoordinatorServer`, a Fluss `TabletServer` and a `ZooKeeper` server.
+- **Flink Cluster**: a Flink `JobManager` and a Flink `TaskManager` container to execute queries.
+
+**Note:** The `apache/fluss-quickstart-flink` image is based on [flink:1.20.1-java17](https://hub.docker.com/layers/library/flink/1.20-java17/images/sha256:bf1af6406c4f4ad8faa46efe2b3d0a0bf811d1034849c42c1e3484712bc83505) and
+includes the [fluss-flink](engine-flink/getting-started.md), [iceberg-flink](https://iceberg.apache.org/docs/latest/flink/) and
+[flink-connector-faker](https://flink-packages.org/packages/flink-faker) to simplify this guide.
+
+3. To start all containers, run:

Review Comment:
   The step numbering is incorrect. Step 2 creates the `lib` directory and downloads the Hadoop jar (lines 29-34), but then step 3 appears twice: once for creating the docker-compose.yml file (line 49) and again for starting containers (line 138). The docker-compose.yml creation should be labeled as step 3, and starting containers should be step 4.
   ```suggestion
   4. To start all containers, run:
   ```
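Whichever numbers the steps end up with, the compose file's `entrypoint` override is what stages the Iceberg jars: it copies everything from the mounted `./lib` directory into the coordinator's plugin path before exec'ing the stock entrypoint. A minimal sanity check that the copy ran (not part of the PR; service name and plugin path are taken from the compose file quoted above):

```shell
# Bring the stack up in the background.
docker compose up -d

# List the coordinator's Iceberg plugin directory; the jar(s) staged in
# ./lib should appear here, e.g. hadoop-apache-3.3.5-2.jar.
docker compose exec coordinator-server ls /opt/fluss/plugins/iceberg/

# The `cp -v` in the entrypoint also logs each copied jar at startup.
docker compose logs coordinator-server | grep -i '\.jar'
```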
##########
website/docs/quickstart/_shared-streaming-into-fluss.md:
##########
@@ -0,0 +1,57 @@
+### Streaming into Fluss datalake-enabled tables
+
+By default, tables are created with data lake integration disabled, meaning the Lakehouse Tiering Service will not tier the table's data to the data lake.
+
+To enable lakehouse functionality as a tiered storage solution for a table, you must create the table with the configuration option `table.datalake.enabled = true`.
+Return to the `SQL client` and execute the following SQL statement to create a table with data lake integration enabled:
+```sql title="Flink SQL"
+CREATE TABLE datalake_enriched_orders (
+    `order_key` BIGINT,
+    `cust_key` INT NOT NULL,
+    `total_price` DECIMAL(15, 2),
+    `order_date` DATE,
+    `order_priority` STRING,
+    `clerk` STRING,
+    `cust_name` STRING,
+    `cust_phone` STRING,
+    `cust_acctbal` DECIMAL(15, 2),
+    `cust_mktsegment` STRING,
+    `nation_name` STRING,
+    PRIMARY KEY (`order_key`) NOT ENFORCED

Review Comment:
   The `PRIMARY KEY` definition on the `datalake_enriched_orders` table is new for the Iceberg variant of this quickstart. Looking at the original flink.md (Paimon version), the table had `PRIMARY KEY (order_key) NOT ENFORCED` on line 20, but the original flink-iceberg.md did not have this constraint. The shared file now includes the PRIMARY KEY, which changes the behavior for Iceberg users who previously created tables without primary keys. This could break existing workflows or cause unexpected behavior.
   ```suggestion
       `nation_name` STRING
   ```
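To make the behavioral difference concrete: in Fluss, a `CREATE TABLE` with a `PRIMARY KEY` produces a primary-key table, where rows are upserted by key, while a table without one is an append-only log table. A minimal sketch of the two variants (hypothetical table names, not part of the PR; the `table.datalake.enabled` option comes from the shared file quoted above):

```sql
-- With a primary key: a Fluss primary-key table; writes with the same
-- order_key update the existing row (upsert/changelog semantics).
CREATE TABLE orders_pk (
    `order_key` BIGINT,
    `total_price` DECIMAL(15, 2),
    PRIMARY KEY (`order_key`) NOT ENFORCED
) WITH ('table.datalake.enabled' = 'true');

-- Without a primary key: an append-only log table, which is what the
-- original flink-iceberg.md created before the shared file was introduced.
CREATE TABLE orders_log (
    `order_key` BIGINT,
    `total_price` DECIMAL(15, 2)
) WITH ('table.datalake.enabled' = 'true');
```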
