This is an automated email from the ASF dual-hosted git repository. jark pushed a commit to branch release-0.9 in repository https://gitbox.apache.org/repos/asf/fluss.git
commit 55e57d3d720b8d67b49a8432334f740b2b9b525a Author: Keith Lee <[email protected]> AuthorDate: Wed Feb 11 02:09:10 2026 +0000 [docs] Update Quickstart to Use S3 (via RustFS) Instead of Local File System (#2569) --- docker/quickstart-flink/prepare_build.sh | 4 +- pom.xml | 2 +- website/docs/quickstart/flink.md | 96 ++++++++-- website/docs/quickstart/lakehouse.md | 304 +++++++++++++++++++++++-------- 4 files changed, 318 insertions(+), 88 deletions(-) diff --git a/docker/quickstart-flink/prepare_build.sh b/docker/quickstart-flink/prepare_build.sh index 3dd9f964a..a2904b05d 100755 --- a/docker/quickstart-flink/prepare_build.sh +++ b/docker/quickstart-flink/prepare_build.sh @@ -76,7 +76,7 @@ download_jar() { log_info "Downloading $description..." # Download the file - if ! wget -O "$dest_file" "$url"; then + if ! curl -fL -o "$dest_file" "$url"; then log_error "Failed to download $description from $url" return 1 fi @@ -258,4 +258,4 @@ show_summary() { } # Run main function -main "$@" \ No newline at end of file +main "$@" diff --git a/pom.xml b/pom.xml index e223b86e8..7322422f5 100644 --- a/pom.xml +++ b/pom.xml @@ -1139,4 +1139,4 @@ </pluginManagement> </build> -</project> \ No newline at end of file +</project> diff --git a/website/docs/quickstart/flink.md b/website/docs/quickstart/flink.md index a3dc859a7..b5fd0f93c 100644 --- a/website/docs/quickstart/flink.md +++ b/website/docs/quickstart/flink.md @@ -33,7 +33,7 @@ mkdir fluss-quickstart-flink cd fluss-quickstart-flink ``` -2. Create a `lib` directory and download the required jar files. You can adjust the Flink version as needed. Please make sure to download the compatible versions of [fluss-flink connector jar](/downloads) and [flink-connector-faker](https://github.com/knaufk/flink-faker/releases) +2. Create a `lib` directory and download the required jar files. You can adjust the Flink version as needed. 
Please make sure to download the compatible versions of [fluss-flink connector jar](/downloads), [fluss-fs-s3 jar](/downloads), and [flink-connector-faker](https://github.com/knaufk/flink-faker/releases) ```shell export FLINK_VERSION="1.20" @@ -41,26 +41,58 @@ export FLINK_VERSION="1.20" ```shell mkdir lib -wget -O lib/flink-faker-0.5.3.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar -wget -O "lib/fluss-flink-${FLINK_VERSION}-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-${FLINK_VERSION}/$FLUSS_DOCKER_VERSION$/fluss-flink-${FLINK_VERSION}-$FLUSS_DOCKER_VERSION$.jar" +curl -fL -o lib/flink-faker-0.5.3.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar +curl -fL -o "lib/fluss-flink-${FLINK_VERSION}-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-${FLINK_VERSION}/$FLUSS_DOCKER_VERSION$/fluss-flink-${FLINK_VERSION}-$FLUSS_DOCKER_VERSION$.jar" +curl -fL -o "lib/fluss-fs-s3-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-fs-s3/$FLUSS_DOCKER_VERSION$/fluss-fs-s3-$FLUSS_DOCKER_VERSION$.jar" ``` 3. 
Create a `docker-compose.yml` file with the following content: ```yaml services: + #begin RustFS (S3-compatible storage) + rustfs: + image: rustfs/rustfs:latest + ports: + - "9000:9000" + - "9001:9001" + environment: + - RUSTFS_ACCESS_KEY=rustfsadmin + - RUSTFS_SECRET_KEY=rustfsadmin + - RUSTFS_CONSOLE_ENABLE=true + volumes: + - rustfs-data:/data + command: /data + rustfs-init: + image: minio/mc + depends_on: + - rustfs + entrypoint: > + /bin/sh -c " + until mc alias set rustfs http://rustfs:9000 rustfsadmin rustfsadmin; do + echo 'Waiting for RustFS...'; + sleep 1; + done; + mc mb --ignore-existing rustfs/fluss; + " + #end #begin Fluss cluster coordinator-server: image: apache/fluss:$FLUSS_DOCKER_VERSION$ command: coordinatorServer depends_on: - zookeeper + - rustfs-init environment: - | FLUSS_PROPERTIES= zookeeper.address: zookeeper:2181 bind.listeners: FLUSS://coordinator-server:9123 - remote.data.dir: /tmp/fluss/remote-data + remote.data.dir: s3://fluss/remote-data + s3.endpoint: http://rustfs:9000 + s3.access-key: rustfsadmin + s3.secret-key: rustfsadmin + s3.path-style-access: true tablet-server: image: apache/fluss:$FLUSS_DOCKER_VERSION$ command: tabletServer @@ -72,8 +104,12 @@ services: zookeeper.address: zookeeper:2181 bind.listeners: FLUSS://tablet-server:9123 data.dir: /tmp/fluss/data - remote.data.dir: /tmp/fluss/remote-data - kv.snapshot.interval: 0s + remote.data.dir: s3://fluss/remote-data + s3.endpoint: http://rustfs:9000 + s3.access-key: rustfsadmin + s3.secret-key: rustfsadmin + s3.path-style-access: true + kv.snapshot.interval: 60s zookeeper: restart: always image: zookeeper:3.9.2 @@ -112,33 +148,43 @@ services: - | FLINK_PROPERTIES= jobmanager.rpc.address: jobmanager - rest.address: jobmanager + rest.address: jobmanager entrypoint: ["sh", "-c", "cp -v /tmp/lib/*.jar /opt/flink/lib && exec /docker-entrypoint.sh bin/sql-client.sh"] volumes: - ./lib:/tmp/lib #end + +volumes: + rustfs-data: ``` The Docker Compose environment consists of the 
following containers: +- **RustFS:** an S3-compatible object store used for tiered storage. You can access the RustFS console at http://localhost:9001 with credentials `rustfsadmin/rustfsadmin`. An init container (`rustfs-init`) automatically creates the `fluss` bucket on startup. - **Fluss Cluster:** a Fluss `CoordinatorServer`, a Fluss `TabletServer` and a `ZooKeeper` server. + - The snapshot interval (`kv.snapshot.interval`) is configured as 60 seconds. You may want to tune this for production systems. + - Credentials are configured directly with `s3.access-key` and `s3.secret-key`. Production systems should use a credentials provider chain appropriate for their cloud environment. - **Flink Cluster**: a Flink `JobManager`, a Flink `TaskManager`, and a Flink SQL client container to execute queries. -3. To start all containers, run: +:::tip +[RustFS](https://github.com/rustfs/rustfs) is used as a replacement for S3 in this quickstart example; for a production setup you may want to configure a cloud file system instead. See [here](/maintenance/filesystems/overview.md) for information on how to set up cloud file systems. +::: + +4. To start all containers, run: ```shell docker compose up -d ``` This command automatically starts all the containers defined in the Docker Compose configuration in detached mode. -Run +Run ```shell docker compose ps ``` to check whether all containers are running properly. -You can also visit http://localhost:8083/ to see if Flink is running normally. +5. Verify the setup. You can visit http://localhost:8083/ to see if Flink is running normally. The S3 bucket for Fluss tiered storage is automatically created by the `rustfs-init` service. You can access the RustFS console at http://localhost:9001 with credentials `rustfsadmin/rustfsadmin` to view the `fluss` bucket. 
:::note -- If you want to additionally use an observability stack, follow one of the provided quickstart guides [here](maintenance/observability/quickstart.md) and then continue with this guide. +- If you want to additionally use an observability stack, follow one of the provided quickstart guides [here](/docs/maintenance/observability/quickstart.md) and then continue with this guide. - All the following commands involving `docker compose` should be executed in the created working directory that contains the `docker-compose.yml` file. ::: @@ -399,6 +445,34 @@ The following SQL query should return an empty result. SELECT * FROM fluss_customer WHERE `cust_key` = 1; ``` +### Quitting SQL Client + +The following command quits the Flink SQL Client. +```sql title="Flink SQL" +quit; +``` + +### Remote Storage + +Finally, you can use the following command to view the Primary Key Table snapshot files stored on RustFS: + +```shell +docker run --rm --net=host \ +-e MC_HOST_rustfs=http://rustfsadmin:rustfsadmin@localhost:9000 \ +minio/mc ls --recursive rustfs/fluss/ +``` + +Sample output: +```shell +[2026-02-03 20:28:59 UTC] 26KiB STANDARD remote-data/kv/fluss/enriched_orders-3/0/shared/4f675202-e560-4b8e-9af4-08e9769b4797 +[2026-02-03 20:27:59 UTC] 11KiB STANDARD remote-data/kv/fluss/enriched_orders-3/0/shared/87447c34-81d0-4be5-b4c8-abcea5ce68e9 +[2026-02-03 20:28:59 UTC] 0B STANDARD remote-data/kv/fluss/enriched_orders-3/0/snap-0/ +[2026-02-03 20:28:59 UTC] 1.1KiB STANDARD remote-data/kv/fluss/enriched_orders-3/0/snap-1/_METADATA +[2026-02-03 20:28:59 UTC] 211B STANDARD remote-data/kv/fluss/enriched_orders-3/0/snap-1/aaffa8fc-ddb3-4754-938a-45e28df6d975 +[2026-02-03 20:28:59 UTC] 16B STANDARD remote-data/kv/fluss/enriched_orders-3/0/snap-1/d3c18e43-11ee-4e39-912d-087ca01de0e8 +[2026-02-03 20:28:59 UTC] 6.2KiB STANDARD remote-data/kv/fluss/enriched_orders-3/0/snap-1/ea2f2097-aa9a-4c2a-9e72-530218cd551c +``` + ## Clean up After finishing the tutorial, run `exit` to 
exit Flink SQL CLI Container and then run ```shell diff --git a/website/docs/quickstart/lakehouse.md b/website/docs/quickstart/lakehouse.md index f99175fc2..5a848c502 100644 --- a/website/docs/quickstart/lakehouse.md +++ b/website/docs/quickstart/lakehouse.md @@ -38,26 +38,28 @@ cd fluss-quickstart-paimon mkdir -p lib opt # Flink connectors -wget -O lib/flink-faker-0.5.3.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar -wget -O "lib/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_DOCKER_VERSION$/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar" -wget -O "lib/paimon-flink-1.20-$PAIMON_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/paimon/paimon-flink-1.20/$PAIMON_VERSION$/paimon-flink-1.20-$PAIMON_VERSION$.jar" +curl -fL -o lib/flink-faker-0.5.3.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar +curl -fL -o "lib/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_DOCKER_VERSION$/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar" +curl -fL -o "lib/paimon-flink-1.20-$PAIMON_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/paimon/paimon-flink-1.20/$PAIMON_VERSION$/paimon-flink-1.20-$PAIMON_VERSION$.jar" # Fluss lake plugin -wget -O "lib/fluss-lake-paimon-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_DOCKER_VERSION$/fluss-lake-paimon-$FLUSS_DOCKER_VERSION$.jar" +curl -fL -o "lib/fluss-lake-paimon-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_DOCKER_VERSION$/fluss-lake-paimon-$FLUSS_DOCKER_VERSION$.jar" # Paimon bundle jar -wget -O "lib/paimon-bundle-$PAIMON_VERSION$.jar" "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-bundle/$PAIMON_VERSION$/paimon-bundle-$PAIMON_VERSION$.jar" +curl -fL -o 
"lib/paimon-bundle-$PAIMON_VERSION$.jar" "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-bundle/$PAIMON_VERSION$/paimon-bundle-$PAIMON_VERSION$.jar" # Hadoop bundle jar -wget -O lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-10.0/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar +curl -fL -o lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-10.0/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar + +# AWS S3 support +curl -fL -o "lib/paimon-s3-$PAIMON_VERSION$.jar" "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-s3/$PAIMON_VERSION$/paimon-s3-$PAIMON_VERSION$.jar" # Tiering service -wget -O "opt/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-tiering/$FLUSS_DOCKER_VERSION$/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar" +curl -fL -o "opt/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-tiering/$FLUSS_DOCKER_VERSION$/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar" ``` :::info You can add more jars to this `lib` directory based on your requirements: -- **Cloud storage support**: For AWS S3 integration with Paimon, add the corresponding [paimon-s3](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-s3/$PAIMON_VERSION$/paimon-s3-$PAIMON_VERSION$.jar) - **Other catalog backends**: Add jars needed for alternative Paimon catalog implementations (e.g., Hive, JDBC) ::: @@ -65,23 +67,59 @@ You can add more jars to this `lib` directory based on your requirements: ```yaml services: + #begin RustFS (S3-compatible storage) + rustfs: + image: rustfs/rustfs:latest + ports: + - "9000:9000" + - "9001:9001" + environment: + - RUSTFS_ACCESS_KEY=rustfsadmin + - RUSTFS_SECRET_KEY=rustfsadmin + - RUSTFS_CONSOLE_ENABLE=true + volumes: + - rustfs-data:/data + command: /data + 
rustfs-init: + image: minio/mc + depends_on: + - rustfs + entrypoint: > + /bin/sh -c " + until mc alias set rustfs http://rustfs:9000 rustfsadmin rustfsadmin; do + echo 'Waiting for RustFS...'; + sleep 1; + done; + mc mb --ignore-existing rustfs/fluss; + " + #end coordinator-server: image: apache/fluss:$FLUSS_DOCKER_VERSION$ command: coordinatorServer depends_on: - - zookeeper + zookeeper: + condition: service_started + rustfs-init: + condition: service_completed_successfully environment: - | FLUSS_PROPERTIES= zookeeper.address: zookeeper:2181 bind.listeners: FLUSS://coordinator-server:9123 - remote.data.dir: /tmp/fluss/remote-data + remote.data.dir: s3://fluss/remote-data + s3.endpoint: http://rustfs:9000 + s3.access-key: rustfsadmin + s3.secret-key: rustfsadmin + s3.path.style.access: true datalake.format: paimon datalake.paimon.metastore: filesystem - datalake.paimon.warehouse: /tmp/paimon + datalake.paimon.warehouse: s3://fluss/paimon + datalake.paimon.s3.endpoint: http://rustfs:9000 + datalake.paimon.s3.access-key: rustfsadmin + datalake.paimon.s3.secret-key: rustfsadmin + datalake.paimon.s3.path.style.access: true volumes: - - shared-tmpfs:/tmp/paimon - - shared-tmpfs:/tmp/fluss + - ./lib/paimon-s3-$PAIMON_VERSION$.jar:/opt/fluss/plugins/paimon/paimon-s3-$PAIMON_VERSION$.jar tablet-server: image: apache/fluss:$FLUSS_DOCKER_VERSION$ command: tabletServer @@ -93,14 +131,21 @@ services: zookeeper.address: zookeeper:2181 bind.listeners: FLUSS://tablet-server:9123 data.dir: /tmp/fluss/data - remote.data.dir: /tmp/fluss/remote-data + remote.data.dir: s3://fluss/remote-data + s3.endpoint: http://rustfs:9000 + s3.access-key: rustfsadmin + s3.secret-key: rustfsadmin + s3.path.style.access: true kv.snapshot.interval: 0s datalake.format: paimon datalake.paimon.metastore: filesystem - datalake.paimon.warehouse: /tmp/paimon + datalake.paimon.warehouse: s3://fluss/paimon + datalake.paimon.s3.endpoint: http://rustfs:9000 + datalake.paimon.s3.access-key: rustfsadmin + 
datalake.paimon.s3.secret-key: rustfsadmin + datalake.paimon.s3.path.style.access: true volumes: - - shared-tmpfs:/tmp/paimon - - shared-tmpfs:/tmp/fluss + - ./lib/paimon-s3-$PAIMON_VERSION$.jar:/opt/fluss/plugins/paimon/paimon-s3-$PAIMON_VERSION$.jar zookeeper: restart: always image: zookeeper:3.9.2 @@ -110,8 +155,7 @@ services: - "8083:8081" entrypoint: ["/bin/bash", "-c"] command: > - "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh && - cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true; + "cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true; cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true; /docker-entrypoint.sh jobmanager" environment: @@ -119,8 +163,6 @@ services: FLINK_PROPERTIES= jobmanager.rpc.address: jobmanager volumes: - - shared-tmpfs:/tmp/paimon - - shared-tmpfs:/tmp/fluss - ./lib:/tmp/jars - ./opt:/tmp/opt taskmanager: @@ -129,8 +171,7 @@ services: - jobmanager entrypoint: ["/bin/bash", "-c"] command: > - "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh && - cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true; + "cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true; cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true; /docker-entrypoint.sh taskmanager" environment: @@ -139,23 +180,24 @@ services: jobmanager.rpc.address: jobmanager taskmanager.numberOfTaskSlots: 10 taskmanager.memory.process.size: 2048m + taskmanager.memory.task.off-heap.size: 128m volumes: - - shared-tmpfs:/tmp/paimon - - shared-tmpfs:/tmp/fluss - ./lib:/tmp/jars - ./opt:/tmp/opt volumes: - shared-tmpfs: - driver: local - driver_opts: - type: "tmpfs" - device: "tmpfs" + rustfs-data: ``` The Docker Compose environment consists of the following containers: - **Fluss Cluster:** a Fluss `CoordinatorServer`, a Fluss `TabletServer` and a `ZooKeeper` server. - **Flink Cluster**: a Flink `JobManager` and a Flink `TaskManager` container to execute queries. 
+- **RustFS**: an S3-compatible storage system used both as Fluss remote storage and Paimon's filesystem warehouse. + + +:::tip +[RustFS](https://github.com/rustfs/rustfs) is used as a replacement for S3 in this quickstart example; for a production setup you may want to configure a cloud file system instead. See [here](/maintenance/filesystems/overview.md) for information on how to set up cloud file systems. +::: 4. To start all containers, run: ```shell docker compose up -d ``` @@ -198,25 +240,30 @@ cd fluss-quickstart-iceberg mkdir -p lib opt # Flink connectors -wget -O lib/flink-faker-0.5.3.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar -wget -O "lib/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_DOCKER_VERSION$/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar" -wget -O lib/iceberg-flink-runtime-1.20-1.10.1.jar "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.20/1.10.1/iceberg-flink-runtime-1.20-1.10.1.jar" +curl -fL -o lib/flink-faker-0.5.3.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar +curl -fL -o "lib/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_DOCKER_VERSION$/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar" +curl -fL -o lib/iceberg-flink-runtime-1.20-1.10.1.jar https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.20/1.10.1/iceberg-flink-runtime-1.20-1.10.1.jar # Fluss lake plugin -wget -O "lib/fluss-lake-iceberg-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-iceberg/$FLUSS_DOCKER_VERSION$/fluss-lake-iceberg-$FLUSS_DOCKER_VERSION$.jar" +curl -fL -o "lib/fluss-lake-iceberg-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-iceberg/$FLUSS_DOCKER_VERSION$/fluss-lake-iceberg-$FLUSS_DOCKER_VERSION$.jar" + +# Iceberg AWS support (S3FileIO + AWS SDK)
+curl -fL -o lib/iceberg-aws-1.10.1.jar https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws/1.10.1/iceberg-aws-1.10.1.jar +curl -fL -o lib/iceberg-aws-bundle-1.10.1.jar https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.10.1/iceberg-aws-bundle-1.10.1.jar -# Hadoop filesystem support -wget -O lib/hadoop-apache-3.3.5-2.jar https://repo1.maven.org/maven2/io/trino/hadoop/hadoop-apache/3.3.5-2/hadoop-apache-3.3.5-2.jar -wget -O lib/failsafe-3.3.2.jar https://repo1.maven.org/maven2/dev/failsafe/failsafe/3.3.2/failsafe-3.3.2.jar +# JDBC catalog driver +curl -fL -o lib/postgresql-42.7.4.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.4/postgresql-42.7.4.jar + +# Hadoop client (required by Iceberg's Flink integration) +curl -fL -o lib/hadoop-client-api-3.3.5.jar https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client-api/3.3.5/hadoop-client-api-3.3.5.jar +curl -fL -o lib/hadoop-client-runtime-3.3.5.jar https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client-runtime/3.3.5/hadoop-client-runtime-3.3.5.jar # Tiering service -wget -O "opt/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-tiering/$FLUSS_DOCKER_VERSION$/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar" +curl -fL -o "opt/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-tiering/$FLUSS_DOCKER_VERSION$/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar" ``` :::info You can add more jars to this `lib` directory based on your requirements: -- **Cloud storage support**: For AWS S3 integration with Iceberg, add the corresponding Iceberg bundle jars (e.g., `iceberg-aws-bundle`) -- **Custom Hadoop configurations**: Add jars for specific HDFS distributions or custom authentication mechanisms - **Other catalog backends**: Add jars needed for alternative Iceberg catalog implementations (e.g., Rest, Hive, Glue) ::: @@ -224,25 +271,81 @@ You can add more 
jars to this `lib` directory based on your requirements: ```yaml services: + #begin RustFS (S3-compatible storage) + rustfs: + image: rustfs/rustfs:latest + ports: + - "9000:9000" + - "9001:9001" + environment: + - RUSTFS_ACCESS_KEY=rustfsadmin + - RUSTFS_SECRET_KEY=rustfsadmin + - RUSTFS_CONSOLE_ENABLE=true + volumes: + - rustfs-data:/data + command: /data + rustfs-init: + image: minio/mc + depends_on: + - rustfs + entrypoint: > + /bin/sh -c " + until mc alias set rustfs http://rustfs:9000 rustfsadmin rustfsadmin; do + echo 'Waiting for RustFS...'; + sleep 1; + done; + mc mb --ignore-existing rustfs/fluss; + " + #end + postgres: + image: postgres:17 + environment: + - POSTGRES_USER=iceberg + - POSTGRES_PASSWORD=iceberg + - POSTGRES_DB=iceberg + healthcheck: + test: ["CMD-SHELL", "pg_isready -U iceberg"] + interval: 3s + timeout: 3s + retries: 5 coordinator-server: image: apache/fluss:$FLUSS_DOCKER_VERSION$ command: coordinatorServer depends_on: - - zookeeper + postgres: + condition: service_healthy + zookeeper: + condition: service_started + rustfs-init: + condition: service_completed_successfully environment: - | FLUSS_PROPERTIES= zookeeper.address: zookeeper:2181 bind.listeners: FLUSS://coordinator-server:9123 - remote.data.dir: /tmp/fluss/remote-data + remote.data.dir: s3://fluss/remote-data + s3.endpoint: http://rustfs:9000 + s3.access-key: rustfsadmin + s3.secret-key: rustfsadmin + s3.path.style.access: true datalake.format: iceberg - datalake.iceberg.type: hadoop - datalake.iceberg.warehouse: /tmp/iceberg + datalake.iceberg.catalog-impl: org.apache.iceberg.jdbc.JdbcCatalog + datalake.iceberg.name: fluss_catalog + datalake.iceberg.uri: jdbc:postgresql://postgres:5432/iceberg + datalake.iceberg.jdbc.user: iceberg + datalake.iceberg.jdbc.password: iceberg + datalake.iceberg.warehouse: s3://fluss/iceberg + datalake.iceberg.io-impl: org.apache.iceberg.aws.s3.S3FileIO + datalake.iceberg.s3.endpoint: http://rustfs:9000 + datalake.iceberg.s3.access-key-id: 
rustfsadmin + datalake.iceberg.s3.secret-access-key: rustfsadmin + datalake.iceberg.s3.path-style-access: true + datalake.iceberg.client.region: us-east-1 volumes: - - shared-tmpfs:/tmp/iceberg - - shared-tmpfs:/tmp/fluss - ./lib/fluss-lake-iceberg-$FLUSS_DOCKER_VERSION$.jar:/opt/fluss/plugins/iceberg/fluss-lake-iceberg-$FLUSS_DOCKER_VERSION$.jar - - ./lib/hadoop-apache-3.3.5-2.jar:/opt/fluss/plugins/iceberg/hadoop-apache-3.3.5-2.jar + - ./lib/iceberg-aws-1.10.1.jar:/opt/fluss/plugins/iceberg/iceberg-aws-1.10.1.jar + - ./lib/iceberg-aws-bundle-1.10.1.jar:/opt/fluss/plugins/iceberg/iceberg-aws-bundle-1.10.1.jar + - ./lib/postgresql-42.7.4.jar:/opt/fluss/plugins/iceberg/postgresql-42.7.4.jar tablet-server: image: apache/fluss:$FLUSS_DOCKER_VERSION$ command: tabletServer @@ -254,14 +357,25 @@ services: zookeeper.address: zookeeper:2181 bind.listeners: FLUSS://tablet-server:9123 data.dir: /tmp/fluss/data - remote.data.dir: /tmp/fluss/remote-data kv.snapshot.interval: 0s + remote.data.dir: s3://fluss/remote-data + s3.endpoint: http://rustfs:9000 + s3.access-key: rustfsadmin + s3.secret-key: rustfsadmin + s3.path.style.access: true datalake.format: iceberg - datalake.iceberg.type: hadoop - datalake.iceberg.warehouse: /tmp/iceberg - volumes: - - shared-tmpfs:/tmp/iceberg - - shared-tmpfs:/tmp/fluss + datalake.iceberg.catalog-impl: org.apache.iceberg.jdbc.JdbcCatalog + datalake.iceberg.name: fluss_catalog + datalake.iceberg.uri: jdbc:postgresql://postgres:5432/iceberg + datalake.iceberg.jdbc.user: iceberg + datalake.iceberg.jdbc.password: iceberg + datalake.iceberg.warehouse: s3://fluss/iceberg + datalake.iceberg.io-impl: org.apache.iceberg.aws.s3.S3FileIO + datalake.iceberg.s3.endpoint: http://rustfs:9000 + datalake.iceberg.s3.access-key-id: rustfsadmin + datalake.iceberg.s3.secret-access-key: rustfsadmin + datalake.iceberg.s3.path-style-access: true + datalake.iceberg.client.region: us-east-1 zookeeper: restart: always image: zookeeper:3.9.2 @@ -271,8 +385,7 @@ services: 
- "8083:8081" entrypoint: ["/bin/bash", "-c"] command: > - "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh && - cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true; + "cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true; cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true; /docker-entrypoint.sh jobmanager" environment: @@ -280,8 +393,6 @@ services: FLINK_PROPERTIES= jobmanager.rpc.address: jobmanager volumes: - - shared-tmpfs:/tmp/iceberg - - shared-tmpfs:/tmp/fluss - ./lib:/tmp/jars - ./opt:/tmp/opt taskmanager: @@ -290,8 +401,7 @@ services: - jobmanager entrypoint: ["/bin/bash", "-c"] command: > - "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh && - cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true; + "cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true; cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true; /docker-entrypoint.sh taskmanager" environment: @@ -300,23 +410,24 @@ services: jobmanager.rpc.address: jobmanager taskmanager.numberOfTaskSlots: 10 taskmanager.memory.process.size: 2048m + taskmanager.memory.task.off-heap.size: 128m volumes: - - shared-tmpfs:/tmp/iceberg - - shared-tmpfs:/tmp/fluss - ./lib:/tmp/jars - ./opt:/tmp/opt volumes: - shared-tmpfs: - driver: local - driver_opts: - type: "tmpfs" - device: "tmpfs" + rustfs-data: ``` The Docker Compose environment consists of the following containers: - **Fluss Cluster:** a Fluss `CoordinatorServer`, a Fluss `TabletServer` and a `ZooKeeper` server. - **Flink Cluster**: a Flink `JobManager` and a Flink `TaskManager` container to execute queries. +- **PostgreSQL**: stores Iceberg catalog metadata (used by `JdbcCatalog`). +- **RustFS**: an S3-compatible storage system used both as Fluss remote storage and Iceberg's filesystem warehouse. + +:::tip +[RustFS](https://github.com/rustfs/rustfs) is used as a replacement for S3 in this quickstart example; for a production setup you may want to configure a cloud file system instead. 
See [here](/maintenance/filesystems/overview.md) for information on how to set up cloud file systems. +::: 4. To start all containers, run: ```shell @@ -492,12 +603,33 @@ SET 'table.exec.sink.not-null-enforcer'='DROP'; ## Create Fluss Tables ### Create Fluss Catalog Use the following SQL to create a Fluss catalog: + +<Tabs groupId="lake-tabs"> + <TabItem value="paimon" label="Paimon" default> + +```sql title="Flink SQL" +CREATE CATALOG fluss_catalog WITH ( + 'type' = 'fluss', + 'bootstrap.servers' = 'coordinator-server:9123', + 'paimon.s3.access-key' = 'rustfsadmin', + 'paimon.s3.secret-key' = 'rustfsadmin' +); +``` + </TabItem> + + <TabItem value="iceberg" label="Iceberg"> + ```sql title="Flink SQL" CREATE CATALOG fluss_catalog WITH ( 'type' = 'fluss', - 'bootstrap.servers' = 'coordinator-server:9123' + 'bootstrap.servers' = 'coordinator-server:9123', + 'iceberg.jdbc.password' = 'iceberg', + 'iceberg.s3.access-key-id' = 'rustfsadmin', + 'iceberg.s3.secret-access-key' = 'rustfsadmin' ); ``` + </TabItem> +</Tabs> ```sql title="Flink SQL" USE CATALOG fluss_catalog; ``` @@ -614,7 +746,11 @@ docker compose exec jobmanager \ --fluss.bootstrap.servers coordinator-server:9123 \ --datalake.format paimon \ --datalake.paimon.metastore filesystem \ - --datalake.paimon.warehouse /tmp/paimon + --datalake.paimon.warehouse s3://fluss/paimon \ + --datalake.paimon.s3.endpoint http://rustfs:9000 \ + --datalake.paimon.s3.access.key rustfsadmin \ + --datalake.paimon.s3.secret.key rustfsadmin \ + --datalake.paimon.s3.path.style.access true ``` You should see a Flink Job to tier data from Fluss to Paimon running in the [Flink Web UI](http://localhost:8083/). 
@@ -627,11 +763,21 @@ Open a new terminal, navigate to the `fluss-quickstart-iceberg` directory, and e ```shell docker compose exec jobmanager \ /opt/flink/bin/flink run \ - /opt/flink/opt/fluss-flink-tiering.jar \ + /opt/flink/opt/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar \ --fluss.bootstrap.servers coordinator-server:9123 \ --datalake.format iceberg \ - --datalake.iceberg.type hadoop \ - --datalake.iceberg.warehouse /tmp/iceberg + --datalake.iceberg.catalog-impl org.apache.iceberg.jdbc.JdbcCatalog \ + --datalake.iceberg.name fluss_catalog \ + --datalake.iceberg.uri "jdbc:postgresql://postgres:5432/iceberg" \ + --datalake.iceberg.jdbc.user iceberg \ + --datalake.iceberg.jdbc.password iceberg \ + --datalake.iceberg.warehouse "s3://fluss/iceberg" \ + --datalake.iceberg.io-impl org.apache.iceberg.aws.s3.S3FileIO \ + --datalake.iceberg.s3.endpoint "http://rustfs:9000" \ + --datalake.iceberg.s3.access-key-id rustfsadmin \ + --datalake.iceberg.s3.secret-access-key rustfsadmin \ + --datalake.iceberg.s3.path-style-access true \ + --datalake.iceberg.client.region us-east-1 ``` You should see a Flink Job to tier data from Fluss to Iceberg running in the [Flink Web UI](http://localhost:8083/). @@ -888,9 +1034,19 @@ The files adhere to Iceberg's standard format, enabling seamless querying with o </TabItem> </Tabs> +### Quitting SQL Client + +The following command quits the Flink SQL Client. +```sql title="Flink SQL" +quit; +``` + +### Tiered Storage + +You can visit http://localhost:9001/ and sign in with `rustfsadmin` / `rustfsadmin` to view the files stored on tiered storage. + ## Clean up -After finishing the tutorial, run `exit` to exit Flink SQL CLI Container and then run +Run the following to stop all containers. ```shell docker compose down -v -``` -to stop all containers. \ No newline at end of file
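A note on one setting the updated quickstarts enable everywhere: path-style access (`s3.path-style-access` / `s3.path.style.access`). Path-style addressing puts the bucket in the URL path instead of in the hostname, which is what a single-host S3-compatible endpoint like the quickstart's RustFS needs, since `fluss.rustfs` is not a resolvable DNS name. A minimal sketch of the two addressing styles, using the endpoint and bucket from the quickstart (the object key is a made-up example):

```shell
# Build both S3 URL styles for the same object.
# Endpoint and bucket come from the quickstart compose file;
# the object key is hypothetical, for illustration only.
endpoint="http://rustfs:9000"
bucket="fluss"
key="remote-data/kv/snap-1/_METADATA"

# Path-style: the bucket is a path segment -- works against any single host.
path_style="${endpoint}/${bucket}/${key}"

# Virtual-hosted-style: the bucket becomes a hostname prefix -- requires
# DNS that resolves "<bucket>.<host>", which a local container setup lacks.
virtual_hosted="http://${bucket}.${endpoint#http://}/${key}"

echo "path-style:     ${path_style}"
echo "virtual-hosted: ${virtual_hosted}"
```

With path-style enabled, S3 clients send requests in the first form, so only the single `rustfs:9000` host name has to resolve.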
