kevinjqliu commented on code in PR #15062: URL: https://github.com/apache/iceberg/pull/15062#discussion_r2817539067
##########
site/docs/assets/images/flink-quickstart.excalidraw.png:
##########

Review Comment:
   nit: should we update this diagram to include IRC?

##########
site/docs/flink-quickstart.md:
##########
@@ -0,0 +1,174 @@
+---
+title: "Flink and Iceberg Quickstart"
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+This guide will get you up and running with Apache Iceberg™ using Apache Flink™, including sample code to
+highlight some powerful features. You can learn more about Iceberg's Flink runtime by checking out the [Flink](docs/latest/flink.md) section.
+
+## Quickstart environment
+
+The fastest way to get started is to use Docker Compose with the [Iceberg Flink Quickstart](https://github.com/apache/iceberg/tree/main/docker/iceberg-flink-quickstart) image.
+
+To use this, you'll need to install the [Docker CLI](https://docs.docker.com/get-docker/) as well as the [Docker Compose CLI](https://github.com/docker/compose-cli/blob/main/INSTALL.md).
+
+The quickstart includes:
+
+* A local Flink cluster (Job Manager and Task Manager)
+* Iceberg REST Catalog
+* MinIO (local S3 storage)
+
+Clone the Iceberg repository and start up the Docker containers:
+
+```sh
+git clone https://github.com/apache/iceberg.git
+cd iceberg
+docker compose -f docker/iceberg-flink-quickstart/docker-compose.yml up -d --build
+```
+
+Launch a Flink SQL client session:
+
+```sh
+docker exec -it jobmanager ./bin/sql-client.sh
+```
+
+## Creating an Iceberg Catalog in Flink
+
+Iceberg has several catalog back-ends that can be used to track tables, like JDBC, Hive MetaStore and Glue.
+In this guide we use a REST catalog, backed by S3.
+To learn more, check out the [Catalog](docs/latest/flink-configuration.md#catalog-configuration) page in the Flink section.
+
+First up, we need to define a Flink catalog.
+Tables within this catalog will be stored on the S3 blob store:
+
+```sql
+CREATE CATALOG iceberg_catalog WITH (
+    'type' = 'iceberg',
+    'catalog-impl' = 'org.apache.iceberg.rest.RESTCatalog',
+    'uri' = 'http://iceberg-rest:8181',
+    'warehouse' = 's3://warehouse/',
+    'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
+    's3.endpoint' = 'http://minio:9000',
+    's3.access-key-id' = 'admin',
+    's3.secret-access-key' = 'password',
+    's3.path-style-access' = 'true'
+);
+```
+
+Create a database in the catalog:
+
+```sql
+CREATE DATABASE IF NOT EXISTS iceberg_catalog.nyc;
+```
+
+## Creating a Table
+
+To create your first Iceberg table in Flink, run a [`CREATE TABLE`](docs/latest/flink-ddl.md#create-table) command.
+Let's create a table using `iceberg_catalog.nyc.taxis` where `iceberg_catalog` is the catalog name, `nyc` is the database name, and `taxis` is the table name.
+
+```sql
+CREATE TABLE iceberg_catalog.nyc.taxis
+(
+    vendor_id BIGINT,
+    trip_id BIGINT,
+    trip_distance FLOAT,
+    fare_amount DOUBLE,
+    store_and_fwd_flag STRING
+);
+```
+
+Iceberg catalogs support the full range of Flink SQL DDL commands, including:
+
+* [`CREATE TABLE ... PARTITIONED BY`](docs/latest/flink-ddl.md#partitioned-by)
+* [`ALTER TABLE`](docs/latest/flink-ddl.md#alter-table)
+* [`DROP TABLE`](docs/latest/flink-ddl.md#drop-table)
+
+## Writing Data to a Table
+
+Once your table is created, you can insert records.
+
+Flink uses checkpoints to ensure data durability and exactly-once semantics.
+Without checkpointing, Iceberg data and metadata may not be fully committed to storage.
+
+```sql
+SET 'execution.checkpointing.interval' = '10s';
+```
+
+Then you can write some data:
+
+```sql
+INSERT INTO iceberg_catalog.nyc.taxis
+VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');
+```
+
+## Reading Data from a Table
+
+To read a table, use the Iceberg table's name:
+
+```sql
+SELECT * FROM iceberg_catalog.nyc.taxis;
+```
+
+## Creating a Table with Inline Catalog Configuration
+
+Creating a Flink catalog as shown above, backed by an Iceberg catalog, is one way to use Iceberg in Flink.
+Another way is to use the [Iceberg connector](docs/latest/flink-connector.md) and specify the Iceberg details as table properties.
+
+Now create a table using inline configuration:
+
+!!! note
+    The table is _not_ being created in the Iceberg catalog that we created above, since we haven't supplied it as a prefix. If we had used `iceberg_catalog.taxis_inline_config`, it would use the Iceberg details from the catalog definition instead of the inline configuration.

Review Comment:
   nit: I was confused by this. It seems like the difference is an "external catalog" (IRC) vs. an "inline catalog" (in memory?). The flink-connector page mentions this: https://iceberg.apache.org/docs/latest/flink-connector/

   > Apache Flink supports creating Iceberg table directly without creating the explicit Flink catalog in Flink SQL. That means we can just create an iceberg table by specifying 'connector'='iceberg' table option in Flink SQL which is similar to usage in the Flink official [document](https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/connectors/table/overview/).

##########
site/docs/flink-quickstart.md:
##########
@@ -0,0 +1,174 @@
+To use this, you'll need to install the [Docker CLI](https://docs.docker.com/get-docker/) as well as the [Docker Compose CLI](https://github.com/docker/compose-cli/blob/main/INSTALL.md).

Review Comment:
```suggestion
To use this, you'll need to install the [Docker CLI](https://docs.docker.com/get-docker/).
```
   nit: I think the docker-compose CLI is deprecated; that repo is archived.

##########
docker/iceberg-flink-quickstart/test.sql:
##########
@@ -0,0 +1,78 @@
+-- =============================================================================
+-- Iceberg Flink Quickstart Test Script
+-- =============================================================================
+--
+-- Prerequisites:
+--   docker compose -f docker/iceberg-flink-quickstart/docker-compose.yml up -d --build
+--   docker exec -it jobmanager ./bin/sql-client.sh
+--
+-- Then paste this script or run it line by line
+-- =============================================================================

+-- -----------------------------------------------------------------------------
+-- 1. Create the Iceberg REST catalog
+-- -----------------------------------------------------------------------------
+CREATE CATALOG iceberg_catalog WITH (
+    'type' = 'iceberg',
+    'catalog-impl' = 'org.apache.iceberg.rest.RESTCatalog',
+    'uri' = 'http://iceberg-rest:8181',
+    'warehouse' = 's3://warehouse/',
+    'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
+    's3.endpoint' = 'http://minio:9000',
+    's3.access-key-id' = 'admin',
+    's3.secret-access-key' = 'password',
+    's3.path-style-access' = 'true'
+);

+-- -----------------------------------------------------------------------------
+-- 2. Create a database and table
+-- -----------------------------------------------------------------------------
+CREATE DATABASE IF NOT EXISTS iceberg_catalog.nyc;

+CREATE TABLE iceberg_catalog.nyc.taxis (
+    vendor_id BIGINT,
+    trip_id BIGINT,
+    trip_distance FLOAT,
+    fare_amount DOUBLE,
+    store_and_fwd_flag STRING
+);

+-- -----------------------------------------------------------------------------
+-- 3. Enable checkpointing (required for Iceberg commits)
+-- -----------------------------------------------------------------------------
+SET 'execution.checkpointing.interval' = '10s';

+-- -----------------------------------------------------------------------------
+-- 4. Insert data
+-- -----------------------------------------------------------------------------
+INSERT INTO iceberg_catalog.nyc.taxis
+VALUES
+    (1, 1000371, 1.8, 15.32, 'N'),
+    (2, 1000372, 2.5, 22.15, 'N'),
+    (2, 1000373, 0.9, 9.01, 'N'),
+    (1, 1000374, 8.4, 42.13, 'Y');

+-- -----------------------------------------------------------------------------
+-- 5. Query the data
+-- -----------------------------------------------------------------------------
+SET 'sql-client.execution.result-mode' = 'tableau';
+SELECT * FROM iceberg_catalog.nyc.taxis;

+-- -----------------------------------------------------------------------------
+-- 6. Inspect Iceberg metadata
+-- -----------------------------------------------------------------------------
+-- Snapshots
+SELECT * FROM iceberg_catalog.nyc.`taxis$snapshots`;

+-- Data files
+SELECT content, file_path, file_format, record_count
+FROM iceberg_catalog.nyc.`taxis$files`;

+-- History
+SELECT * FROM iceberg_catalog.nyc.`taxis$history`;

+-- -----------------------------------------------------------------------------
+-- 7. Cleanup (optional)
+-- -----------------------------------------------------------------------------
+DROP TABLE iceberg_catalog.nyc.taxis;
+DROP DATABASE iceberg_catalog.nyc;

Review Comment:
```suggestion
-- DROP TABLE iceberg_catalog.nyc.taxis;
-- DROP DATABASE iceberg_catalog.nyc;
```
   nit: comment these out since it's optional
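A possible follow-up to the metadata queries in section 6 of the test script: once a `snapshot_id` has been copied from the `taxis$snapshots` result, the table can be read as of that snapshot via Iceberg's Flink read options. This is only a sketch; the snapshot id below is a placeholder, not a real value:

```sql
-- Time travel: read the table as of a specific snapshot. Replace the
-- placeholder id with a real snapshot_id from the taxis$snapshots query.
SELECT * FROM iceberg_catalog.nyc.taxis
/*+ OPTIONS('snapshot-id'='1234567890000000000') */;
```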
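To make the "inline catalog" comparison in the review thread above concrete, here is a sketch of what a `'connector'='iceberg'` table definition might look like for this quickstart. The table name `taxis_inline_config` comes from the diff; the `catalog-name` key follows the flink-connector page, and the remaining property values are assumptions that mirror the REST catalog definition above — verify the exact supported keys against that page:

```sql
-- Sketch: inline Iceberg configuration via the Flink connector.
-- Property values mirroring the quickstart's REST catalog are assumptions.
CREATE TABLE taxis_inline_config (
    vendor_id BIGINT,
    trip_id BIGINT,
    fare_amount DOUBLE
) WITH (
    'connector' = 'iceberg',
    'catalog-name' = 'inline_catalog',                       -- arbitrary name
    'catalog-impl' = 'org.apache.iceberg.rest.RESTCatalog',  -- assumed: same REST catalog as above
    'uri' = 'http://iceberg-rest:8181',
    'warehouse' = 's3://warehouse/',
    'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
    's3.endpoint' = 'http://minio:9000'
);
```

Note that, as the `!!! note` in the diff says, this table is registered in Flink's current catalog rather than in `iceberg_catalog`.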
