morningman commented on code in PR #2571: URL: https://github.com/apache/polaris/pull/2571#discussion_r2350227771
########## site/content/blog/2025/09/15/doris-polaris-integration.md: ########## @@ -0,0 +1,427 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: "Doris X Polaris: Building Unified Data Lakehouse with Iceberg REST Catalog - A Practical Guide" +date: 2025-09-15 +author: zy-kkk +--- + +With the continuous evolution of data lake technologies, efficiently and securely managing massive datasets stored on object storage (such as AWS S3) while providing unified access endpoints for upstream analytics engines (like [Apache Doris](https://doris.apache.org)) has become a core challenge in modern data architectures. [Apache Polaris](https://polaris.apache.org/), as an open and standardized REST Catalog service for Iceberg, provides an ideal solution to this challenge. It not only handles centralized metadata management but also significantly enhances data lake security and manageability through fine-grained access control and flexible credential management mechanisms. + +This document will provide a detailed guide on integrating Apache Doris with Polaris to achieve efficient querying and management of Iceberg data on S3. We'll guide you through the complete process from environment preparation to final data querying step by step + +**Through this documentation, you will quickly learn:** + +* **AWS Environment Setup**: How to create and configure S3 buckets in AWS, and prepare the necessary IAM roles and policies for both Polaris and Doris, enabling Polaris to access S3 and vend temporary credentials for Doris. + +* **Polaris Deployment and Configuration**: How to download and start the Polaris service, and create Iceberg Catalog, Namespace, and corresponding Principal/Role/permissions in Polaris to provide secure metadata access endpoints for Doris. + +* **Doris-Polaris Integration**: Explains how Doris obtains metadata access tokens from Polaris via OAuth2, and demonstrates two core underlying storage access methods: + + 1. Temporary AK/SK distribution by Polaris (Credential Vending mechanism) + + 2. Doris directly using static AK/SK to access S3 + +## About Apache Doris + +[Apache Doris](https://doris.apache.org) is the fastest analytical and search database for the AI era. + +It provides high-performance hybrid search capabilities across structured data, semi-structured data (such as JSON), and vector data. It excels at delivering high-concurrency, low-latency queries, while also offering advanced optimization for complex join operations. In addition, Doris can serve as a unified query engine, delivering high-performance analytical services not only on its self-managed internal table format but also on open lakehouse formats such as Iceberg. + +With Doris, users can easily build a real-time lakehouse data platform. + +## About Apache Polaris + +Apache Polaris (Incubating) is a catalog implementation for Apache Iceberg™ tables and is built on the open source Apache Iceberg™ REST protocol. + +With Polaris, you can provide centralized, secure read and write access to your Iceberg tables across different REST-compatible query engines. + +## Hands-on Guide + +### 1. AWS Environment Setup + +Before we begin, we need to prepare S3 buckets and corresponding IAM roles on AWS, which form the foundation for Polaris to manage data and Doris to access data. + +#### 1.1 Create S3 Bucket + +First, we create an S3 bucket named `polaris-doris-test` to store the Iceberg table data that will be created later. + +```bash +# Create an S3 bucket +aws s3 mb s3://polaris-doris-test --region us-west-2 +# Verify that the bucket was created successfully +aws s3 ls | grep polaris-doris-test +``` + +#### 1.2 Create IAM Role for Object Storage Access + +To implement secure credential management, we need to create an IAM role for Polaris to use through the STS AssumeRole mechanism. This design follows the security best practices of the least privileged principle and separation of duties. + +1. Create a trust policy file + + Create the `polaris-trust-policy.json` file: + + > Note: Replace YOUR\_ACCOUNT\_ID with your actual AWS account ID, which can be obtained using `aws sts get-caller-identity --query Account --output text`. + + ```bash + cat > polaris-trust-policy.json <<EOF + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::YOUR_ACCOUNT_ID:root" + }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { + "sts:ExternalId": "polaris-doris-demo" + } + } + } + ] + } + EOF + ``` + +2. Create an IAM Role + + ```bash + aws iam create-role \ + --role-name polaris-doris-demo \ + --assume-role-policy-document file:///path/to/polaris-trust-policy.json \ + --description "IAM Role for Polaris to access S3 storage" + ``` + +3. Attach S3 access permission policy + + ```bash + # Attach the AmazonS3FullAccess managed policy (for testing only, use fine-grained permissions for production environments) + aws iam attach-role-policy \ + --role-name polaris-doris-demo \ + --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess + ``` + +#### 1.3 Bind IAM Role to EC2 Instance (Optional) + +> If you do not perform this step, you need to export `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` before starting polaris + +If your Polaris service will run on an EC2 instance, it is best to bind an IAM role to the EC2 instance instead of using access keys. This avoids hard-coding credentials in the code and improves security. + +1. Create a trust policy for the EC2 instance role + + First, create the trust policy file that allows the EC2 service to assume this role: + + ```json + cat > ec2-trust-policy.json <<EOF + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Service": "ec2.amazonaws.com" + }, + "Action": "sts:AssumeRole" + } + ] + } + EOF + ``` + +2. Create EC2 Instance Role + + ```bash + aws iam create-role \ + --role-name polaris-ec2-role \ + --assume-role-policy-document file:///path/to/ec2-trust-policy.json \ + --description "IAM Role for EC2 instance running Polaris service" + ``` + +3. Attach S3 access permission policy + + ```bash + # Attach the AmazonS3FullAccess managed policy + aws iam attach-role-policy \ + --role-name polaris-ec2-role \ + --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess + ``` + +4. Create an instance configuration file + + ```bash + # Create an instance profile + aws iam create-instance-profile \ + --instance-profile-name polaris-ec2-instance-profile + + # Add a role to an instance profile + aws iam add-role-to-instance-profile \ + --instance-profile-name polaris-ec2-instance-profile \ + --role-name polaris-ec2-role + ``` + +5. Attach the instance profile to the EC2 instance + + ```bash + # If it is a newly created EC2 instance, specify it at startup + aws ec2 run-instances \ + --image-id ami-xxxxxxxxx \ + --instance-type t3.medium \ + --iam-instance-profile Name=polaris-ec2-instance-profile \ + --other-parameters... + + # If it is an existing EC2 instance, you need to associate the instance profile + aws ec2 associate-iam-instance-profile \ + --instance-id i-xxxxxxxxx \ + --iam-instance-profile Name=polaris-ec2-instance-profile + ``` + +### 2. Polaris Deployment and Catalog Creation + +With the environment ready, we'll now deploy the Polaris service and configure the Iceberg Catalog. + +> This document uses the source code quick start method. For more deployment methods, please refer to: https://polaris.apache.org/releases/1.0.1/getting-started/deploying-polaris/ + +#### 2.1 Clone Source Code and Start Polaris + +1. Configure AWS Credentials(Optional) + + If you're not running Polaris on EC2, or if the EC2 instance doesn't have the appropriate IAM Role attached, you need to provide Polaris with AK/SK that has permission to assume the `polaris-doris-demo` role through environment variables. + + ```bash + export AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID + export AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY + ``` + +2. Clone Polaris Repository and Switch to Specific Version + + ```bash + git clone https://github.com/apache/polaris.git + cd polaris + # Recommend using a released stable version + git checkout apache-polaris-1.0.1-incubating + ``` + +3. Run Polaris + + Ensure you have Java 21+ and Docker 27+ installed. + + ```bash + ./gradlew run -Dpolaris.bootstrap.credentials=POLARIS,root,secret + ``` + + * `POLARIS` is the realm + + * `root` is the `CLIENT_ID` + + * `secret` is the `CLIENT_SECRET` + + * If credentials are not set, it will use preset credentials `POLARIS,root,s3cr3t` + + This command will compile and start the Polaris service, which listens on port 8181 by default. + + > You can also use binary distribution, see: https://github.com/apache/polaris/tree/main/runtime/distribution + + +#### 2.2 Create Catalog and Namespace in Polaris + +1. Export ROOT Credentials + + > The `CLIENT_ID` and `CLIENT_SECRET` here are the same as those we set when we started Polaris + + ```bash + export CLIENT_ID=root + export CLIENT_SECRET=secret + ``` + +2. Create Catalog (Pointing to S3 Storage) + + ```bash + ./polaris catalogs create \ + --storage-type s3 \ + --default-base-location s3://polaris-doris-test/polaris1 \ + --role-arn arn:aws:iam::<account_id>:role/polaris-doris-demo \ + --external-id polaris-doris-demo \ + doris_catalog + ``` + + * `--storage-type`: Specifies the underlying storage as S3. + + * `--default-base-location`: Default root path for Iceberg table data. + + * `--role-arn`: IAM Role that Polaris service uses to assume for S3 access. + + * `--external-id`: External ID used when assuming the role, must match the configuration in the IAM Role trust policy. + +3. Create Namespace + + ```bash + ./polaris namespaces create --catalog doris_catalog doris_demo + ``` + + This creates a namespace (database) named `doris_demo` under `doris_catalog`. + +#### 2.3 Polaris Security Roles and Permission Configuration + +To allow Doris to access as a `non-root` user, we need to create a new user and role with appropriate permissions. + +1. Create Principal Role and Catalog Role + + ```bash + # Create a Principal Role for aggregating permissions + ./polaris principal-roles create doris_pr_role + + # Create a Catalog Role under doris_catalog + ./polaris catalog-roles create --catalog doris_catalog doris_catalog_role + ``` + +2. Grant Permissions to Catalog Role + + ```bash + # Grant doris_catalog_role permission to manage content within the Catalog + ./polaris privileges catalog grant \ + --catalog doris_catalog \ + --catalog-role doris_catalog_role \ + CATALOG_MANAGE_CONTENT + ``` + +3. Associate Principal Role and Catalog Role + + ```bash + # Assign doris_catalog_role to doris_pr_role + ./polaris catalog-roles grant \ + --catalog doris_catalog \ + --principal-role doris_pr_role \ + doris_catalog_role + ``` + +4. Create New Principal (User) and Bind Role + + ```bash + # Create a new user (Principal) named doris_user + ./polaris principals create doris_user + # Example output: {"clientId": "6e155b128dc06c13", "clientSecret": "ce9fbb4cc91c43ff2955f2c6545239d7"} + # Please note down this new client_id and client_secret pair, as Doris will use them for connection. + + # Bind doris_user to doris_pr_role + ./polaris principal-roles grant \ + doris_pr_role \ + --principal doris_user + ``` + + With this, all Polaris-side configuration is complete. We've created a user named `doris_user` that obtains permission to manage `doris_catalog` through `doris_pr_role`. + +### 3. Doris-Polaris Integration + +Now, we'll create an Iceberg Catalog in Doris that connects to the newly configured Polaris service. Doris supports multiple flexible authentication combinations. + +> Note: In this example, we use OAuth2 authentication credential to connect to the Polaris rest service. In addition, Doris also supports using `iceberg.rest.oauth2.token `to directly provide a pre-obtained Bearer Token + +#### Method 1: OAuth2 + Temporary Storage Credentials (Credential Vending) + +This is the **most recommended** approach. Doris uses OAuth2 credentials to authenticate with Polaris and obtain metadata. When needing to read/write data files on S3, Doris requests a temporary S3 access credential with minimal privileges from Polaris. + +**Doris Catalog Creation Statement:** + +Use the `clientId` and `clientSecret` generated for `doris_user`. + +```sql +CREATE CATALOG polaris_vended PROPERTIES ( + 'type' = 'iceberg', + -- Catalog name in Polaris + 'warehouse' = 'doris_catalog', + 'iceberg.catalog.type' = 'rest', + -- Polaris service address + 'iceberg.rest.uri' = 'http://YOUR_POLARIS_HOST:8181/api/catalog', + -- Metadata authentication method + 'iceberg.rest.security.type' = 'oauth2', + -- Replace with doris_user's client_id:client_secret + 'iceberg.rest.oauth2.credential' = 'client_id:client_secret', + 'iceberg.rest.oauth2.server-uri' = 'http://YOUR_POLARIS_HOST:8181/api/catalog/v1/oauth/tokens', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:doris_pr_role', + -- Enable credential vending + 'iceberg.rest.vended-credentials-enabled' = 'true', + -- S3 basic configuration (no keys required) + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', Review Comment: Thanks for pointing this out! This is actually a limitation on the Doris side — we currently need to recognize the storage type through an explicit parameter. We’ll look into improving this in the future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@polaris.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org