singhpk234 commented on code in PR #2571: URL: https://github.com/apache/polaris/pull/2571#discussion_r2350292577
##########
site/content/blog/2025/09/15/doris-polaris-integration.md:
##########
@@ -0,0 +1,427 @@
+---
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+title: "Doris X Polaris: Building Unified Data Lakehouse with Iceberg REST Catalog - A Practical Guide"
+date: 2025-09-15
+author: zy-kkk
+---
+
+With the continuous evolution of data lake technologies, efficiently and securely managing massive datasets stored on object storage (such as AWS S3) while providing unified access endpoints for upstream analytics engines (like [Apache Doris](https://doris.apache.org)) has become a core challenge in modern data architectures. [Apache Polaris](https://polaris.apache.org/), as an open and standardized REST Catalog service for Iceberg, provides an ideal solution to this challenge. It not only handles centralized metadata management but also significantly enhances data lake security and manageability through fine-grained access control and flexible credential management mechanisms.
+
+This document will provide a detailed guide on integrating Apache Doris with Polaris to achieve efficient querying and management of Iceberg data on S3. We'll guide you through the complete process from environment preparation to final data querying step by step
+
+**Through this documentation, you will quickly learn:**
+
+* **AWS Environment Setup**: How to create and configure S3 buckets in AWS, and prepare the necessary IAM roles and policies for both Polaris and Doris, enabling Polaris to access S3 and vend temporary credentials for Doris.
+
+* **Polaris Deployment and Configuration**: How to download and start the Polaris service, and create Iceberg Catalog, Namespace, and corresponding Principal/Role/permissions in Polaris to provide secure metadata access endpoints for Doris.
+
+* **Doris-Polaris Integration**: Explains how Doris obtains metadata access tokens from Polaris via OAuth2, and demonstrates two core underlying storage access methods:
+
+  1. Temporary AK/SK distribution by Polaris (Credential Vending mechanism)
+
+  2. Doris directly using static AK/SK to access S3
+
+## About Apache Doris
+
+[Apache Doris](https://doris.apache.org) is the fastest analytical and search database for the AI era.
+
+It provides high-performance hybrid search capabilities across structured data, semi-structured data (such as JSON), and vector data. It excels at delivering high-concurrency, low-latency queries, while also offering advanced optimization for complex join operations. In addition, Doris can serve as a unified query engine, delivering high-performance analytical services not only on its self-managed internal table format but also on open lakehouse formats such as Iceberg.
+
+With Doris, users can easily build a real-time lakehouse data platform.
+
+## About Apache Polaris
+
+Apache Polaris (Incubating) is a catalog implementation for Apache Iceberg™ tables and is built on the open source Apache Iceberg™ REST protocol.
+
+With Polaris, you can provide centralized, secure read and write access to your Iceberg tables across different REST-compatible query engines.
+
+## Hands-on Guide
+
+### 1. AWS Environment Setup
+
+Before we begin, we need to prepare S3 buckets and corresponding IAM roles on AWS, which form the foundation for Polaris to manage data and Doris to access data.
+
+#### 1.1 Create S3 Bucket
+
+First, we create an S3 bucket named `polaris-doris-test` to store the Iceberg table data that will be created later.
+
+```bash
+# Create an S3 bucket
+aws s3 mb s3://polaris-doris-test --region us-west-2
+# Verify that the bucket was created successfully
+aws s3 ls | grep polaris-doris-test
+```
+
+#### 1.2 Create IAM Role for Object Storage Access
+
+To implement secure credential management, we need to create an IAM role for Polaris to use through the STS AssumeRole mechanism. This design follows the security best practices of the least privileged principle and separation of duties.
+
+1. Create a trust policy file
+
+   Create the `polaris-trust-policy.json` file:
+
+   > Note: Replace YOUR\_ACCOUNT\_ID with your actual AWS account ID, which can be obtained using `aws sts get-caller-identity --query Account --output text`.
+
+   ```bash
+   cat > polaris-trust-policy.json <<EOF
+   {
+     "Version": "2012-10-17",
+     "Statement": [
+       {
+         "Effect": "Allow",
+         "Principal": {
+           "AWS": "arn:aws:iam::YOUR_ACCOUNT_ID:root"
+         },
+         "Action": "sts:AssumeRole",
+         "Condition": {
+           "StringEquals": {
+             "sts:ExternalId": "polaris-doris-demo"
+           }
+         }
+       }
+     ]
+   }
+   EOF
+   ```
+
+2. Create an IAM Role
+
+   ```bash
+   aws iam create-role \
+     --role-name polaris-doris-demo \
+     --assume-role-policy-document file:///path/to/polaris-trust-policy.json \
+     --description "IAM Role for Polaris to access S3 storage"
+   ```
+
+3. Attach S3 access permission policy
+
+   ```bash
+   # Attach the AmazonS3FullAccess managed policy (for testing only, use fine-grained permissions for production environments)
+   aws iam attach-role-policy \
+     --role-name polaris-doris-demo \
+     --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
+   ```
+
+#### 1.3 Bind IAM Role to EC2 Instance (Optional)
+
+> If you do not perform this step, you need to export `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` before starting polaris
+
+If your Polaris service will run on an EC2 instance, it is best to bind an IAM role to the EC2 instance instead of using access keys. This avoids hard-coding credentials in the code and improves security.
+
+1. Create a trust policy for the EC2 instance role
+
+   First, create the trust policy file that allows the EC2 service to assume this role:
+
+   ```json
+   cat > ec2-trust-policy.json <<EOF
+   {
+     "Version": "2012-10-17",
+     "Statement": [
+       {
+         "Effect": "Allow",
+         "Principal": {
+           "Service": "ec2.amazonaws.com"
+         },
+         "Action": "sts:AssumeRole"
+       }
+     ]
+   }
+   EOF
+   ```
+
+2. Create EC2 Instance Role
+
+   ```bash
+   aws iam create-role \
+     --role-name polaris-ec2-role \
+     --assume-role-policy-document file:///path/to/ec2-trust-policy.json \
+     --description "IAM Role for EC2 instance running Polaris service"
+   ```
+
+3. Attach S3 access permission policy
+
+   ```bash
+   # Attach the AmazonS3FullAccess managed policy
+   aws iam attach-role-policy \
+     --role-name polaris-ec2-role \
+     --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
+   ```
+
+4. Create an instance configuration file
+
+   ```bash
+   # Create an instance profile
+   aws iam create-instance-profile \
+     --instance-profile-name polaris-ec2-instance-profile
+
+   # Add a role to an instance profile
+   aws iam add-role-to-instance-profile \
+     --instance-profile-name polaris-ec2-instance-profile \
+     --role-name polaris-ec2-role
+   ```
+
+5. Attach the instance profile to the EC2 instance
+
+   ```bash
+   # If it is a newly created EC2 instance, specify it at startup
+   aws ec2 run-instances \
+     --image-id ami-xxxxxxxxx \
+     --instance-type t3.medium \
+     --iam-instance-profile Name=polaris-ec2-instance-profile \
+     --other-parameters...
+
+   # If it is an existing EC2 instance, you need to associate the instance profile
+   aws ec2 associate-iam-instance-profile \
+     --instance-id i-xxxxxxxxx \
+     --iam-instance-profile Name=polaris-ec2-instance-profile
+   ```
+
+### 2. Polaris Deployment and Catalog Creation
+
+With the environment ready, we'll now deploy the Polaris service and configure the Iceberg Catalog.

Review Comment:
   [not a blocker] Wondering if you had a chance to look at this script in the repo: https://github.com/apache/polaris/blob/main/getting-started/assets/cloud_providers/deploy-aws.sh
   It automatically sets up the Polaris environment, including bucket creation, so I'm wondering if that is something we can leverage here.