dimas-b commented on code in PR #4451: URL: https://github.com/apache/polaris/pull/4451#discussion_r3250202917
########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. 
+ Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. + +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. Review Comment: I'm a bit hesitant about this sentence, even though it is correct. Using external IDs is the best practice. Users who know what they are doing and still do not want the external ID probably already know how to construct the policy text. ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. + +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. + +## Catalog storage configuration + +Provide the role ARN and region when creating the catalog. `userArn` is the identity Polaris +itself uses (typically the role ARN of the server); `externalId` matches the trust policy above. 
+ +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "type": "INTERNAL", + "name": "warehouse_s3", + "storageConfigInfo": { + "storageType": "S3", + "roleArn": "arn:aws:iam::123456789012:role/polaris-warehouse-access", + "userArn": "arn:aws:iam::123456789012:role/polaris-server", Review Comment: I believe this line is irrelevant to deployments using current Apache Polaris distributions (can be omitted). ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. 
+ +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. + +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. + +## Catalog storage configuration + +Provide the role ARN and region when creating the catalog. `userArn` is the identity Polaris +itself uses (typically the role ARN of the server); `externalId` matches the trust policy above. + +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ Review Comment: The bearer token is non-trivial to obtain for users. It might be worth adding a doc page (with a link) to how it can be obtained... of course, that will need to cover external IdP too... so it's not a trivial change. 
Maybe just add a quick note about this to the doc to avoid making readers think they are missing something obvious 😅 ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. 
Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. + +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. + +## Catalog storage configuration + +Provide the role ARN and region when creating the catalog. `userArn` is the identity Polaris +itself uses (typically the role ARN of the server); `externalId` matches the trust policy above. + +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "type": "INTERNAL", + "name": "warehouse_s3", + "storageConfigInfo": { + "storageType": "S3", + "roleArn": "arn:aws:iam::123456789012:role/polaris-warehouse-access", + "userArn": "arn:aws:iam::123456789012:role/polaris-server", + "externalId": "polaris-prod", + "region": "us-east-1" + }, + "properties": { "default-base-location": "s3://warehouse-bucket/prod/" } + }' +``` + +The role ARN is validated against the pattern enforced by `AwsStorageConfigurationInfo`; an +ill-formed ARN is rejected at catalog creation time. 
+ +## Server-side encryption with KMS + +When the bucket uses SSE-KMS, supply the key Polaris should use for writes and the full set of +keys it is allowed to read from: + +```json +"storageConfigInfo": { + "storageType": "S3", + "roleArn": "...", + "region": "us-east-1", + "currentKmsKey": "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "allowedKmsKeys": [ + "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "arn:aws:kms:us-east-1:123456789012:key/cccc-dddd" + ] +} +``` + +The IAM role's policy must include `kms:GenerateDataKey` and `kms:Decrypt` on every key listed in +`allowedKmsKeys`, and the key policy must grant the same to the role principal. + +If the deployment does not use KMS, set `kmsUnavailable` to `true` so Polaris will not request +KMS-related session permissions: + +```json +"kmsUnavailable": true +``` + +## S3-compatible endpoints + +Polaris can be pointed at S3-compatible object stores (MinIO, Ceph RGW, Apache Ozone S3 gateway). Review Comment: Since the page covers non-AWS systems, I suggest re-titling to `Configuring S3 Storage`. ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-azure-blob-cloud-storage-specific.md: ########## @@ -0,0 +1,214 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. 
See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring Azure Blob Cloud Storage +linkTitle: Configuring Azure Blob Cloud Storage +type: docs +weight: 620 +--- + +This page covers configuring Azure Blob Storage and Azure Data Lake Storage Gen2 (ADLS Gen2) as +the storage backend for a Polaris catalog. Polaris authenticates against Azure with the credentials +of a service principal that has data-plane access to the target storage account, and then vends +short-lived SAS tokens to clients on each table-load request. + +## Service principal and Polaris credentials + +Polaris uses the Azure SDK's `DefaultAzureCredential` chain, which by default reads the +service-principal credentials from environment variables. Create a service principal with data +access to the storage account and pass its credentials to the Polaris process: + +```bash +# Replace <subscription>, <resource-group>, <storage-account> with your values. +az ad sp create-for-rbac \ + --name polaris-storage \ + --role "Storage Blob Data Contributor" \ + --scopes "/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>" +``` + +The command prints `appId`, `password`, and `tenant`. Set these on the Polaris server: + +```bash +export AZURE_TENANT_ID=<tenant> +export AZURE_CLIENT_ID=<appId> +export AZURE_CLIENT_SECRET=<password> +``` + +In a container deployment, set the same three variables on the Polaris container/pod. The Review Comment: Please mention using k8s secret references rather than clear-text env. settings for `AZURE_CLIENT_SECRET`. ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. 
+ +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. + +## Catalog storage configuration + +Provide the role ARN and region when creating the catalog. `userArn` is the identity Polaris +itself uses (typically the role ARN of the server); `externalId` matches the trust policy above. + +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "type": "INTERNAL", + "name": "warehouse_s3", + "storageConfigInfo": { + "storageType": "S3", + "roleArn": "arn:aws:iam::123456789012:role/polaris-warehouse-access", + "userArn": "arn:aws:iam::123456789012:role/polaris-server", + "externalId": "polaris-prod", + "region": "us-east-1" + }, + "properties": { "default-base-location": "s3://warehouse-bucket/prod/" } + }' +``` + +The role ARN is validated against the pattern enforced by `AwsStorageConfigurationInfo`; an +ill-formed ARN is rejected at catalog creation time. 
+ +## Server-side encryption with KMS + +When the bucket uses SSE-KMS, supply the key Polaris should use for writes and the full set of +keys it is allowed to read from: + +```json +"storageConfigInfo": { + "storageType": "S3", + "roleArn": "...", + "region": "us-east-1", + "currentKmsKey": "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "allowedKmsKeys": [ + "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "arn:aws:kms:us-east-1:123456789012:key/cccc-dddd" + ] +} +``` + +The IAM role's policy must include `kms:GenerateDataKey` and `kms:Decrypt` on every key listed in +`allowedKmsKeys`, and the key policy must grant the same to the role principal. + +If the deployment does not use KMS, set `kmsUnavailable` to `true` so Polaris will not request +KMS-related session permissions: + +```json +"kmsUnavailable": true +``` + +## S3-compatible endpoints + +Polaris can be pointed at S3-compatible object stores (MinIO, Ceph RGW, Apache Ozone S3 gateway). +The available fields are: + +- `endpoint` — the S3 API endpoint Polaris and its clients should call. +- `endpointInternal` — optional, used by the Polaris server when the in-cluster endpoint differs + from the one returned to clients. +- `pathStyleAccess` — set to `true` for backends that do not support virtual-host-style addressing. +- `stsEndpoint` — STS endpoint; defaults to `endpointInternal` then `endpoint` when not set. +- `stsUnavailable` — set to `true` when the backend does not implement STS. + +How clients receive credentials depends on whether the backend implements STS. + +### Backends with STS support (e.g. AWS S3, MinIO) + +Leave `stsUnavailable` unset (or `false`). Polaris will assume the role and vend short-lived, +subscoped credentials to the client at table-load time when the client sends +`X-Iceberg-Access-Delegation: vended-credentials`. This is the recommended deployment for AWS S3 +and any compatible backend that exposes the STS API. 
+ +```json +"storageConfigInfo": { + "storageType": "S3", + "endpoint": "https://s3.internal.example.com", + "pathStyleAccess": true, + "region": "us-east-1" +} +``` + +### Backends without STS support (e.g. Apache Ozone S3 gateway, Ceph RGW without STS enabled) + +Set `stsUnavailable: true`. Polaris will then skip subscoped credential vending, and clients must +authenticate to the object store directly with long-lived credentials. Because the vended-credential +path is disabled, the client must omit the `X-Iceberg-Access-Delegation` header and supply its own +access key / secret to the underlying FileIO. The Polaris guides for [Apache Ozone][ozone-guide] +and [Ceph][ceph-guide] show this pattern. + +```json +"storageConfigInfo": { + "storageType": "S3", + "endpoint": "https://s3.internal.example.com", + "pathStyleAccess": true, + "stsUnavailable": true, + "region": "us-east-1" +} +``` + +[ozone-guide]: ../../../../guides/ozone/ +[ceph-guide]: ../../../../guides/ceph/ + +## Client configuration + +Engines connect through the Iceberg REST API and let Polaris vend credentials at table-load time; +they do not need static AWS credentials when STS is available. 
+ +Spark example, matching the property names used by the existing MinIO / RustFS guides: + +```shell +bin/spark-sql \ + --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1 \ + --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ + --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.polaris.type=rest \ + --conf spark.sql.catalog.polaris.uri=https://<polaris-host>/api/catalog \ + --conf spark.sql.catalog.polaris.oauth2-server-uri=https://<polaris-host>/api/catalog/v1/oauth/tokens \ + --conf spark.sql.catalog.polaris.token-refresh-enabled=false \ + --conf spark.sql.catalog.polaris.warehouse=warehouse_s3 \ + --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL \ + --conf spark.sql.catalog.polaris.credential=<client-id>:<client-secret> \ + --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials +``` + +The `oauth2-server-uri` is recommended: without it the Iceberg REST client falls back to a +hard-coded `/v1/oauth/tokens` path and logs a deprecation warning, since the automatic fallback +is slated for removal in a future Iceberg release. + +For Trino, use the Iceberg connector with the REST catalog. Two groups of properties are +required: the REST/OAuth2 settings for talking to Polaris, and the native S3 filesystem settings +that Trino uses to read the vended credentials. 
+ +```properties +connector.name=iceberg +iceberg.catalog.type=rest +iceberg.rest-catalog.uri=https://<polaris-host>/api/catalog +iceberg.rest-catalog.warehouse=warehouse_s3 +iceberg.rest-catalog.security=OAUTH2 +iceberg.rest-catalog.oauth2.credential=<client-id>:<client-secret> +iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL +iceberg.rest-catalog.oauth2.server-uri=https://<polaris-host>/api/catalog/v1/oauth/tokens +iceberg.rest-catalog.vended-credentials-enabled=true +fs.native-s3.enabled=true +s3.region=us-east-1 +``` + +When pointing at an S3-compatible endpoint, also set: + +```properties +s3.endpoint=https://s3.internal.example.com +s3.path-style-access=true Review Comment: Is this truly necessary? Polaris should send the endpoint URI (and `s3.path-style-access`) as part of the Catalog Config response to clients 🤔 ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-azure-blob-cloud-storage-specific.md: ########## @@ -0,0 +1,214 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+# +title: Configuring Azure Blob Cloud Storage +linkTitle: Configuring Azure Blob Cloud Storage +type: docs +weight: 620 +--- + +This page covers configuring Azure Blob Storage and Azure Data Lake Storage Gen2 (ADLS Gen2) as +the storage backend for a Polaris catalog. Polaris authenticates against Azure with the credentials +of a service principal that has data-plane access to the target storage account, and then vends +short-lived SAS tokens to clients on each table-load request. + +## Service principal and Polaris credentials + +Polaris uses the Azure SDK's `DefaultAzureCredential` chain, which by default reads the +service-principal credentials from environment variables. Create a service principal with data +access to the storage account and pass its credentials to the Polaris process: + +```bash +# Replace <subscription>, <resource-group>, <storage-account> with your values. +az ad sp create-for-rbac \ + --name polaris-storage \ + --role "Storage Blob Data Contributor" \ + --scopes "/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>" +``` + +The command prints `appId`, `password`, and `tenant`. Set these on the Polaris server: + +```bash +export AZURE_TENANT_ID=<tenant> +export AZURE_CLIENT_ID=<appId> +export AZURE_CLIENT_SECRET=<password> +``` + +In a container deployment, set the same three variables on the Polaris container/pod. The +`Storage Blob Data Contributor` role at storage-account scope is what allows Polaris to issue SAS +tokens for any container under that account; you can scope the role narrower (single container) +when you need to confine a single Polaris catalog to one container. 
+ +## Storage account requirements + +The storage account that backs the catalog should be configured with: + +- **Hierarchical namespace (HNS)** enabled — this turns the account into ADLS Gen2 and is required + for directory-aware operations (rename, recursive list) that Iceberg relies on for atomic + metadata commits. If HNS is disabled, set `hierarchical: false` so Polaris will request flat-blob + permissions only and avoid scoping SAS tokens to non-existent directory ACLs. +- A **container** that will hold the catalog's namespaces and tables (for example `warehouse`). + Polaris does not create the container itself. +- **Firewall** rules that permit traffic from the Polaris control plane and from the engines that + will read the data. SAS tokens do not bypass storage-account firewalls. + +## Catalog storage configuration + +With the service principal in place on the server, create the catalog with the storage account's +tenant ID and the `abfss://` location: + +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "catalog": { + "type": "INTERNAL", + "name": "warehouse_azure", + "properties": { + "default-base-location": "abfss://[email protected]/prod/" + }, + "storageConfigInfo": { + "storageType": "AZURE", + "tenantId": "00000000-0000-0000-0000-000000000000", + "hierarchical": true, + "allowedLocations": [ + "abfss://[email protected]/" + ] + } + } + }' +``` + +`default-base-location` must use the `abfss://` scheme together with the ADLS Gen2 endpoint +(`<account>.dfs.core.windows.net`). The `wasbs://` scheme is not supported. + +`AzureStorageConfigurationInfo` also accepts `multiTenantAppName` and `consentUrl`. These are +used by managed Polaris deployments that present a single multi-tenant Azure AD application to +many customer tenants; in a self-hosted deployment that authenticates with its own service +principal they can be omitted. 
+ +## SAS token scoping and HNS ACLs + +When HNS is enabled (`hierarchical: true`), Polaris narrows each vended SAS token to the directory +that backs the requested namespace or table. The ADLS Gen2 ACL on that directory must include the +service principal as well as any extra principals that should read the data outside of vended +credentials. + +A common failure mode is a token that grants object-level permissions but is denied by a +directory-level ACL. The 403 returned by ADLS includes the path that was denied; align the ACL on +that exact prefix to recover. + +When HNS is disabled, set `hierarchical: false`. Polaris will then issue SAS tokens scoped at the Review Comment: It might be worth mentioning that if the `hierarchical` config is misaligned with the HNS flag in storage, access control errors will likely occur at runtime. ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. + +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. + +## Catalog storage configuration + +Provide the role ARN and region when creating the catalog. `userArn` is the identity Polaris +itself uses (typically the role ARN of the server); `externalId` matches the trust policy above. 
+ +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "type": "INTERNAL", + "name": "warehouse_s3", + "storageConfigInfo": { + "storageType": "S3", + "roleArn": "arn:aws:iam::123456789012:role/polaris-warehouse-access", + "userArn": "arn:aws:iam::123456789012:role/polaris-server", + "externalId": "polaris-prod", + "region": "us-east-1" + }, + "properties": { "default-base-location": "s3://warehouse-bucket/prod/" } + }' +``` + +The role ARN is validated against the pattern enforced by `AwsStorageConfigurationInfo`; an +ill-formed ARN is rejected at catalog creation time. + +## Server-side encryption with KMS + +When the bucket uses SSE-KMS, supply the key Polaris should use for writes and the full set of +keys it is allowed to read from: + +```json +"storageConfigInfo": { + "storageType": "S3", + "roleArn": "...", + "region": "us-east-1", + "currentKmsKey": "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "allowedKmsKeys": [ + "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", Review Comment: Does the current key need to be repeated here?.. TBH, I do not recall... Could you double check? ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. + +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. 
+ +## Catalog storage configuration + +Provide the role ARN and region when creating the catalog. `userArn` is the identity Polaris +itself uses (typically the role ARN of the server); `externalId` matches the trust policy above. Review Comment: I do not think `userArn` is actually used by the current code in communication with AWS S3 or STS APIs. ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. 
Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. + +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. + +## Catalog storage configuration + +Provide the role ARN and region when creating the catalog. `userArn` is the identity Polaris +itself uses (typically the role ARN of the server); `externalId` matches the trust policy above. + +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "type": "INTERNAL", + "name": "warehouse_s3", + "storageConfigInfo": { + "storageType": "S3", + "roleArn": "arn:aws:iam::123456789012:role/polaris-warehouse-access", + "userArn": "arn:aws:iam::123456789012:role/polaris-server", + "externalId": "polaris-prod", + "region": "us-east-1" + }, + "properties": { "default-base-location": "s3://warehouse-bucket/prod/" } + }' +``` + +The role ARN is validated against the pattern enforced by `AwsStorageConfigurationInfo`; an +ill-formed ARN is rejected at catalog creation time. 
+ +## Server-side encryption with KMS + +When the bucket uses SSE-KMS, supply the key Polaris should use for writes and the full set of +keys it is allowed to read from: + +```json +"storageConfigInfo": { + "storageType": "S3", + "roleArn": "...", + "region": "us-east-1", + "currentKmsKey": "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "allowedKmsKeys": [ + "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "arn:aws:kms:us-east-1:123456789012:key/cccc-dddd" + ] +} +``` + +The IAM role's policy must include `kms:GenerateDataKey` and `kms:Decrypt` on every key listed in +`allowedKmsKeys`, and the key policy must grant the same to the role principal. + +If the deployment does not use KMS, set `kmsUnavailable` to `true` so Polaris will not request +KMS-related session permissions: + +```json +"kmsUnavailable": true +``` + +## S3-compatible endpoints + +Polaris can be pointed at S3-compatible object stores (MinIO, Ceph RGW, Apache Ozone S3 gateway). +The available fields are: + +- `endpoint` — the S3 API endpoint Polaris and its clients should call. +- `endpointInternal` — optional, used by the Polaris server when the in-cluster endpoint differs + from the one returned to clients. +- `pathStyleAccess` — set to `true` for backends that do not support virtual-host-style addressing. +- `stsEndpoint` — STS endpoint; defaults to `endpointInternal` then `endpoint` when not set. +- `stsUnavailable` — set to `true` when the backend does not implement STS. + +How clients receive credentials depends on whether the backend implements STS. Review Comment: This contradicts the top-most paragraph on this page - if STS is not available, Polaris will not vend credentials. ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-azure-blob-cloud-storage-specific.md: ########## @@ -0,0 +1,214 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring Azure Blob Cloud Storage +linkTitle: Configuring Azure Blob Cloud Storage +type: docs +weight: 620 +--- + +This page covers configuring Azure Blob Storage and Azure Data Lake Storage Gen2 (ADLS Gen2) as +the storage backend for a Polaris catalog. Polaris authenticates against Azure with the credentials +of a service principal that has data-plane access to the target storage account, and then vends +short-lived SAS tokens to clients on each table-load request. + +## Service principal and Polaris credentials + +Polaris uses the Azure SDK's `DefaultAzureCredential` chain, which by default reads the +service-principal credentials from environment variables. Create a service principal with data +access to the storage account and pass its credentials to the Polaris process: + +```bash +# Replace <subscription>, <resource-group>, <storage-account> with your values. +az ad sp create-for-rbac \ + --name polaris-storage \ + --role "Storage Blob Data Contributor" \ + --scopes "/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>" +``` + +The command prints `appId`, `password`, and `tenant`. 
Set these on the Polaris server: + +```bash +export AZURE_TENANT_ID=<tenant> +export AZURE_CLIENT_ID=<appId> +export AZURE_CLIENT_SECRET=<password> +``` + +In a container deployment, set the same three variables on the Polaris container/pod. The +`Storage Blob Data Contributor` role at storage-account scope is what allows Polaris to issue SAS +tokens for any container under that account; you can scope the role narrower (single container) +when you need to confine a single Polaris catalog to one container. + +## Storage account requirements + +The storage account that backs the catalog should be configured with: + +- **Hierarchical namespace (HNS)** enabled — this turns the account into ADLS Gen2 and is required + for directory-aware operations (rename, recursive list) that Iceberg relies on for atomic + metadata commits. If HNS is disabled, set `hierarchical: false` so Polaris will request flat-blob Review Comment: HNS has nothing to do with commit atomicity in Polaris, AFAIK. ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-azure-blob-cloud-storage-specific.md: ########## @@ -0,0 +1,214 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+# +title: Configuring Azure Blob Cloud Storage +linkTitle: Configuring Azure Blob Cloud Storage +type: docs +weight: 620 +--- + +This page covers configuring Azure Blob Storage and Azure Data Lake Storage Gen2 (ADLS Gen2) as +the storage backend for a Polaris catalog. Polaris authenticates against Azure with the credentials +of a service principal that has data-plane access to the target storage account, and then vends +short-lived SAS tokens to clients on each table-load request. + +## Service principal and Polaris credentials + +Polaris uses the Azure SDK's `DefaultAzureCredential` chain, which by default reads the +service-principal credentials from environment variables. Create a service principal with data +access to the storage account and pass its credentials to the Polaris process: + +```bash +# Replace <subscription>, <resource-group>, <storage-account> with your values. +az ad sp create-for-rbac \ + --name polaris-storage \ + --role "Storage Blob Data Contributor" \ + --scopes "/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>" +``` + +The command prints `appId`, `password`, and `tenant`. Set these on the Polaris server: + +```bash +export AZURE_TENANT_ID=<tenant> +export AZURE_CLIENT_ID=<appId> +export AZURE_CLIENT_SECRET=<password> +``` + +In a container deployment, set the same three variables on the Polaris container/pod. The +`Storage Blob Data Contributor` role at storage-account scope is what allows Polaris to issue SAS +tokens for any container under that account; you can scope the role narrower (single container) +when you need to confine a single Polaris catalog to one container. 
 + +## Storage account requirements + +The storage account that backs the catalog should be configured with: + +- **Hierarchical namespace (HNS)** enabled — this turns the account into ADLS Gen2 and is required + for directory-aware operations (rename, recursive list) that Iceberg relies on for atomic Review Comment: not quite... I do not think HNS is required for technical Polaris/Iceberg operation. HNS is important for downscoping vended credentials, though. Without HNS, vended credentials can be scoped down only to the container level; with HNS, to the directory (folder) level. ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-azure-blob-cloud-storage-specific.md: ########## @@ -0,0 +1,214 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring Azure Blob Cloud Storage +linkTitle: Configuring Azure Blob Cloud Storage +type: docs +weight: 620 +--- + +This page covers configuring Azure Blob Storage and Azure Data Lake Storage Gen2 (ADLS Gen2) as +the storage backend for a Polaris catalog. 
Polaris authenticates against Azure with the credentials +of a service principal that has data-plane access to the target storage account, and then vends +short-lived SAS tokens to clients on each table-load request. + +## Service principal and Polaris credentials + +Polaris uses the Azure SDK's `DefaultAzureCredential` chain, which by default reads the +service-principal credentials from environment variables. Create a service principal with data +access to the storage account and pass its credentials to the Polaris process: + +```bash +# Replace <subscription>, <resource-group>, <storage-account> with your values. +az ad sp create-for-rbac \ + --name polaris-storage \ + --role "Storage Blob Data Contributor" \ + --scopes "/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>" +``` + +The command prints `appId`, `password`, and `tenant`. Set these on the Polaris server: + +```bash +export AZURE_TENANT_ID=<tenant> +export AZURE_CLIENT_ID=<appId> +export AZURE_CLIENT_SECRET=<password> +``` + +In a container deployment, set the same three variables on the Polaris container/pod. The +`Storage Blob Data Contributor` role at storage-account scope is what allows Polaris to issue SAS +tokens for any container under that account; you can scope the role narrower (single container) +when you need to confine a single Polaris catalog to one container. + +## Storage account requirements + +The storage account that backs the catalog should be configured with: + +- **Hierarchical namespace (HNS)** enabled — this turns the account into ADLS Gen2 and is required + for directory-aware operations (rename, recursive list) that Iceberg relies on for atomic + metadata commits. If HNS is disabled, set `hierarchical: false` so Polaris will request flat-blob + permissions only and avoid scoping SAS tokens to non-existent directory ACLs. 
+- A **container** that will hold the catalog's namespaces and tables (for example `warehouse`). + Polaris does not create the container itself. +- **Firewall** rules that permit traffic from the Polaris control plane and from the engines that + will read the data. SAS tokens do not bypass storage-account firewalls. + +## Catalog storage configuration + +With the service principal in place on the server, create the catalog with the storage account's +tenant ID and the `abfss://` location: + +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "catalog": { + "type": "INTERNAL", + "name": "warehouse_azure", + "properties": { + "default-base-location": "abfss://[email protected]/prod/" + }, + "storageConfigInfo": { + "storageType": "AZURE", + "tenantId": "00000000-0000-0000-0000-000000000000", + "hierarchical": true, + "allowedLocations": [ + "abfss://[email protected]/" + ] + } + } + }' +``` + +`default-base-location` must use the `abfss://` scheme together with the ADLS Gen2 endpoint +(`<account>.dfs.core.windows.net`). The `wasbs://` scheme is not supported. + +`AzureStorageConfigurationInfo` also accepts `multiTenantAppName` and `consentUrl`. These are Review Comment: `multiTenantAppName` and `consentUrl` are not actually used by current Apache Polaris code for communicating with Azure APIs, AFAIK. It might make sense for users to put them in the Catalog config for informational purposes, but I do not really see much value in documenting them at this time 🤔 ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. 
+ +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. + +## Catalog storage configuration + +Provide the role ARN and region when creating the catalog. `userArn` is the identity Polaris +itself uses (typically the role ARN of the server); `externalId` matches the trust policy above. + +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "type": "INTERNAL", + "name": "warehouse_s3", + "storageConfigInfo": { + "storageType": "S3", + "roleArn": "arn:aws:iam::123456789012:role/polaris-warehouse-access", + "userArn": "arn:aws:iam::123456789012:role/polaris-server", + "externalId": "polaris-prod", + "region": "us-east-1" + }, + "properties": { "default-base-location": "s3://warehouse-bucket/prod/" } + }' +``` + +The role ARN is validated against the pattern enforced by `AwsStorageConfigurationInfo`; an +ill-formed ARN is rejected at catalog creation time. 
+ +## Server-side encryption with KMS + +When the bucket uses SSE-KMS, supply the key Polaris should use for writes and the full set of +keys it is allowed to read from: + +```json +"storageConfigInfo": { + "storageType": "S3", + "roleArn": "...", + "region": "us-east-1", + "currentKmsKey": "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "allowedKmsKeys": [ + "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "arn:aws:kms:us-east-1:123456789012:key/cccc-dddd" + ] +} +``` + +The IAM role's policy must include `kms:GenerateDataKey` and `kms:Decrypt` on every key listed in +`allowedKmsKeys`, and the key policy must grant the same to the role principal. + +If the deployment does not use KMS, set `kmsUnavailable` to `true` so Polaris will not request +KMS-related session permissions: + +```json +"kmsUnavailable": true +``` + +## S3-compatible endpoints + +Polaris can be pointed at S3-compatible object stores (MinIO, Ceph RGW, Apache Ozone S3 gateway). +The available fields are: + +- `endpoint` — the S3 API endpoint Polaris and its clients should call. +- `endpointInternal` — optional, used by the Polaris server when the in-cluster endpoint differs + from the one returned to clients. +- `pathStyleAccess` — set to `true` for backends that do not support virtual-host-style addressing. +- `stsEndpoint` — STS endpoint; defaults to `endpointInternal` then `endpoint` when not set. +- `stsUnavailable` — set to `true` when the backend does not implement STS. + +How clients receive credentials depends on whether the backend implements STS. + +### Backends with STS support (e.g. AWS S3, MinIO) + +Leave `stsUnavailable` unset (or `false`). Polaris will assume the role and vend short-lived, +subscoped credentials to the client at table-load time when the client sends +`X-Iceberg-Access-Delegation: vended-credentials`. This is the recommended deployment for AWS S3 +and any compatible backend that exposes the STS API. 
+ +```json +"storageConfigInfo": { + "storageType": "S3", + "endpoint": "https://s3.internal.example.com", + "pathStyleAccess": true, + "region": "us-east-1" +} +``` + +### Backends without STS support (e.g. Apache Ozone S3 gateway, Ceph RGW without STS enabled) + +Set `stsUnavailable: true`. Polaris will then skip subscoped credential vending, and clients must +authenticate to the object store directly with long-lived credentials. Because the vended-credential +path is disabled, the client must omit the `X-Iceberg-Access-Delegation` header and supply its own +access key / secret to the underlying FileIO. The Polaris guides for [Apache Ozone][ozone-guide] +and [Ceph][ceph-guide] show this pattern. + +```json +"storageConfigInfo": { + "storageType": "S3", + "endpoint": "https://s3.internal.example.com", + "pathStyleAccess": true, + "stsUnavailable": true, + "region": "us-east-1" +} +``` + +[ozone-guide]: ../../../../guides/ozone/ +[ceph-guide]: ../../../../guides/ceph/ + +## Client configuration + +Engines connect through the Iceberg REST API and let Polaris vend credentials at table-load time; +they do not need static AWS credentials when STS is available. 
 + +Spark example, matching the property names used by the existing MinIO / RustFS guides: + +```shell +bin/spark-sql \ + --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1 \ + --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ + --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.polaris.type=rest \ + --conf spark.sql.catalog.polaris.uri=https://<polaris-host>/api/catalog \ + --conf spark.sql.catalog.polaris.oauth2-server-uri=https://<polaris-host>/api/catalog/v1/oauth/tokens \ + --conf spark.sql.catalog.polaris.token-refresh-enabled=false \ + --conf spark.sql.catalog.polaris.warehouse=warehouse_s3 \ + --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL \ + --conf spark.sql.catalog.polaris.credential=<client-id>:<client-secret> \ Review Comment: This property implies Polaris native authentication (and locally managed principals). This is a valid use case, but it's not the only one. Since these docs make production recommendations, I believe we ought to explain external IdP options too. For now, let's just make it explicit (small paragraph) that this page is limited to native Polaris authentication and other options are available (even if their docs are not ready yet). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
