dimas-b commented on code in PR #4451: URL: https://github.com/apache/polaris/pull/4451#discussion_r3276787082
########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,266 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring S3 Storage +linkTitle: Configuring S3 Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3, and S3-compatible object stores (MinIO, Apache Ozone S3 +gateway, Ceph RGW, and similar), as the storage backend for a Polaris catalog. On AWS S3, all read +and write operations are performed using credential vending: Polaris assumes a customer IAM role +via STS and returns scoped, short-lived credentials to the client. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +This page is limited to native Polaris authentication. External identity providers are also +supported but are not yet covered here; the configuration patterns below remain otherwise the same. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with an `externalId` when one is configured. The + trust policy must accept the same external ID. + +Using `externalId` is recommended for cross-account or hosted Polaris deployments to mitigate the +confused-deputy problem. A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +## Catalog storage configuration + +Provide the role ARN, region, and `externalId` when creating the catalog. The token in the +`Authorization` header below is the Polaris admin bearer token obtained from +`/api/catalog/v1/oauth/tokens` (see [Configuring Polaris for Production]({{% relref "." %}}) for +how to bootstrap and issue admin tokens). + +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "catalog": { + "type": "INTERNAL", + "name": "warehouse_s3", + "properties": { "default-base-location": "s3://warehouse-bucket/prod/" }, + "storageConfigInfo": { + "storageType": "S3", + "roleArn": "arn:aws:iam::123456789012:role/polaris-warehouse-access", + "externalId": "polaris-prod", + "region": "us-east-1" + } + } + }' +``` + +The role ARN is validated against the pattern enforced by `AwsStorageConfigurationInfo`; an +ill-formed ARN is rejected at catalog creation time. + +## Server-side encryption with KMS + +When the bucket uses SSE-KMS, supply both `currentKmsKey` (the key Polaris should use for writes) +and `allowedKmsKeys` (every key the catalog is allowed to read from). The two fields are processed +independently in `AwsCredentialsStorageIntegration`, so the write key must be included in +`allowedKmsKeys` as well if you want it readable through vended credentials: + +```json +"storageConfigInfo": { + "storageType": "S3", + "roleArn": "...", + "region": "us-east-1", + "currentKmsKey": "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "allowedKmsKeys": [ + "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "arn:aws:kms:us-east-1:123456789012:key/cccc-dddd" + ] +} +``` + +The IAM role's policy must include `kms:GenerateDataKey` and `kms:Decrypt` on `currentKmsKey` and +`kms:Decrypt` on every key listed in `allowedKmsKeys`, and each key policy must grant the same to +the role principal. + +If the deployment does not use KMS, set `kmsUnavailable` to `true` so Polaris will not request +KMS-related session permissions: + +```json +"kmsUnavailable": true +``` + +## S3-compatible endpoints + +Polaris can be pointed at S3-compatible object stores (MinIO, Ceph RGW, Apache Ozone S3 gateway). +The available fields are: + +- `endpoint` — the S3 API endpoint Polaris and its clients should call. +- `endpointInternal` — optional, used by the Polaris server when the in-cluster endpoint differs + from the one returned to clients. +- `pathStyleAccess` — set to `true` for backends that do not support virtual-host-style addressing. +- `stsEndpoint` — STS endpoint; defaults to `endpointInternal` then `endpoint` when not set. +- `stsUnavailable` — set to `true` when the backend does not implement STS. + +The credential-vending guarantee at the top of this page assumes that the backend implements STS. +For AWS S3 and S3-compatible backends that expose the STS API (such as MinIO), leave +`stsUnavailable` unset (or `false`) and the vended-credentials flow described above works as is. + +```json +"storageConfigInfo": { + "storageType": "S3", + "endpoint": "https://s3.internal.example.com", + "pathStyleAccess": true, + "region": "us-east-1" +} +``` + +For S3-compatible backends without STS (Apache Ozone S3 gateway, or Ceph RGW without STS enabled), +set `stsUnavailable: true`. Polaris will then skip subscoped credential vending entirely, and the +client must omit `X-Iceberg-Access-Delegation: vended-credentials` and authenticate to the object +store directly. The Polaris guides for [Apache Ozone][ozone-guide] and [Ceph][ceph-guide] show +this pattern end-to-end. + +```json +"storageConfigInfo": { + "storageType": "S3", + "endpoint": "https://s3.internal.example.com", + "pathStyleAccess": true, + "stsUnavailable": true, + "region": "us-east-1" +} +``` + +[ozone-guide]: ../../../../guides/ozone/ +[ceph-guide]: ../../../../guides/ceph/ + +## Client configuration + +Engines connect through the Iceberg REST API and let Polaris vend credentials at table-load time; +they do not need static AWS credentials when STS is available. + +Spark example, matching the property names used by the existing MinIO / RustFS guides: + +```shell +bin/spark-sql \ + --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1 \ + --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ + --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.polaris.type=rest \ + --conf spark.sql.catalog.polaris.uri=https://<polaris-host>/api/catalog \ + --conf spark.sql.catalog.polaris.oauth2-server-uri=https://<polaris-host>/api/catalog/v1/oauth/tokens \ + --conf spark.sql.catalog.polaris.token-refresh-enabled=false \ + --conf spark.sql.catalog.polaris.warehouse=warehouse_s3 \ + --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL \ + --conf spark.sql.catalog.polaris.credential=<client-id>:<client-secret> \ + --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials +``` + +The `oauth2-server-uri` is recommended: without it the Iceberg REST client falls back to a +hard-coded `/v1/oauth/tokens` path and logs a deprecation warning, since the automatic fallback +is slated for removal in a future Iceberg release. + +For Trino, use the Iceberg connector with the REST catalog. The REST/OAuth2 properties talk to +Polaris; the native S3 filesystem properties consume the vended credentials. Polaris itself +returns `endpoint`, `path-style-access`, and `region` in the catalog config response, so the +client-side `s3.*` block below is only needed where Trino requires it to be explicit (for +example, on S3-compatible endpoints where the default AWS endpoint resolver does not apply). + +```properties +connector.name=iceberg +iceberg.catalog.type=rest +iceberg.rest-catalog.uri=https://<polaris-host>/api/catalog +iceberg.rest-catalog.warehouse=warehouse_s3 +iceberg.rest-catalog.security=OAUTH2 +iceberg.rest-catalog.oauth2.credential=<client-id>:<client-secret> +iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL +iceberg.rest-catalog.oauth2.server-uri=https://<polaris-host>/api/catalog/v1/oauth/tokens +iceberg.rest-catalog.vended-credentials-enabled=true +fs.native-s3.enabled=true +s3.region=us-east-1 +``` + +For S3-compatible endpoints, set the endpoint and path-style flag explicitly on the Trino side: + +```properties +s3.endpoint=https://s3.internal.example.com +s3.path-style-access=true +``` + +For PyIceberg, use the `rest` catalog type. The same Polaris-side properties (`uri`, `warehouse`, +`credential`, `scope`, `oauth2-server-uri`) apply, and the vended-credential header must be +forwarded as a REST header: + +```python +from pyiceberg.catalog.rest import RestCatalog + +cat = RestCatalog( + name="polaris", + **{ + "uri": "https://<polaris-host>/api/catalog", + "warehouse": "warehouse_s3", + "credential": "<client-id>:<client-secret>", + "scope": "PRINCIPAL_ROLE:ALL", + "oauth2-server-uri": "https://<polaris-host>/api/catalog/v1/oauth/tokens", + "header.X-Iceberg-Access-Delegation": "vended-credentials", + }, +) +``` + +Polaris returns the vended S3 properties (`s3.access-key-id`, `s3.secret-access-key`, +`s3.session-token`) to the client at table-load time, so static credentials should not be +configured on the PyIceberg side. + +## Verifying the setup + +A successful end-to-end test should be possible without giving the client any long-lived AWS +credentials: + +```sql +CREATE NAMESPACE warehouse_s3.demo; +CREATE TABLE warehouse_s3.demo.t (id BIGINT, name STRING) USING iceberg; +INSERT INTO warehouse_s3.demo.t VALUES (1, 'hello'); +SELECT * FROM warehouse_s3.demo.t; +``` + +If `INSERT` or `SELECT` fails with a 403, the most common causes are: + +- The IAM role's trust policy does not match the `userArn` / `externalId` Polaris is presenting. Review Comment: Did you mean `roleArn` here? ########## site/content/in-dev/unreleased/configuration/configuring-polaris-for-production/configuring-aws-s3-cloud-storage-specific.md: ########## @@ -0,0 +1,263 @@ +--- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +title: Configuring AWS S3 Cloud Storage +linkTitle: Configuring AWS S3 Cloud Storage +type: docs +weight: 610 +--- + +This page covers configuring AWS S3 as the storage backend for a Polaris catalog. All read and write +operations against S3 are performed using credential vending, in which Polaris assumes an IAM role +on behalf of the client and returns scoped, short-lived credentials. The IAM role, its trust policy, +and the bucket itself must be set up before the catalog is created. + +## IAM role and trust policy + +Polaris assumes a customer-managed IAM role via STS when a client requests credentials. The role +must: + +1. Grant the actions required for object access on the bucket and prefix that backs the catalog + (`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, `s3:ListBucket` and, if encryption is in use, + the relevant `kms:*` actions). +2. Trust the Polaris service principal — typically the IAM role that the Polaris server runs as. + Polaris fills the `sts:AssumeRole` request with the configured `userArn` and, when supplied, an + `externalId`. The trust policy must accept both. + +A minimal trust policy looks like: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { "AWS": "arn:aws:iam::123456789012:role/polaris-server" }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { "sts:ExternalId": "polaris-prod" } + } + } + ] +} +``` + +If you do not require an external ID, omit the `Condition` block and the matching `externalId` +field in the storage config. + +## Catalog storage configuration + +Provide the role ARN and region when creating the catalog. `userArn` is the identity Polaris +itself uses (typically the role ARN of the server); `externalId` matches the trust policy above. + +```bash +curl -X POST https://<polaris-host>/management/v1/catalogs \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{ + "type": "INTERNAL", + "name": "warehouse_s3", + "storageConfigInfo": { + "storageType": "S3", + "roleArn": "arn:aws:iam::123456789012:role/polaris-warehouse-access", + "userArn": "arn:aws:iam::123456789012:role/polaris-server", + "externalId": "polaris-prod", + "region": "us-east-1" + }, + "properties": { "default-base-location": "s3://warehouse-bucket/prod/" } + }' +``` + +The role ARN is validated against the pattern enforced by `AwsStorageConfigurationInfo`; an +ill-formed ARN is rejected at catalog creation time. + +## Server-side encryption with KMS + +When the bucket uses SSE-KMS, supply the key Polaris should use for writes and the full set of +keys it is allowed to read from: + +```json +"storageConfigInfo": { + "storageType": "S3", + "roleArn": "...", + "region": "us-east-1", + "currentKmsKey": "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "allowedKmsKeys": [ + "arn:aws:kms:us-east-1:123456789012:key/aaaa-bbbb", + "arn:aws:kms:us-east-1:123456789012:key/cccc-dddd" + ] +} +``` + +The IAM role's policy must include `kms:GenerateDataKey` and `kms:Decrypt` on every key listed in +`allowedKmsKeys`, and the key policy must grant the same to the role principal. + +If the deployment does not use KMS, set `kmsUnavailable` to `true` so Polaris will not request +KMS-related session permissions: + +```json +"kmsUnavailable": true +``` + +## S3-compatible endpoints + +Polaris can be pointed at S3-compatible object stores (MinIO, Ceph RGW, Apache Ozone S3 gateway). +The available fields are: + +- `endpoint` — the S3 API endpoint Polaris and its clients should call. +- `endpointInternal` — optional, used by the Polaris server when the in-cluster endpoint differs + from the one returned to clients. +- `pathStyleAccess` — set to `true` for backends that do not support virtual-host-style addressing. +- `stsEndpoint` — STS endpoint; defaults to `endpointInternal` then `endpoint` when not set. +- `stsUnavailable` — set to `true` when the backend does not implement STS. + +How clients receive credentials depends on whether the backend implements STS. + +### Backends with STS support (e.g. AWS S3, MinIO) + +Leave `stsUnavailable` unset (or `false`). Polaris will assume the role and vend short-lived, +subscoped credentials to the client at table-load time when the client sends +`X-Iceberg-Access-Delegation: vended-credentials`. This is the recommended deployment for AWS S3 +and any compatible backend that exposes the STS API. + +```json +"storageConfigInfo": { + "storageType": "S3", + "endpoint": "https://s3.internal.example.com", + "pathStyleAccess": true, + "region": "us-east-1" +} +``` + +### Backends without STS support (e.g. Apache Ozone S3 gateway, Ceph RGW without STS enabled) + +Set `stsUnavailable: true`. Polaris will then skip subscoped credential vending, and clients must +authenticate to the object store directly with long-lived credentials. Because the vended-credential +path is disabled, the client must omit the `X-Iceberg-Access-Delegation` header and supply its own +access key / secret to the underlying FileIO. The Polaris guides for [Apache Ozone][ozone-guide] +and [Ceph][ceph-guide] show this pattern. + +```json +"storageConfigInfo": { + "storageType": "S3", + "endpoint": "https://s3.internal.example.com", + "pathStyleAccess": true, + "stsUnavailable": true, + "region": "us-east-1" +} +``` + +[ozone-guide]: ../../../../guides/ozone/ +[ceph-guide]: ../../../../guides/ceph/ + +## Client configuration + +Engines connect through the Iceberg REST API and let Polaris vend credentials at table-load time; +they do not need static AWS credentials when STS is available. + +Spark example, matching the property names used by the existing MinIO / RustFS guides: + +```shell +bin/spark-sql \ + --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1 \ + --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ + --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.polaris.type=rest \ + --conf spark.sql.catalog.polaris.uri=https://<polaris-host>/api/catalog \ + --conf spark.sql.catalog.polaris.oauth2-server-uri=https://<polaris-host>/api/catalog/v1/oauth/tokens \ + --conf spark.sql.catalog.polaris.token-refresh-enabled=false \ + --conf spark.sql.catalog.polaris.warehouse=warehouse_s3 \ + --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL \ + --conf spark.sql.catalog.polaris.credential=<client-id>:<client-secret> \ + --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials +``` + +The `oauth2-server-uri` is recommended: without it the Iceberg REST client falls back to a +hard-coded `/v1/oauth/tokens` path and logs a deprecation warning, since the automatic fallback +is slated for removal in a future Iceberg release. + +For Trino, use the Iceberg connector with the REST catalog. Two groups of properties are +required: the REST/OAuth2 settings for talking to Polaris, and the native S3 filesystem settings +that Trino uses to read the vended credentials. + +```properties +connector.name=iceberg +iceberg.catalog.type=rest +iceberg.rest-catalog.uri=https://<polaris-host>/api/catalog +iceberg.rest-catalog.warehouse=warehouse_s3 +iceberg.rest-catalog.security=OAUTH2 +iceberg.rest-catalog.oauth2.credential=<client-id>:<client-secret> +iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL +iceberg.rest-catalog.oauth2.server-uri=https://<polaris-host>/api/catalog/v1/oauth/tokens +iceberg.rest-catalog.vended-credentials-enabled=true +fs.native-s3.enabled=true +s3.region=us-east-1 +``` + +When pointing at an S3-compatible endpoint, also set: + +```properties +s3.endpoint=https://s3.internal.example.com +s3.path-style-access=true Review Comment: @mj006648 : Do you think this is still necessary? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
