[Discuss] Correct Handling of Service-Provided Storage Config Properties During Catalog Updates

Rulin Xing Thu, 10 Apr 2025 15:58:05 -0700

Hi folks,

I'd like to initiate a discussion on the expected behavior when updating
catalog properties, specifically around the handling of storage
configuration fields that are automatically provided by the Polaris service.


*Background*
When a catalog is created, certain storage configuration properties are
provided by Polaris itself, so Polaris users don't need to supply them.
Depending on the cloud provider:

   - S3
      - *externalId*: Generated by Polaris if not provided. This is
      immutable.
      - *userArn*: Represents the Polaris service identity, provided by
      Polaris.
   - Azure
      - *consentUrl*: URL used to authorize Polaris to access the user’s
      storage account, generated by Polaris.
      - *multiTenantAppName*: Name of the Polaris client app that must be
      granted permissions to access the specified storage.
   - GCP
      - *gcsServiceAccount*: Represents the Polaris service account.


These values are not required during catalog creation; Polaris sets and
stores them automatically. Users can retrieve them via a GET request
post-creation.

*Workflow:*
Here is the guidance from Open Catalog for creating a catalog:
https://other-docs.snowflake.com/en/opencatalog/create-catalog

To illustrate, consider the scenario of loading an Iceberg table from S3.

1. Before Polaris starts, a long-lived AWS user credential must be
configured for it (via environment variables or configuration properties).
2. Polaris users create a catalog with an S3 storage configuration that
specifies their IAM role.
3. Polaris users send a getCatalog request to retrieve the service-provided
properties (e.g. the IAM user ARN).
4. Polaris users add the IAM user ARN (which represents Polaris) to the
trust relationship of their IAM role so that Polaris can assume the
user-provided IAM role.
5. When Polaris accesses S3, it creates an S3FileIO, which internally uses
an S3 client to send requests to S3. This S3 client uses sub-scoped
storage credentials to read Iceberg table metadata. These credentials are
derived by assuming the customer-provided IAM role. *Polaris, acting as an
IAM user, uses long-lived AWS credentials* (AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY) *to assume this role with a restricted IAM policy
and requests temporary session credentials* (AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) for use during this session.
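
To make step 5 concrete, here is a rough Python sketch of the parameters
such an AssumeRole request could carry. It is not Polaris code: the helper
name, the session name, and the exact S3 actions in the session policy are
assumptions for illustration. Only the parameter names (RoleArn,
RoleSessionName, ExternalId, Policy, DurationSeconds) are those of the AWS
STS AssumeRole API.

```python
import json


def build_assume_role_params(role_arn, external_id, bucket, prefix,
                             duration_seconds=3600):
    """Hypothetical helper: build the arguments for an STS AssumeRole call."""
    # Session policy. The temporary credentials get the intersection of the
    # role's own permissions and this policy, which is what "sub-scoped" means.
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }
    return {
        "RoleArn": role_arn,                   # customer-provided IAM role
        "RoleSessionName": "polaris-session",  # assumed name
        "ExternalId": external_id,             # externalId stored on the catalog
        "Policy": json.dumps(session_policy),
        "DurationSeconds": duration_seconds,
    }
```

With boto3, these could be passed as
`boto3.client("sts").assume_role(**params)`, where the client itself is
authenticated with the long-lived credentials from step 1.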

*Problem:*
Previously, when users submitted an *UpdateCatalogRequest*, the provided
storage configuration would completely replace the existing configuration,
including the service-provided fields. If customers forgot to manually
include the service-provided properties in the new storage configurations,
this unintentionally resulted in the loss of those critical properties.

*Fix*
A recent PR addresses this by ensuring that service-provided fields are
inherited during catalog updates. This prevents accidental loss of these
values and keeps the catalog entity intact.
https://github.com/apache/polaris/pull/1191
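
For readers who want the idea without reading the PR, here is a minimal
Python sketch of the inheritance behavior. It is not the code from the PR:
the function name, the dict-based representation, and the storage type keys
are assumptions. The property names are the ones listed in the Background
section.

```python
# Service-provided properties per storage type (keys "S3", "AZURE", "GCS"
# are assumed labels for this sketch).
SERVICE_PROVIDED_KEYS = {
    "S3": ("externalId", "userArn"),
    "AZURE": ("consentUrl", "multiTenantAppName"),
    "GCS": ("gcsServiceAccount",),
}


def merge_storage_config(existing, update):
    """Return the updated config, inheriting service-provided fields.

    Fields the update omits are copied from the existing config. Fields the
    update does supply are kept as given; this sketch does not enforce
    immutability of externalId.
    """
    merged = dict(update)  # do not mutate the caller's request
    if update.get("storageType") != existing.get("storageType"):
        # Different storage type: the old service-provided values do not apply.
        return merged
    for key in SERVICE_PROVIDED_KEYS.get(existing.get("storageType"), ()):
        if key in existing and key not in merged:
            merged[key] = existing[key]
    return merged
```

For example, an update of `{"storageType": "S3", "roleArn": "new-role"}`
against an existing S3 config would keep the stored externalId and userArn
while replacing roleArn.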

*Open Questions for Discussion:*
*1. Do users need to provide these properties?*

For Open Catalog, users do not need to provide service-generated properties
like userArn, externalId, etc., and Open Catalog will provide them
automatically. However, this leads to a gap in OSS Polaris, where there’s
no existing mechanism to configure these properties.

*2. Where should these properties live? Should we store these properties in
the Catalog Entity, or just inject this information when generating the
loadCatalog response?*

Right now, these properties are persisted in the metastore.

*3. Should we support both catalog-level and service-level userArn?*

From a cost and complexity perspective, supporting catalog-level userArn
would require creating a dedicated AWS user credential per catalog, which
is very expensive and likely unnecessary.

It’s better to rely on the externalId to scope permissions at the catalog
level. Users can then configure their IAM role policies to allow access
only for specific Polaris-generated externalIds, offering sufficient
granularity without credential sprawl.
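
As an illustration of that approach, here is a small Python sketch that
builds the trust policy a user might attach to their IAM role. The helper
name and the example ARN are hypothetical; the policy shape (an
sts:AssumeRole statement with an sts:ExternalId condition) is the standard
AWS form.

```python
import json


def build_trust_policy(polaris_user_arn, external_ids):
    """Hypothetical helper: trust policy scoped to specific externalIds."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Single service-level identity shared by all catalogs.
                "Principal": {"AWS": polaris_user_arn},
                "Action": "sts:AssumeRole",
                # Only requests carrying one of these externalIds may assume
                # the role, which gives catalog-level scoping.
                "Condition": {
                    "StringEquals": {"sts:ExternalId": list(external_ids)}
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_trust_policy(
        "arn:aws:iam::111122223333:user/polaris",  # made-up example ARN
        ["catalog-a-external-id"],
    )
    print(json.dumps(policy, indent=2))
```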

*4. Where and how does Polaris use these properties?*

Taking userArn as an example: Polaris does not use this property directly
in the service logic. Instead, it uses the associated AWS user credentials
to assume the customer’s IAM role. The userArn exists mainly for the
customer’s awareness: they need to know the ARN in order to update the
trust relationship of their IAM role accordingly.

Sorry for the long post, appreciate you making it through! Please feel free
to share your thoughts, suggestions, or any alternative ideas. Happy to
refine our direction based on what makes the most sense.

Best
Rulin
