Hi folks, I'd like to initiate a discussion on the expected behavior when updating catalog properties, specifically around the handling of storage configuration fields that are automatically provided by the Polaris service.

*Background*

When a catalog is created, certain storage configuration properties are provided by Polaris itself, so Polaris users don't need to supply them. Depending on the cloud provider:

- S3
  - *externalId*: Generated by Polaris if not provided. This is immutable.
  - *userARN*: Represents the Polaris service identity, provided by Polaris.
- Azure
  - *consentUrl*: URL used to authorize Polaris to access the user’s storage account, generated by Polaris.
  - *multiTenantAppName*: Name of the Polaris client app that must be granted permissions to access the specified storage.
- GCP
  - *gcsServiceAccount*: Represents the Polaris service account.

These values are not required during catalog creation; Polaris sets and stores them automatically. Users can retrieve them via a GET request after creation.

*Workflow:*

Here is the guidance from Open Catalog for creating a catalog:
https://other-docs.snowflake.com/en/opencatalog/create-catalog

To illustrate, consider the scenario of loading an Iceberg table from S3.

1. Before spinning up Polaris, a long-lived AWS user credential needs to be configured for Polaris (via environment variables or via some properties).
2. Polaris users create a catalog with an S3 storage configuration that provides the IAM role.
3. Polaris users send a getCatalog request to get the service-provided properties (e.g. the IAM user ARN).
4. Polaris users add the IAM user ARN (which represents Polaris) to the trust relationship of their IAM role so that Polaris can assume the user-provided IAM role.
5. When Polaris accesses S3, it creates an S3FileIO, which internally uses an S3 client to send requests to S3. This S3 client leverages sub-scoped storage credentials to read Iceberg table metadata. These credentials are derived by assuming a customer-provided IAM role.
   *Polaris, acting as an IAM user, uses long-lived AWS credentials* (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) *to assume this role with a restricted IAM policy and requests temporary session credentials* (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) for use during this session.

*Problem:*

Previously, when users submitted an *UpdateCatalogRequest*, the provided storage configuration completely replaced the existing configuration, including the service-provided fields. If customers forgot to include the service-provided properties in the new storage configuration, those critical properties were unintentionally lost.

*Fix*

A recent PR addresses this by ensuring that service-provided fields are inherited during catalog updates. This prevents accidental loss of these values and keeps the catalog entity intact.
https://github.com/apache/polaris/pull/1191

*Open Questions for Discussion:*

*1. Do users need to provide these properties?*

For Open Catalog, users do not need to provide service-generated properties like userArn, externalId, etc.; Open Catalog provides them automatically. However, this leads to a gap in OSS Polaris, where there’s no existing mechanism to configure these properties.

*2. Where should these properties live? Should we store these properties in the Catalog Entity, or do we just inject this info when generating the loadCatalog response?*

Right now, these properties are persisted in the metastore.

*3. Should we support both catalog-level and service-level userArn?*

From a cost and complexity perspective, supporting catalog-level userArn would require creating a dedicated AWS user credential per catalog, which is very expensive and likely unnecessary. It’s better to rely on the externalId to scope permissions at the catalog level.
Users can then configure their IAM role policies to allow access only for specific Polaris-generated externalIds, offering sufficient granularity without credential sprawl.

*4. Where and how does Polaris use these properties?*

Taking userArn as an example: Polaris does not use this property directly in the service logic. Instead, it uses the associated AWS user credentials to assume the customer’s IAM role. The userArn exists mainly for the customer’s awareness: they need to know the ARN in order to update the trust relationship of their IAM role accordingly.

Sorry for the long post, appreciate you making it through! Please feel free to share your thoughts, suggestions, or any alternative ideas. Happy to refine our direction based on what makes the most sense.

Best,
Rulin
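
P.S. In case it helps the discussion, here is a rough Python sketch of the "inherit service-provided fields on update" behavior described under *Fix*. It is not the code from the PR. The function name merge_storage_config, the plain-dict representation of a storage configuration, the storageType key and its S3/AZURE/GCP labels, and the rule that stored values always win are all assumptions made for illustration. Only the field names come from the list in the Background section.

```python
# Hypothetical sketch, not the actual Polaris implementation.
# Field names are taken from the Background section; everything else
# (function name, dict layout, storage type labels) is an assumption.
SERVICE_PROVIDED_FIELDS = {
    "S3": ("externalId", "userArn"),
    "AZURE": ("consentUrl", "multiTenantAppName"),
    "GCP": ("gcsServiceAccount",),
}


def merge_storage_config(existing: dict, update: dict) -> dict:
    """Return the updated config with service-provided fields inherited."""
    merged = dict(update)  # start from what the user sent
    storage_type = existing.get("storageType")
    # Assumption: nothing is inherited if the storage type itself changes.
    if update.get("storageType", storage_type) != storage_type:
        return merged
    for field in SERVICE_PROVIDED_FIELDS.get(storage_type, ()):
        # Assumption: the stored value wins over anything in the update,
        # because these fields are set by the service.
        if field in existing:
            merged[field] = existing[field]
    return merged


existing = {
    "storageType": "S3",
    "roleArn": "arn:aws:iam::111122223333:role/old-role",
    "externalId": "polaris-generated-id",
    "userArn": "arn:aws:iam::444455556666:user/polaris",
}
update = {
    "storageType": "S3",
    "roleArn": "arn:aws:iam::111122223333:role/new-role",
}
merged = merge_storage_config(existing, update)
print(merged)
```

With this shape, an update that only changes roleArn keeps externalId and userArn, which is the outcome the fix is after. Whether the merge should happen at persistence time or only when building the loadCatalog response is exactly open question 2.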