obelix74 opened a new issue, #4706:
URL: https://github.com/apache/polaris/issues/4706

   ### Is your feature request related to a problem? Please describe.
   
   # Add GCS principal attribution to vended credentials (GCP counterpart of AWS STS session tags)
   
   ## Summary
   
   Polaris can correlate vended-credential data access back to the catalog 
operation that issued the credentials **on AWS** — via 
`SESSION_TAGS_IN_SUBSCOPED_CREDENTIAL`, which stamps `polaris:principal`, 
`polaris:realm`, `polaris:catalog`, etc. as AWS STS session tags that then 
appear in CloudTrail S3 data events. **There is no equivalent on GCP.** GCS 
Data Access audit logs cannot today be tied to the Polaris principal that 
requested the credential, which breaks audit correlation, 
chargeback/attribution, and incident response for GCS-backed catalogs.
   
   This is called out directly in the current code — 
`GcpStorageCredentialCacheKey` says:
   
   > *"GCP downscoped credentials do not support session tags, so principal and 
credential
   > vending context are never included."*
   
   That is accurate: GCP downscoped tokens (Credential Access Boundary) have no 
tag mechanism, and `x-goog-custom-audit-*` request headers only reach the audit 
log if the **client** chooses to send them — which arbitrary Iceberg clients 
(any version, PyIceberg, Trino, raw SDKs) do not. So the metadata cannot ride 
the token or the request.
   
   This issue proposes the one channel that survives an uncontrolled client: 
**the identity of the vended credential itself**, using Workload Identity 
Federation (WIF).
   
   
   
   
   ### Describe the solution you'd like
   
   ## Proposed solution
   
   When configured, insert a federated step ahead of the existing tenant 
service-account impersonation in `GcpCredentialsStorageIntegration`:
   
   ```
   catalog-signed JWT (sub = <realm>/<principal>, realm claim)
     └─> GCP STS token exchange (IdentityPoolCredentials, programmatic subject-token supplier)
           └─> impersonate the configured per-catalog service account (existing path)
                 └─> downscope via Credential Access Boundary (existing path, unchanged)
   ```
   
   The catalog mints a short-lived RS256 JWT whose subject is 
`<realm>/<principal>` and exchanges it at `sts.googleapis.com` against a 
Workload Identity Pool provider. The resulting federated credential is used as 
the impersonation source. Every GCS Data Access audit log entry produced with 
the vended token then carries the Polaris principal in 
`protoPayload.authenticationInfo.serviceAccountDelegationInfo[].principalSubject`,
 **for any client, with no client cooperation**. Combined with `resourceName` 
(which already encodes catalog/namespace/table via Iceberg paths) and the 
per-catalog service account (`principalEmail` = realm), this recovers the same 
correlation fields AWS gets from session tags.
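   
   As a rough illustration of the JWT-minting step described above, the sketch 
below builds and signs an RS256 token whose `sub` is `<realm>/<principal>`. It 
deliberately uses only JDK classes rather than the `com.auth0:java-jwt` library 
this issue proposes, so it is not the proposed implementation. The names 
`AttributionJwtSketch`, `mint`, and `verify` are hypothetical, the `iat`/`exp` 
claims are an assumption, and the plain string concatenation assumes inputs that 
need no JSON escaping.
   
   ```java
   import java.nio.charset.StandardCharsets;
   import java.security.GeneralSecurityException;
   import java.security.PrivateKey;
   import java.security.PublicKey;
   import java.security.Signature;
   import java.util.Base64;

   // Hypothetical helper for illustration; not an existing Polaris class.
   final class AttributionJwtSketch {
     private static final Base64.Encoder B64URL = Base64.getUrlEncoder().withoutPadding();

     // Mints a compact RS256 JWT with sub = <realm>/<principal> and a realm claim.
     static String mint(PrivateKey signingKey, String kid, String issuer, String audience,
         String realm, String principal, long nowEpochSeconds, long ttlSeconds) {
       // Assumption: inputs contain no characters that require JSON escaping.
       String header = "{\"alg\":\"RS256\",\"typ\":\"JWT\",\"kid\":\"" + kid + "\"}";
       String payload = "{\"iss\":\"" + issuer + "\",\"aud\":\"" + audience + "\""
           + ",\"sub\":\"" + realm + "/" + principal + "\",\"realm\":\"" + realm + "\""
           + ",\"iat\":" + nowEpochSeconds + ",\"exp\":" + (nowEpochSeconds + ttlSeconds) + "}";
       String signingInput = encode(header) + "." + encode(payload);
       try {
         Signature signer = Signature.getInstance("SHA256withRSA"); // JWS "RS256"
         signer.initSign(signingKey);
         signer.update(signingInput.getBytes(StandardCharsets.US_ASCII));
         return signingInput + "." + B64URL.encodeToString(signer.sign());
       } catch (GeneralSecurityException e) {
         throw new IllegalStateException("JWT signing failed", e);
       }
     }

     // Checks the signature only; a real verifier would also validate the claims.
     static boolean verify(String jwt, PublicKey publicKey) {
       int lastDot = jwt.lastIndexOf('.');
       if (lastDot < 0) {
         return false;
       }
       try {
         Signature verifier = Signature.getInstance("SHA256withRSA");
         verifier.initVerify(publicKey);
         verifier.update(jwt.substring(0, lastDot).getBytes(StandardCharsets.US_ASCII));
         return verifier.verify(Base64.getUrlDecoder().decode(jwt.substring(lastDot + 1)));
       } catch (GeneralSecurityException e) {
         throw new IllegalStateException("JWT verification failed", e);
       }
     }

     private static String encode(String json) {
       return B64URL.encodeToString(json.getBytes(StandardCharsets.UTF_8));
     }
   }
   ```
   
   The resulting string would be handed to the STS exchange as the subject 
token; that exchange and the later impersonation step are not shown here.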
   
   A useful side effect: per-realm `attribute.realm` IAM bindings on the pool 
make tenant isolation IAM-enforced — a federated identity for realm A cannot 
impersonate realm B's service account.
   
   ### Why reuse the existing context
   
   `CredentialVendingContext` is already provider-neutral and already carries 
`principalName`, `realm`, etc. (it is populated for every cloud by 
`StorageAccessConfigProvider`). No new plumbing into the catalog core is 
needed; only the GCP integration and its cache key need to start consuming 
`principalName`.
   
   ### Configuration (new `FeatureConfiguration` flags)
   
   | Flag | Purpose |
   |------|---------|
   | `GCS_PRINCIPAL_ATTRIBUTION_WIF_AUDIENCE` | Workload Identity Pool provider resource name (STS audience + JWT `aud`) |
   | `GCS_PRINCIPAL_ATTRIBUTION_TOKEN_ISSUER` | `iss` of the minted JWT; must match the provider's configured issuer |
   | `GCS_PRINCIPAL_ATTRIBUTION_SIGNING_KEY_FILE` | PKCS#8 PEM private key used to sign the JWT (public key published in the provider's JWKS) |
   | `GCS_PRINCIPAL_ATTRIBUTION_SIGNING_KEY_ID` | `kid` header so the provider selects the right JWKS key during rotation |
   
   There is intentionally **no on/off boolean**: attribution activates 
automatically once the audience, issuer, and signing-key-file are all set 
(audit attribution should always be on wherever the infrastructure exists), and 
is silently inactive otherwise — the normal state on AWS clusters and on GCP 
clusters not yet provisioned for it. It additionally requires a 
`gcpServiceAccount` on the storage config (the SA to impersonate); a partial 
configuration logs a warning and falls back to the existing, non-attributed 
path.
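   
   The activation rule above could be read as a small three-way decision. The 
sketch below is one such reading, not the proposed code: the class, enum, and 
method names are hypothetical, blank values are treated as unset, and the 
optional signing-key-id flag is assumed not to affect activation.
   
   ```java
   // Hypothetical illustration of the activation rule; not an existing Polaris class.
   final class AttributionActivationSketch {
     enum Mode {
       ACTIVE,            // all required settings present: use the federated path
       INACTIVE,          // nothing configured: silently use the existing path
       PARTIAL_FALLBACK   // incomplete configuration: warn, then use the existing path
     }

     static Mode decide(String wifAudience, String tokenIssuer, String signingKeyFile,
         String gcpServiceAccount) {
       boolean audience = isSet(wifAudience);
       boolean issuer = isSet(tokenIssuer);
       boolean keyFile = isSet(signingKeyFile);
       if (!audience && !issuer && !keyFile) {
         // Normal state on AWS clusters and unprovisioned GCP clusters.
         return Mode.INACTIVE;
       }
       if (audience && issuer && keyFile && isSet(gcpServiceAccount)) {
         return Mode.ACTIVE;
       }
       // Some flags are set but not all, or the service account to impersonate is missing.
       return Mode.PARTIAL_FALLBACK;
     }

     private static boolean isSet(String value) {
       return value != null && !value.isBlank();
     }
   }
   ```
   
   A caller would log the warning only for `PARTIAL_FALLBACK`, which keeps 
unconfigured clusters quiet.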
   
   ### Caching correctness
   
   `GcpStorageCredentialCacheKey` currently excludes the principal (correct 
today, since the vended token is principal-independent). With attribution 
enabled the token is derived from a per-principal federated identity, so the 
cache **must** be keyed on the principal — otherwise principal A's attributed 
token could be served to principal B. The change adds `principalName` to the 
GCP cache key, populated **only when attribution is configured** (preserving 
today's cache efficiency when it is off). This mirrors how the AWS key already 
includes the principal once session tags are enabled.
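   
   To make the cache-key rule concrete, here is a minimal sketch of a key whose 
principal component is populated only when attribution is configured. The 
record and its `realmId` and `storageConfigId` fields are hypothetical 
stand-ins; the real `GcpStorageCredentialCacheKey` has its own fields, which 
this example does not attempt to reproduce.
   
   ```java
   import java.util.Optional;

   // Hypothetical stand-in for the GCP cache key; field names are illustrative only.
   record GcpCacheKeySketch(String realmId, String storageConfigId,
       Optional<String> principalName) {

     static GcpCacheKeySketch of(String realmId, String storageConfigId,
         String principalName, boolean attributionConfigured) {
       // With attribution off, the principal is omitted so that all principals
       // keep sharing one cached token, as they do today.
       Optional<String> principal =
           attributionConfigured ? Optional.of(principalName) : Optional.empty();
       return new GcpCacheKeySketch(realmId, storageConfigId, principal);
     }
   }
   ```
   
   Because records derive `equals` and `hashCode` from their components, two 
principals map to the same entry when attribution is off and to distinct 
entries when it is on.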
   
   ## Scope
   
   - `FeatureConfiguration`: four new flags above.
   - `storage/gcp/`: `GcpAttributionSubjectBuilder` (builds 
`<realm>/<principal>` within GCP's 127-char `google.subject` limit) and 
`GcpFederatedCredentialsExchanger` (JWT mint via `com.auth0:java-jwt`, already 
in the version catalog; STS exchange via google-auth `IdentityPoolCredentials` 
programmatic supplier, so no new HTTP machinery is needed).
   - `GcpStorageCredentialCacheKey`: add `principalName` data field.
   - `GcpCredentialsStorageIntegration`: thread principal into the key when 
configured; perform the federated exchange in `compute()` before impersonation.
   - Tests + a config-doc entry.
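   
   The issue does not say how `GcpAttributionSubjectBuilder` would handle a 
`<realm>/<principal>` longer than 127 characters. The sketch below shows one 
possible approach, stated purely as an assumption: truncate and append a short 
SHA-256 suffix so distinct long names stay distinct. The class name, the `~` 
separator, and the 16-hex-character suffix are all hypothetical, and the length 
check counts Java chars, which equals bytes only for ASCII input.
   
   ```java
   import java.nio.charset.StandardCharsets;
   import java.security.MessageDigest;
   import java.security.NoSuchAlgorithmException;

   // Hypothetical illustration; the overflow strategy is an assumption, not the proposal.
   final class AttributionSubjectSketch {
     static final int MAX_SUBJECT_LENGTH = 127;
     private static final int HASH_HEX_CHARS = 16;

     static String build(String realm, String principal) {
       String full = realm + "/" + principal;
       if (full.length() <= MAX_SUBJECT_LENGTH) {
         return full;
       }
       // Too long: keep a readable prefix and add a hash of the full value.
       String suffix = "~" + sha256Hex(full).substring(0, HASH_HEX_CHARS);
       return full.substring(0, MAX_SUBJECT_LENGTH - suffix.length()) + suffix;
     }

     private static String sha256Hex(String value) {
       try {
         byte[] digest = MessageDigest.getInstance("SHA-256")
             .digest(value.getBytes(StandardCharsets.UTF_8));
         StringBuilder hex = new StringBuilder(digest.length * 2);
         for (byte b : digest) {
           hex.append(String.format("%02x", b));
         }
         return hex.toString();
       } catch (NoSuchAlgorithmException e) {
         throw new IllegalStateException("SHA-256 unavailable", e);
       }
     }
   }
   ```
   
   A truncated subject is no longer directly readable as the principal name, so 
an operator would need a lookup from the hashed form back to the full name when 
correlating audit logs.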
   
   Out of scope: AWS/Azure (unchanged); the `gcs.headers.x-goog-custom-audit-*` 
approach (rejected — requires client cooperation that cannot be assumed across 
arbitrary Iceberg clients).
   
   ### Describe alternatives you've considered
   
   ## Alternatives considered
   
   1. **`x-goog-custom-audit-*` request headers** vended as `gcs.headers.*` 
properties: these only work if the client forwards them, and stock Iceberg 
`GCSFileIO` (≤ 1.11) forwards only `gcs.user-agent`. This fails the "any 
client" requirement, so it cannot be the attribution guarantee.
   2. **Per-principal service accounts**: these would blow past GCP's SA quota 
and destroy credential cache reuse. WIF federated subjects are the scalable 
form of the same idea.
   
   
   
   ### Additional context
   
   ## Operator prerequisites (documented, not code)
   
   - A Workload Identity Pool + OIDC provider (uploaded JWKS, no public issuer 
endpoint required).
   - A signing key whose public half is in the JWKS with a stable `kid`.
   - Per-realm `attribute.realm` → `roles/iam.serviceAccountTokenCreator` 
bindings on the tenant SAs.
   - GCS Data Access audit logs enabled.
   - Mesh egress to `sts.googleapis.com`.

