Re: Heads-up: role-ARN option for S3 becoming optional

Dennis Huo Fri, 22 Aug 2025 19:41:51 -0700

Generally, not *all* "required -> optional" demotions are necessarily bad,
but this one is problematic for a few reasons detailed below. The TL;DR:
Polaris has two personas of users - "Polaris service owners" and "Catalog
users", and the "implied behavior" of roleArn == null requires multi-party
consent from both personas.

1. Changing syntax invariants - From a pure syntax standpoint, you're right
that this is possibly "minor" in that it moves some proactive validation
into late failure. Service administrators who carefully watch the release
notes can make adjustments to backend consistency expectations and internal
playbooks as needed. However, this is one category of change where we really
should provide better support for Polaris service-runners (including
out-of-the-box single-tenant Polaris deployments -- this problem isn't just
for "complicated" service deployments). The problem is that if a service
owner is running in a mode where they do *not* consent to allowing non-AWS
S3 storage from their internal Catalog Users, they are still being exposed
to this change in validation invariants. New NullPointerExceptions will
start popping up in places they didn't use to, and when the service owner
then goes to reason about what the behavior of the null roleArn entails,
there's no documentation on the subtleties of assumeRole vs
getSessionToken, whether service-level secrets are exposed, etc.

A simple boolean service-level feature configuration flag (not a
catalog-overridable configuration flag)
ALLOW_NULL_ROLE_ARN_IN_AWS_STORAGE_CONFIG would solve this problem. Service
runners who don't want to fork code or even change any code can then easily
preserve previous S3 behavior by setting that config, whether we default it
"true" or "false".
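
A minimal sketch of how such a flag could gate validation at createCatalog
time. Only the flag name comes from the proposal above; the class, the
Map-based flag lookup, and the default of "false" are assumptions made for
illustration and are not existing Polaris code:

```java
import java.util.Map;

/** Hypothetical sketch of a service-level gate for a null roleArn. */
final class RoleArnGate {
    // Flag name from the proposal; how Polaris would actually look it up is assumed.
    static final String FLAG = "ALLOW_NULL_ROLE_ARN_IN_AWS_STORAGE_CONFIG";

    /** Rejects a null roleArn up front unless the service owner opted in. */
    static void validateRoleArn(Map<String, Boolean> serviceFlags, String roleArn) {
        // Defaulting to false here is an assumption; the proposal leaves it open.
        boolean allowNull = serviceFlags.getOrDefault(FLAG, false);
        if (roleArn == null && !allowNull) {
            throw new IllegalArgumentException(
                "roleArn is required unless " + FLAG + " is enabled");
        }
    }
}
```

With the flag unset, a null roleArn fails proactively instead of surfacing
later as a NullPointerException or an STS error.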

2. Affirmative intent of getSessionToken vs assumeRole - The "implicit"
difference in semantics of the STS assumeRole call is something we likely
need to change before the 1.1 release for the overall
STS-endpoint-customizability anyway. The allowance of null `roleArn` will
need to be highly situational, precisely because we need to convey *intent*
sufficiently to make sure backend behavior matches intent, or else
fails fast.

For example, we could certainly also support self-managed Polaris
deployments being allowed to use `getSessionToken` instead of `assumeRole`
during downscoping where no intermediate IAM Role is involved. Here, AWS
itself does *not* support simply "setting roleArn to 'null' in assumeRole".
AWS's assumeRole makes "roleArn" *required*. Instead, in the storage config
we would want to express intent:
"polaris.config.storage.use.direct.service.identity.downscoping=true", for
example. Then *situationally* we'd say:

    if (config.get(USE_DIRECT_SERVICE_IDENTITY_DOWNSCOPING)) {
      // Allow roleArn == null
      validateConfigForDirectDownscoping(storageConfig);
    } else {
      // roleArn must not be null
      validateConfigForAssumeRole(storageConfig);
    }

Prematurely letting the null roleArn determine the behavior will cause a
big cleanup mess in the future in case any users already created
partially-broken catalogs. How would a service-owner then know whether the
user just forgot to copy/paste a roleArn into their config, or whether they
were trying to use DIRECT_SERVICE_IDENTITY_DOWNSCOPING?
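
To make that ambiguity concrete, here is a rough sketch of how an explicit
mode lets validation tell the two cases apart. The enum, class, and messages
are hypothetical, and rejecting a non-null roleArn in direct mode is an
assumption rather than something settled in this thread:

```java
import java.util.Optional;

/** Hypothetical sketch: recorded intent disambiguates a null roleArn. */
final class DownscopingIntent {
    enum Mode { ASSUME_ROLE, DIRECT_SERVICE_IDENTITY }

    /** Returns a problem description, or empty if the config matches intent. */
    static Optional<String> diagnose(Mode mode, String roleArn) {
        if (mode == Mode.ASSUME_ROLE && roleArn == null) {
            // Intent says assumeRole, so a null roleArn is a user mistake.
            return Optional.of("roleArn missing for assumeRole");
        }
        if (mode == Mode.DIRECT_SERVICE_IDENTITY && roleArn != null) {
            // Assumed rule: a roleArn contradicts direct downscoping.
            return Optional.of("roleArn set but direct downscoping requested");
        }
        return Optional.empty();
    }
}
```

Without the recorded mode, both branches collapse into "roleArn == null" and
the service owner can only guess.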

The direct-downscoping concept itself is well understood and could easily
be applied first-class to other providers -- like for GCP, the CAB token
doesn't necessarily require any "identity-transformation" step as long as
the right privileges are available to mint a CAB token for "self". In this
MinIO case, even if we happen to use a variation of "assumeRole(role ==
null)", it should occur under the new "direct-downscoping" codepath in
Polaris, so that any other "direct-downscoping" functionality we need
applies to that codepath equally, rather than letting it "look" identical
to a "true assumeRole" other than roleArn == null.

3. Affirmative intent of "real AWS" vs "S3 compatible storage" - Even
though S3-compatible providers are intended to provide easy drop-in
replacements, in a production service environment these still have lots of
different peripheral requirements, decorators, regulatory requirements,
etc. For example, all network traffic to AWS might need to be funneled
through a peering connection or PrivateLink. Open-ended network traffic
might need to be directed to a different auditing proxy.

Importantly, some of these requirements are targeted at the *Polaris
service owner* and not the *catalog user*. The choice to use non-AWS
S3-compatible storage is a great feature, but fundamentally requires
multi-party consent from both the Polaris service owner and the Catalog
user.

Knowing that for AWS specifically, assumeRole will never allow null
roleArn, this again means we likely need API syntax that properly reflects
that situational validation, and to preserve the entirety of AWS-specific
validation when we know we're using real AWS.
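
As a sketch of what that situational validation might look like, assuming a
hypothetical explicit storage-target field set by the catalog user and a
hypothetical service-level switch owned by the service owner (none of these
names exist in Polaris today):

```java
/** Hypothetical sketch of multi-party consent for S3-compatible storage. */
final class S3TargetValidation {
    enum StorageTarget { AWS_S3, S3_COMPATIBLE }

    /**
     * serviceAllowsNonAws: consent from the Polaris service owner (assumed flag).
     * target: intent declared by the catalog user (assumed config field).
     */
    static void validate(boolean serviceAllowsNonAws, StorageTarget target, String roleArn) {
        switch (target) {
            case AWS_S3:
                // Real AWS: assumeRole never accepts a null roleArn, so fail early.
                if (roleArn == null) {
                    throw new IllegalArgumentException("roleArn is required for AWS S3");
                }
                break;
            case S3_COMPATIBLE:
                // Non-AWS storage needs the service owner's consent as well.
                if (!serviceAllowsNonAws) {
                    throw new IllegalStateException(
                        "S3-compatible storage is not enabled for this service");
                }
                break;
        }
    }
}
```

The AWS branch keeps the existing strict checks intact, while the
S3-compatible branch is reachable only when both personas have opted in.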

On Fri, Aug 22, 2025 at 7:32 AM Dmitri Bourlatchkov <[email protected]>
wrote:

> Hi Dennis,
>
> Thanks for stating the concerns (A,B,C).
>
> I'm planning to work in that area for [2207]. I propose to have an in-depth
> review of that code under that PR (still WIP on my part).
>
> However, I'm kind of lost about the relationship of that with making
> roleArn optional (which is the main topic of this thread).
>
> Is roleArn being optional detrimental?
>
> From my POV, it enables nicer integration with MinIO use cases in the
> current codebase (not setting roleArn) at the same time AWS use cases are
> not affected.
>
> The only remote problem might be that users of AWS S3 may miss to set
> roleArn in the config. However, that will be caught in runtime (failures to
> Assume Role).
>
> WDYT?
>
> [2207] https://github.com/apache/polaris/issues/2207
>
> Thanks,
> Dmitri.
>
> On Fri, Aug 22, 2025 at 1:38 AM Dennis Huo <[email protected]> wrote:
>
> > Yeah excellent point, and that definitely highlights the need for a more
> > comprehensive design for non-AWS S3-compat storage.
> >
> > Using the removal of roleArn as an "incidental" fix for a fuzzy subset of
> > scenarios is probably not how we want to get entrenched for the first
> > introduction of those features, especially when we didn't even make it
> > clear in the github issue or the committed code how we expect optional
> > roleArn to interact with session-token exchange.
> >
> > IMO the ability to "assumeRole(null /* roleArn */, sessionPolicy)" should
> > itself be treated as idiosyncratic to specific storage providers and
> > paired with some explicit expression of intent both for Polaris
> > internally as well as for the user.
> >
> > From what I can tell, "null assumeRole" in MinIO is more analogous to
> > "getSessionPolicy" from AWS, though I'm not too familiar with MinIO so we
> > should invite some expert opinions on this.
> >
> > Right now there are several different concerns rolled up into the single
> > "getSubscopedCredential" in Polaris:
> >
> > A. Indirection between root "service identities" (owned by the Polaris
> > service owner) and per-Catalog storage-actor identities (owned by the
> > Catalog administrative user)
> >     -This indirection *in itself* is an important element of the Polaris
> > security model, where service identities do *not* generally have latent
> > direct storage-access permissions, but instead hold "actAs" or
> "assumeRole"
> > types of permissions
> > B. Applying a "subscoping policy" that restricts the blast radius of any
> > storage credentials that may be used, both in terms of "path prefix" and
> in
> > "duration"
> >     -It's intentional to make Polaris "internal" FileIO go through the
> same
> > subscoping flow as much as possible, so that even when it's Polaris
> > writing/reading metadata files, the blast radius matches what would be
> > vended out to a sufficiently privileged principal
> > C. Applying "configuration overrides" related to endpoints, region, etc.
> > These crept into getSubscopedCredentials due to being "convenient", but
> are
> > substantially a different action than credential-minting, though are
> > closely related because of needing to determine STS endpoints from the
> > config
> >
> > I guess we probably want to refactor so that (C) will *always* happen
> > correctly, so we'd need to split out some kind of "getDynamicConfig" that
> > is separate from injecting the *credentials* into the config map.
> >
> > It sounds like we have potential use cases for any mix of (A) and (B).
> >
> > - Single-tenant use cases may not need "indirection" but may still want
> > subscoping both for internal blast-radius management and for
> > credential-vending
> > - Other single-tenant use cases might be okay with neither
> > identity-indirection nor subscoping
> > - I think we've had some discussion about whether to ever allow
> > credential-vending without subscoping (i.e. vending long-lived
> credentials)
> >
> > On Thu, Aug 21, 2025 at 3:53 AM Alexandre Dutra <[email protected]>
> wrote:
> >
> > > Hi,
> > >
> > > We just had an issue created by a user that was attempting to do use
> > > case #2 in Dennis' categorization ("Using DefaultCredentialsProvider
> > > directly without subscoping to access non-AWS s3-compat storage"):
> > >
> > > https://github.com/apache/polaris/issues/2398
> > >
> > > This uncovered some interesting findings (at least for me), which
> > > leads me to think that setting
> > > SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true is actually not enough,
> > > and even not recommended in that case. When credentials subscoping is
> > > disabled, the table config returned to the client not only omits S3
> > > credentials, which is expected, but also omits some otherwise very
> > > important S3 settings, such as: s3.endpoint, s3.path-style-access or
> > > client.region, *even if these were properly configured at the catalog
> > > level*. As a result, the client is unable to access the MinIO storage
> > > properly.
> > >
> > > For me, use case #2 is just not achievable right now in Polaris.
> > > Enabling credentials subscoping solves the issue of course, but also
> > > creates a somewhat artificial link between credentials vending and
> > > "generic" storage configuration.
> > >
> > > Thanks,
> > > Alex
> > >
> > > On Thu, Aug 21, 2025 at 6:18 AM Dennis Huo <[email protected]> wrote:
> > > >
> > > > Reposting my comment from the github issue here for further
> discussion:
> > > >
> > > > It seems like there are three distinct "new" use cases:
> > > >
> > > > 1. Using DefaultCredentialsProvider directly without subscoping to
> > access
> > > > storage when running on AWS and using AWS S3
> > > > 2. Using DefaultCredentialsProvider directly without subscoping to
> > access
> > > > non-AWS s3-compat storage
> > > > 3. Using DefaultCredentialsProvider directly with subscoping to
> access
> > > > non-AWS s3-compat storage
> > > >
> > > >
> > > > These are all different from the "normal" flow:
> > > >
> > > > 4. Using DefaultCredentialsProvider as the super-root to assumeRole
> on
> > a
> > > > provided role with subscoping to access storage on S3
> > > >
> > > > For (1) and (2), setting SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true
> is
> > > > explicitly intended for that use case, though looking at the code it
> > > seems
> > > > we still need to remove "validate" checks for roleARN, otherwise
> > > > parsing-validation fails at createCatalog time.
> > > >
> > > > We should verify that a "dummy" syntactically valid roleArn such as
> > > > "arn:aws:iam::123456789012:role/my-role" already works for the stated
> > use
> > > > case even without https://github.com/apache/polaris/pull/2329 making
> > > > roleArn optional if the following is set in application.properties:
> > > >
> > > >     polaris.features."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION"=true
> > > >
> > > > Looking at MinIO that's certainly very interesting that
> > > > AssumeRoleWithWebIdentity makes roleArn optional -- it's not 100%
> clear
> > > > whether the provided Policy is still applied to the returned token.
> I'm
> > > also
> > > > not 100% clear on how we map the stsClient to point at WebIdentity vs
> > > > CustomToken flows for MinIO - for example AssumeRoleWithCustomToken
> > still
> > > > requires roleArn:
> > > >
> > >
> >
> https://docs.min.io/enterprise/aistor-object-store/developers/security-token-service/assumerolewithcustomtoken/
> > > >
> > > > But assuming the subscoping does work, then (3) is a substantially
> new
> > > flow
> > > > where the assumeRole indirection is applied, but yet the identity is
> > the
> > > > service-wide default credentials provider where
> > > > SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=false is used despite being no
> > > > roleArn provided. This new use case would need a separate
> > > > FeatureConfiguration to avoid multi-tenant deployments from
> > > "accidentally"
> > > > exposing the service identity through vended credentials.
> > > >
> > > > On Tue, Aug 12, 2025 at 9:43 AM Dmitri Bourlatchkov <
> [email protected]>
> > > > wrote:
> > > >
> > > > > Making roleArn optional in the REST API is backward compatible and
> > > allows
> > > > > for better UX with non-AWS S3-compatible storage.
> > > > >
> > > > > This change looks good to me.
> > > > >
> > > > > Cheers,
> > > > > Dmitri.
> > > > >
> > > > > On Tue, Aug 12, 2025 at 5:46 AM Robert Stupp <[email protected]>
> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > > Description of the PR: Having the role-arn parameter required
> > > > > > > for a catalog is redundant in many cases and requires the
> > > > > > > generation of an extra role in cases when IRSI (for AWS) is
> > > > > > > being used. Other S3 implementations (Minio, Ceph, many of the
> > > > > > > appliances) also do not all require a role-ARN.
> > > > > >
> > > > > > See issue [1] and PR [2] to fix the issue.
> > > > > >
> > > > > > Robert
> > > > > >
> > > > > > [1] https://github.com/apache/polaris/issues/2325
> > > > > > [2] https://github.com/apache/polaris/pull/2329
> > > > > >
> > > > >
> > >
> >
>
