Note - in this case I'd also be okay with using a single feature
configuration flag for both allowing overall STS-endpoint customization and
allowing null roleArn for now, even though we'd still likely need to
untangle it better in the future.

On Fri, Aug 22, 2025 at 7:40 PM Dennis Huo <huoi...@gmail.com> wrote:

> Generally, not *all* "required -> optional" demotions are necessarily bad,
> but this one is problematic for a few reasons detailed below. The TL;DR:
> Polaris has two personas of users - "Polaris service owners" and "Catalog
> users", and the "implied behavior" of roleArn == null requires multi-party
> consent from both personas.
>
> 1. Changing syntax invariants - From a pure syntax standpoint, you're
> right that this is possibly "minor" in that it moves some proactive
> validation into late failure. Service administrators who carefully watch
> the release notes can make adjustments to backend consistency expectations
> and internal playbooks as needed. However, this is one category of things
> that we really should provide better support for Polaris service-runners
> (including out-of-the-box single-tenant Polaris deployments -- this problem
> isn't just for "complicated" service deployments). The problem is that if a
> service owner is running in a mode where they do *not* consent to allowing
> non-AWS S3 storage from their internal Catalog Users, they are still being
> exposed to this change in validation invariants. New NullPointerExceptions
> will start popping up in places they didn't used to, and when the service
> owner then goes to reason about what the behavior of the null roleArn
> entails, there's no documentation on the subtleties of assumeRole vs
> getSessionToken, whether service-level secrets are exposed, etc.
>
> A simple boolean service-level feature configuration flag (not a
> catalog-overridable configuration flag)
> ALLOW_NULL_ROLE_ARN_IN_AWS_STORAGE_CONFIG would solve this problem. Service
> runners who don't want to fork code or even change any code can then easily
> preserve previous S3 behavior by setting that config, whether we default it
> "true" or "false".
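
A minimal Java sketch of the flag-gated validation described above. The class and method names (RoleArnValidation, validate) are hypothetical and not taken from the Polaris codebase; only the flag name ALLOW_NULL_ROLE_ARN_IN_AWS_STORAGE_CONFIG comes from the text, and the ARN syntax check is deliberately simplified.

```java
// Hypothetical sketch; not actual Polaris code.
final class RoleArnValidation {
  private RoleArnValidation() {}

  /**
   * Validates the roleArn at catalog-creation time.
   *
   * @param roleArn the roleArn from the storage config, possibly null
   * @param allowNullRoleArn value of the service-level flag
   *     ALLOW_NULL_ROLE_ARN_IN_AWS_STORAGE_CONFIG (assumed not catalog-overridable)
   */
  static void validate(String roleArn, boolean allowNullRoleArn) {
    if (roleArn == null) {
      if (!allowNullRoleArn) {
        // Fail fast here instead of surfacing a NullPointerException later.
        throw new IllegalArgumentException(
            "roleArn is required unless ALLOW_NULL_ROLE_ARN_IN_AWS_STORAGE_CONFIG is enabled");
      }
      return;
    }
    // Simplified syntax check, for illustration only.
    if (!roleArn.startsWith("arn:")) {
      throw new IllegalArgumentException("Malformed roleArn: " + roleArn);
    }
  }
}
```

With the flag set to false, a service owner would keep the previous proactive validation regardless of what catalog users submit.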
>
> 2. Affirmative intent of getSessionToken vs assumeRole - The "implicit"
> difference in semantics of the STS assumeRole call is something we likely
> need to change before 1.1 release for the overall
> STS-endpoint-customizability anyways. The allowance of null `roleArn` will
> need to be highly situational, precisely because we need to convey *intent*
> sufficiently to make sure backend behavior matches intent, or else
> fails-fast.
>
> For example, we could certainly also support self-managed Polaris
> deployments being allowed to use `getSessionToken` instead of `assumeRole`
> during downscoping where no intermediate IAM Role is involved. Here, AWS
> itself does *not* support simply "setting roleArn to 'null' in assumeRole".
> AWS's assumeRole makes "roleArn" *required*. Instead, in the storage config
> we would want to express intent:
> "polaris.config.storage.use.direct.service.identity.downscoping=true", for
> example. Then *situationally* we'd say:
>
>     if (config.get(USE_DIRECT_SERVICE_IDENTITY_DOWNSCOPING)) {
>       // Allow roleArn == null
>       validateConfigForDirectDownscoping(storageConfig);
>     } else {
>       // roleArn must not be null
>       validateConfigForAssumeRole(storageConfig);
>     }
>
> Prematurely letting the null roleArn determine the behavior will cause a
> big cleanup mess in the future in case any users already created
> partially-broken catalogs. How would a service-owner then know whether the
> user just forgot to copy/paste a roleArn into their config, or whether they
> were trying to use DIRECT_SERVICE_IDENTITY_DOWNSCOPING?
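
One way to picture the "forgot the roleArn" vs "meant direct downscoping" distinction is to have the config carry an explicit mode. This is only a sketch under assumptions: the enum, its values, and the check method are hypothetical names, not an existing Polaris API.

```java
// Hypothetical sketch; the enum and method names are illustrative only.
final class DownscopingIntent {
  private DownscopingIntent() {}

  /** The intent the catalog user states explicitly in the storage config. */
  enum Mode {
    ASSUME_ROLE,
    DIRECT_SERVICE_IDENTITY
  }

  /** Returns null when the config matches the stated intent, else an error message. */
  static String check(Mode mode, String roleArn) {
    boolean hasRole = roleArn != null && !roleArn.isEmpty();
    switch (mode) {
      case ASSUME_ROLE:
        // A missing roleArn here is a mistake, never an implied mode switch.
        return hasRole ? null : "ASSUME_ROLE requires a roleArn";
      case DIRECT_SERVICE_IDENTITY:
        // roleArn == null is allowed because the intent was stated explicitly.
        return null;
      default:
        throw new IllegalStateException("Unknown mode: " + mode);
    }
  }
}
```

Because the mode is stated rather than inferred, a null roleArn under ASSUME_ROLE can be reported as a user error at createCatalog time.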
>
> The direct-downscoping concept itself is well understood and could easily
> be applied first-class to other providers -- like for GCP, the CAB token
> doesn't necessarily require any "identity-transformation" step as long as
> the right privileges are available to mint a CAB token for "self". In this
> MinIO case, even if we happen to use a variation of "assumeRole(role ==
> null)", it should occur under the new "direct-downscoping" codepath in
> Polaris, so that any other "direct-downscoping" functionality we need
> applies to that codepath equally, rather than letting it "look" identical
> to a "true assumeRole" other than roleArn == null.
>
> 3. Affirmative intent of "real AWS" vs "S3 compatible storage" - Even
> though S3-compatible providers are intended to provide easy drop-in
> replacements, in a production service environment these still have lots of
> different peripheral requirements, decorators, regulatory requirements,
> etc. For example, all network traffic to AWS might need to be funneled
> through a peering connection or PrivateLink. Open-ended network traffic
> might need to be directed to a different auditing proxy.
>
> Importantly, some of these requirements are targeted at the *Polaris
> service owner* and not the *catalog user*. The choice to use non-AWS
> S3-compatible storage is a great feature, but fundamentally requires
> multi-party consent from both the Polaris service owner and the Catalog
> user.
>
> Knowing that for AWS specifically, assumeRole will never allow null
> roleArn, this again means we likely need API syntax that properly reflects
> that situational validation, and to preserve the entirety of AWS-specific
> validation when we know we're using real AWS.
>
> On Fri, Aug 22, 2025 at 7:32 AM Dmitri Bourlatchkov <di...@apache.org>
> wrote:
>
>> Hi Dennis,
>>
>> Thanks for stating the concerns (A,B,C).
>>
>> I'm planning to work in that area for [2207]. I propose to have an
>> in-depth review of that code under that PR (still WIP on my part).
>>
>> However, I'm kind of lost about the relationship of that with making
>> roleArn optional (which is the main topic of this thread).
>>
>> Is roleArn being optional detrimental?
>>
>> From my POV, it enables nicer integration with MinIO use cases in the
>> current codebase (not setting roleArn) at the same time AWS use cases are
>> not affected.
>>
>> The only remote problem might be that users of AWS S3 may forget to set
>> roleArn in the config. However, that will be caught at runtime (failures
>> to Assume Role).
>>
>> WDYT?
>>
>> [2207] https://github.com/apache/polaris/issues/2207
>>
>> Thanks,
>> Dmitri.
>>
>> On Fri, Aug 22, 2025 at 1:38 AM Dennis Huo <huoi...@gmail.com> wrote:
>>
>> > Yeah excellent point, and that definitely highlights the need for a more
>> > comprehensive design for non-AWS S3-compat storage.
>> >
>> > Using the removal of roleArn as an "incidental" fix for a fuzzy subset
>> > of scenarios is probably not how we want to get entrenched for the first
>> > introduction of those features, especially when we didn't even make it
>> > clear in the github issue or the committed code how we expect optional
>> > roleArn to interact with session-token exchange.
>> >
>> > IMO the ability to "assumeRole(null /* roleArn */, sessionPolicy)"
>> > should itself be treated as idiosyncratic to specific storage providers
>> > and paired with some explicit expression of intent both for Polaris
>> > internally as well as for the user.
>> >
>> > From what I can tell, "null assumeRole" in MinIO is more analogous to
>> > "getSessionToken" from AWS, though I'm not too familiar with MinIO so
>> > we should invite some expert opinions on this.
>> >
>> > Right now there are several different concerns rolled up into the single
>> > "getSubscopedCredential" in Polaris:
>> >
>> > A. Indirection between root "service identities" (owned by the Polaris
>> > service owner) and per-Catalog storage-actor identities (owned by the
>> > Catalog administrative user)
>> >     -This indirection *in itself* is an important element of the Polaris
>> > security model, where service identities do *not* generally have latent
>> > direct storage-access permissions, but instead hold "actAs" or
>> > "assumeRole" types of permissions
>> > B. Applying a "subscoping policy" that restricts the blast radius of any
>> > storage credentials that may be used, both in terms of "path prefix"
>> > and in "duration"
>> >     -It's intentional to make Polaris "internal" FileIO go through the
>> > same subscoping flow as much as possible, so that even when it's Polaris
>> > writing/reading metadata files, the blast radius matches what would be
>> > vended out to a sufficiently privileged principal
>> > C. Applying "configuration overrides" related to endpoints, region, etc.
>> > These crept into getSubscopedCredentials due to being "convenient", but
>> > are substantially a different action than credential-minting, though
>> > they are closely related because of needing to determine STS endpoints
>> > from the config
>> >
>> > I guess we probably want to refactor so that (C) will *always* happen
>> > correctly, so we'd need to split out some kind of "getDynamicConfig"
>> > that is separate from injecting the *credentials* into the config map.
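
A rough Java sketch of that split, assuming hypothetical names throughout (StorageAccessConfig, dynamicConfig, tableConfig are not existing Polaris methods). The property keys s3.endpoint, s3.path-style-access and client.region are the ones mentioned elsewhere in this thread; the credentials map is treated as opaque.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not actual Polaris code.
final class StorageAccessConfig {
  private StorageAccessConfig() {}

  /** Non-secret settings (concern C): computed regardless of credential vending. */
  static Map<String, String> dynamicConfig(String endpoint, String region, boolean pathStyle) {
    Map<String, String> props = new HashMap<>();
    if (endpoint != null) {
      props.put("s3.endpoint", endpoint);
    }
    if (region != null) {
      props.put("client.region", region);
    }
    props.put("s3.path-style-access", Boolean.toString(pathStyle));
    return props;
  }

  /** Builds the table config; credentials are added only when subscoping is enabled. */
  static Map<String, String> tableConfig(
      Map<String, String> dynamicConfig,
      Map<String, String> credentials,
      boolean skipSubscoping) {
    Map<String, String> result = new HashMap<>(dynamicConfig);
    if (!skipSubscoping) {
      result.putAll(credentials);
    }
    return result;
  }
}
```

The point of the separation is that endpoint and region settings still reach the client when subscoping is skipped, while credentials do not.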
>> >
>> > It sounds like we have potential use cases for any mix of (A) and (B).
>> >
>> > - Single-tenant use cases may not need "indirection" but may still want
>> > subscoping both for internal blast-radius management and for
>> > credential-vending
>> > - Other single-tenant use cases might be okay with neither
>> > identity-indirection nor subscoping
>> > - I think we've had some discussion about whether to ever allow
>> > credential-vending without subscoping (i.e. vending long-lived
>> > credentials)
>> >
>> > On Thu, Aug 21, 2025 at 3:53 AM Alexandre Dutra <adu...@apache.org>
>> wrote:
>> >
>> > > Hi,
>> > >
>> > > We just had an issue created by a user that was attempting to do use
>> > > case #2 in Dennis' categorization ("Using DefaultCredentialsProvider
>> > > directly without subscoping to access non-AWS s3-compat storage"):
>> > >
>> > > https://github.com/apache/polaris/issues/2398
>> > >
>> > > This uncovered some interesting findings (at least for me), which
>> > > leads me to think that setting
>> > > SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true is actually not enough,
>> > > and even not recommended in that case. When credentials subscoping is
>> > > disabled, the table config returned to the client not only omits S3
>> > > credentials, which is expected, but also omits some otherwise very
>> > > important S3 settings, such as: s3.endpoint, s3.path-style-access or
>> > > client.region, *even if these were properly configured at the catalog
>> > > level*. As a result, the client is unable to access the MinIO storage
>> > > properly.
>> > >
>> > > For me, use case #2 is just not achievable right now in Polaris.
>> > > Enabling credentials subscoping solves the issue of course, but also
>> > > creates a somewhat artificial link between credentials vending and
>> > > "generic" storage configuration.
>> > >
>> > > Thanks,
>> > > Alex
>> > >
>> > > On Thu, Aug 21, 2025 at 6:18 AM Dennis Huo <huoi...@gmail.com> wrote:
>> > > >
>> > > > Reposting my comment from the github issue here for further
>> > > > discussion:
>> > > >
>> > > > It seems like there are three distinct "new" use cases:
>> > > >
>> > > > 1. Using DefaultCredentialsProvider directly without subscoping to
>> > > > access storage when running on AWS and using AWS S3
>> > > > 2. Using DefaultCredentialsProvider directly without subscoping to
>> > > > access non-AWS s3-compat storage
>> > > > 3. Using DefaultCredentialsProvider directly with subscoping to
>> > > > access non-AWS s3-compat storage
>> > > >
>> > > >
>> > > > These are all different from the "normal" flow:
>> > > >
>> > > > 4. Using DefaultCredentialsProvider as the super-root to assumeRole
>> > > > on a provided role with subscoping to access storage on S3
>> > > >
>> > > > For (1) and (2), setting SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true
>> > > > is explicitly intended for that use case, though looking at the code
>> > > > it seems we still need to remove "validate" checks for roleARN,
>> > > > otherwise parsing-validation fails at createCatalog time.
>> > > >
>> > > > We should verify that a "dummy" syntactically valid roleArn such as
>> > > > "arn:aws:iam::123456789012:role/my-role" already works for the
>> > > > stated use case even without
>> > > > https://github.com/apache/polaris/pull/2329 making roleArn optional
>> > > > if the following is set in application.properties:
>> > > >
>> > > >     polaris.features."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION"=true
>> > > >
>> > > > Looking at MinIO that's certainly very interesting that
>> > > > AssumeRoleWithWebIdentity makes roleArn optional -- it's not 100%
>> > > > clear whether the provided Policy is still applied to the returned
>> > > > token. I'm also not 100% clear on how we map the stsClient to point
>> > > > at WebIdentity vs CustomToken flows for MinIO - for example
>> > > > AssumeRoleWithCustomToken still requires roleArn:
>> > > >
>> > > > https://docs.min.io/enterprise/aistor-object-store/developers/security-token-service/assumerolewithcustomtoken/
>> > > >
>> > > > But assuming the subscoping does work, then (3) is a substantially
>> > > > new flow where the assumeRole indirection is applied, but yet the
>> > > > identity is the service-wide default credentials provider where
>> > > > SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=false is used despite no
>> > > > roleArn being provided. This new use case would need a separate
>> > > > FeatureConfiguration to prevent multi-tenant deployments from
>> > > > "accidentally" exposing the service identity through vended
>> > > > credentials.
>> > > >
>> > > > On Tue, Aug 12, 2025 at 9:43 AM Dmitri Bourlatchkov <
>> di...@apache.org>
>> > > > wrote:
>> > > >
>> > > > > Making roleArn optional in the REST API is backward compatible and
>> > > > > allows for better UX with non-AWS S3-compatible storage.
>> > > > >
>> > > > > This change looks good to me.
>> > > > >
>> > > > > Cheers,
>> > > > > Dmitri.
>> > > > >
>> > > > > On Tue, Aug 12, 2025 at 5:46 AM Robert Stupp <sn...@snazy.de>
>> wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > > Description of the PR: Having the role-arn parameter required
>> > > > > > for a catalog is redundant in many cases and requires the
>> > > > > > generation of an extra role in cases when IRSI (for AWS) is
>> > > > > > being used. Other S3 implementations (Minio, Ceph, many of the
>> > > > > > appliances) also do not all require a role-ARN.
>> > > > > >
>> > > > > > See issue [1] and PR [2] to fix the issue.
>> > > > > >
>> > > > > > Robert
>> > > > > >
>> > > > > > [1] https://github.com/apache/polaris/issues/2325
>> > > > > > [2] https://github.com/apache/polaris/pull/2329
>> > > > > >
>> > > > >
>> > >
>> >
>>
>
