Note - in this case I'd also be okay with using a single feature configuration flag for both allowing overall STS-endpoint customization and allowing null roleArn for now, even though we'd still likely need to untangle it better in the future.
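
To make the single-flag idea concrete, here is a rough sketch of how one service-level boolean could gate both null roleArn and STS-endpoint customization at validation time. It is illustrative only and not actual Polaris code: the flag name ALLOW_NULL_ROLE_ARN_IN_AWS_STORAGE_CONFIG comes from the quoted message below, while the class, the config stand-in, the method name, and the default of "false" are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not actual Polaris code. Everything except the
// flag name is a hypothetical stand-in.
class RoleArnValidationSketch {

  // Flag name taken from the thread; modeled here as a service-level boolean.
  static final String ALLOW_NULL_ROLE_ARN = "ALLOW_NULL_ROLE_ARN_IN_AWS_STORAGE_CONFIG";

  // Hypothetical stand-in for the S3 storage config submitted at createCatalog time.
  static final class StorageConfigSketch {
    final String roleArn;     // may be null for S3-compatible storage
    final String stsEndpoint; // null means "use the default AWS STS endpoint"

    StorageConfigSketch(String roleArn, String stsEndpoint) {
      this.roleArn = roleArn;
      this.stsEndpoint = stsEndpoint;
    }
  }

  // Returns the validation errors; an empty list means the config is accepted.
  static List<String> validationErrors(
      StorageConfigSketch config, Map<String, Boolean> serviceFlags) {
    // Assumption: the flag defaults to false, preserving the previous strict checks.
    boolean relaxed = serviceFlags.getOrDefault(ALLOW_NULL_ROLE_ARN, false);
    List<String> errors = new ArrayList<>();
    if (!relaxed) {
      if (config.roleArn == null) {
        errors.add("roleArn is required unless " + ALLOW_NULL_ROLE_ARN + " is enabled");
      }
      if (config.stsEndpoint != null) {
        errors.add("custom STS endpoint is not allowed unless " + ALLOW_NULL_ROLE_ARN + " is enabled");
      }
    }
    return errors;
  }
}
```

With the flag unset, a config with roleArn == null is rejected up front, as before; with the flag set to true, both the null roleArn and a custom STS endpoint pass validation. Folding both concerns into one flag is the part we would likely need to untangle later.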
On Fri, Aug 22, 2025 at 7:40 PM Dennis Huo <huoi...@gmail.com> wrote:

> Generally, not *all* "required -> optional" demotions are necessarily bad,
> but this one is problematic for a few reasons detailed below. The TL;DR:
> Polaris has two personas of users - "Polaris service owners" and "Catalog
> users", and the "implied behavior" of roleArn == null requires multi-party
> consent from both personas.
>
> 1. Changing syntax invariants - From a pure syntax standpoint, you're
> right that this is possibly "minor" in that it moves some proactive
> validation into late failure. Service administrators who carefully watch
> the release notes can make adjustments to backend consistency expectations
> and internal playbooks as needed. However, this is one category of things
> where we really should provide better support for Polaris service-runners
> (including out-of-the-box single-tenant Polaris deployments -- this problem
> isn't just for "complicated" service deployments). The problem is that if a
> service owner is running in a mode where they do *not* consent to allowing
> non-AWS S3 storage from their internal Catalog Users, they are still being
> exposed to this change in validation invariants. New NullPointerExceptions
> will start popping up in places they didn't use to, and when the service
> owner then goes to reason about what the behavior of the null roleArn
> entails, there's no documentation on the subtleties of assumeRole vs
> getSessionToken, whether service-level secrets are exposed, etc.
>
> A simple boolean service-level feature configuration flag (not a
> catalog-overridable configuration flag)
> ALLOW_NULL_ROLE_ARN_IN_AWS_STORAGE_CONFIG would solve this problem. Service
> runners who don't want to fork code or even change any code can then easily
> preserve previous S3 behavior by setting that config, whether we default it
> "true" or "false".
>
> 2. Affirmative intent of getSessionToken vs assumeRole - The "implicit"
> difference in semantics of the STS assumeRole call is something we likely
> need to change before the 1.1 release for the overall
> STS-endpoint-customizability anyway. The allowance of null `roleArn` will
> need to be highly situational, precisely because we need to convey *intent*
> sufficiently to make sure backend behavior matches intent, or else
> fails fast.
>
> For example, we could certainly also support self-managed Polaris
> deployments being allowed to use `getSessionToken` instead of `assumeRole`
> during downscoping where no intermediate IAM Role is involved. Here, AWS
> itself does *not* support simply "setting roleArn to 'null' in assumeRole".
> AWS's assumeRole makes "roleArn" *required*. Instead, in the storage config
> we would want to express intent:
> "polaris.config.storage.use.direct.service.identity.downscoping=true", for
> example. Then *situationally* we'd say:
>
>     if (config.get(USE_DIRECT_SERVICE_IDENTITY_DOWNSCOPING)) {
>       // Allow roleArn == null
>       validateConfigForDirectDownscoping(storageConfig);
>     } else {
>       // roleArn must not be null
>       validateConfigForAssumeRole(storageConfig);
>     }
>
> Prematurely letting the null roleArn determine the behavior will cause a
> big cleanup mess in the future in case any users already created
> partially-broken catalogs. How would a service-owner then know whether the
> user just forgot to copy/paste a roleArn into their config, or whether they
> were trying to use DIRECT_SERVICE_IDENTITY_DOWNSCOPING?
>
> The direct-downscoping concept itself is well understood and could easily
> be applied first-class to other providers -- like for GCP, the CAB token
> doesn't necessarily require any "identity-transformation" step as long as
> the right privileges are available to mint a CAB token for "self". In this
> MinIO case, even if we happen to use a variation of "assumeRole(role ==
> null)", it should occur under the new "direct-downscoping" codepath in
> Polaris, so that any other "direct-downscoping" functionality we need
> applies to that codepath equally, rather than letting it "look" identical
> to a "true assumeRole" other than roleArn == null.
>
> 3. Affirmative intent of "real AWS" vs "S3 compatible storage" - Even
> though S3-compatible providers are intended to provide easy drop-in
> replacements, in a production service environment these still have lots of
> different peripheral requirements, decorators, regulatory requirements,
> etc. For example, all network traffic to AWS might need to be funneled
> through a peering connection or PrivateLink. Open-ended network traffic
> might need to be directed to a different auditing proxy.
>
> Importantly, some of these requirements are targeted at the *Polaris
> service owner* and not the *catalog user*. The choice to use non-AWS
> S3-compatible storage is a great feature, but fundamentally requires
> multi-party consent from both the Polaris service owner and the Catalog
> user.
>
> Knowing that for AWS specifically, assumeRole will never allow null
> roleArn, this again means we likely need API syntax that properly reflects
> that situational validation, and to preserve the entirety of AWS-specific
> validation when we know we're using real AWS.
>
> On Fri, Aug 22, 2025 at 7:32 AM Dmitri Bourlatchkov <di...@apache.org>
> wrote:
>
>> Hi Dennis,
>>
>> Thanks for stating the concerns (A,B,C).
>>
>> I'm planning to work in that area for [2207]. I propose to have an
>> in-depth review of that code under that PR (still WIP on my part).
>>
>> However, I'm kind of lost about the relationship of that with making
>> roleArn optional (which is the main topic of this thread).
>>
>> Is roleArn being optional detrimental?
>>
>> From my POV, it enables nicer integration with MinIO use cases in the
>> current codebase (not setting roleArn) while AWS use cases are not
>> affected.
>>
>> The only remote problem might be that users of AWS S3 may forget to set
>> roleArn in the config. However, that will be caught at runtime (failures
>> to Assume Role).
>>
>> WDYT?
>>
>> [2207] https://github.com/apache/polaris/issues/2207
>>
>> Thanks,
>> Dmitri.
>>
>> On Fri, Aug 22, 2025 at 1:38 AM Dennis Huo <huoi...@gmail.com> wrote:
>>
>> > Yeah excellent point, and that definitely highlights the need for a more
>> > comprehensive design for non-AWS S3-compat storage.
>> >
>> > Using the removal of roleArn as an "incidental" fix for a fuzzy subset of
>> > scenarios is probably not how we want to get entrenched for the first
>> > introduction of those features, especially when we didn't even make it
>> > clear in the GitHub issue or the committed code how we expect optional
>> > roleArn to interact with session-token exchange.
>> >
>> > IMO the ability to "assumeRole(null /* roleArn */, sessionPolicy)" should
>> > itself be treated as idiosyncratic to specific storage providers and
>> > paired with some explicit expression of intent both for Polaris
>> > internally as well as for the user.
>> >
>> > From what I can tell, "null assumeRole" in MinIO is more analogous to
>> > "getSessionToken" from AWS, though I'm not too familiar with MinIO so we
>> > should invite some expert opinions on this.
>> >
>> > Right now there are several different concerns rolled up into the single
>> > "getSubscopedCredential" in Polaris:
>> >
>> > A. Indirection between root "service identities" (owned by the Polaris
>> > service owner) and per-Catalog storage-actor identities (owned by the
>> > Catalog administrative user)
>> >    - This indirection *in itself* is an important element of the Polaris
>> >      security model, where service identities do *not* generally have
>> >      latent direct storage-access permissions, but instead hold "actAs"
>> >      or "assumeRole" types of permissions
>> > B. Applying a "subscoping policy" that restricts the blast radius of any
>> > storage credentials that may be used, both in terms of "path prefix" and
>> > in "duration"
>> >    - It's intentional to make Polaris "internal" FileIO go through the
>> >      same subscoping flow as much as possible, so that even when it's
>> >      Polaris writing/reading metadata files, the blast radius matches
>> >      what would be vended out to a sufficiently privileged principal
>> > C. Applying "configuration overrides" related to endpoints, region, etc.
>> > These crept into getSubscopedCredentials due to being "convenient", but
>> > are substantially a different action than credential-minting, though are
>> > closely related because of needing to determine STS endpoints from the
>> > config
>> >
>> > I guess we probably want to refactor so that (C) will *always* happen
>> > correctly, so we'd need to split out some kind of "getDynamicConfig" that
>> > is separate from injecting the *credentials* into the config map.
>> >
>> > It sounds like we have potential use cases for any mix of (A) and (B).
>> >
>> > - Single-tenant use cases may not need "indirection" but may still want
>> >   subscoping both for internal blast-radius management and for
>> >   credential-vending
>> > - Other single-tenant use cases might be okay with neither
>> >   identity-indirection nor subscoping
>> > - I think we've had some discussion about whether to ever allow
>> >   credential-vending without subscoping (i.e. vending long-lived
>> >   credentials)
>> >
>> > On Thu, Aug 21, 2025 at 3:53 AM Alexandre Dutra <adu...@apache.org>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > We just had an issue created by a user that was attempting to do use
>> > > case #2 in Dennis' categorization ("Using DefaultCredentialsProvider
>> > > directly without subscoping to access non-AWS s3-compat storage"):
>> > >
>> > > https://github.com/apache/polaris/issues/2398
>> > >
>> > > This uncovered some interesting findings (at least for me), which
>> > > lead me to think that setting
>> > > SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true is actually not enough,
>> > > and not even recommended in that case. When credentials subscoping is
>> > > disabled, the table config returned to the client not only omits S3
>> > > credentials, which is expected, but also omits some otherwise very
>> > > important S3 settings, such as: s3.endpoint, s3.path-style-access or
>> > > client.region, *even if these were properly configured at the catalog
>> > > level*. As a result, the client is unable to access the MinIO storage
>> > > properly.
>> > >
>> > > For me, use case #2 is just not achievable right now in Polaris.
>> > > Enabling credentials subscoping solves the issue of course, but also
>> > > creates a somewhat artificial link between credentials vending and
>> > > "generic" storage configuration.
>> > >
>> > > Thanks,
>> > > Alex
>> > >
>> > > On Thu, Aug 21, 2025 at 6:18 AM Dennis Huo <huoi...@gmail.com> wrote:
>> > > >
>> > > > Reposting my comment from the GitHub issue here for further
>> > > > discussion:
>> > > >
>> > > > It seems like there are three distinct "new" use cases:
>> > > >
>> > > > 1. Using DefaultCredentialsProvider directly without subscoping to
>> > > > access storage when running on AWS and using AWS S3
>> > > > 2. Using DefaultCredentialsProvider directly without subscoping to
>> > > > access non-AWS s3-compat storage
>> > > > 3. Using DefaultCredentialsProvider directly with subscoping to
>> > > > access non-AWS s3-compat storage
>> > > >
>> > > > These are all different from the "normal" flow:
>> > > >
>> > > > 4. Using DefaultCredentialsProvider as the super-root to assumeRole
>> > > > on a provided role with subscoping to access storage on S3
>> > > >
>> > > > For (1) and (2), setting SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=true
>> > > > is explicitly intended for that use case, though looking at the code
>> > > > it seems we still need to remove "validate" checks for roleArn,
>> > > > otherwise parsing-validation fails at createCatalog time.
>> > > >
>> > > > We should verify that a "dummy" syntactically valid roleArn such as
>> > > > "arn:aws:iam::123456789012:role/my-role" already works for the stated
>> > > > use case even without https://github.com/apache/polaris/pull/2329
>> > > > making roleArn optional if the following is set in
>> > > > application.properties:
>> > > >
>> > > > polaris.features."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION"=true
>> > > >
>> > > > Looking at MinIO, it's certainly very interesting that
>> > > > AssumeRoleWithWebIdentity makes roleArn optional -- it's not 100%
>> > > > clear whether the provided Policy is still applied to the returned
>> > > > token. I'm also not 100% clear on how we map the stsClient to point
>> > > > at WebIdentity vs CustomToken flows for MinIO - for example
>> > > > AssumeRoleWithCustomToken still requires roleArn:
>> > > >
>> > > > https://docs.min.io/enterprise/aistor-object-store/developers/security-token-service/assumerolewithcustomtoken/
>> > > >
>> > > > But assuming the subscoping does work, then (3) is a substantially
>> > > > new flow where the assumeRole indirection is applied, but yet the
>> > > > identity is the service-wide default credentials provider where
>> > > > SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION=false is used despite no
>> > > > roleArn being provided. This new use case would need a separate
>> > > > FeatureConfiguration to prevent multi-tenant deployments from
>> > > > "accidentally" exposing the service identity through vended
>> > > > credentials.
>> > > >
>> > > > On Tue, Aug 12, 2025 at 9:43 AM Dmitri Bourlatchkov
>> > > > <di...@apache.org> wrote:
>> > > >
>> > > > > Making roleArn optional in the REST API is backward compatible and
>> > > > > allows for better UX with non-AWS S3-compatible storage.
>> > > > >
>> > > > > This change looks good to me.
>> > > > >
>> > > > > Cheers,
>> > > > > Dmitri.
>> > > > >
>> > > > > On Tue, Aug 12, 2025 at 5:46 AM Robert Stupp <sn...@snazy.de>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > > Description of the PR: Having the role-arn parameter required
>> > > > > > for a catalog is redundant in many cases and requires the
>> > > > > > generation of an extra role in cases when IRSI (for AWS) is
>> > > > > > being used. Other S3 implementations (MinIO, Ceph, many of the
>> > > > > > appliances) also do not all require a role-ARN.
>> > > > > >
>> > > > > > See issue [1] and PR [2] to fix the issue.
>> > > > > >
>> > > > > > Robert
>> > > > > >
>> > > > > > [1] https://github.com/apache/polaris/issues/2325
>> > > > > > [2] https://github.com/apache/polaris/pull/2329