Currently, we can set table defaults at the catalog level - e.g., we do this to set the s3 endpoint in tests: https://github.com/apache/polaris/blob/main/polaris-service/src/test/java/org/apache/polaris/service/catalog/PolarisSparkIntegrationTest.java#L151-L158
The plan for StorageConfigurationInfo has always been to support Namespace and Table overrides, so it _could_ go there. However, we only inject those properties in the LoadTableResponse#config field when credentials are requested. If region is being injected into the S3FileIO, I think it ought to be injected even if credentials aren't requested. From the Iceberg client code, it looks like _only_ the LoadTableResponse#config field is added to the FileIO, so this would require a change to business logic in the construction of the response. Mike On Fri, Sep 13, 2024 at 4:17 AM Robert Stupp <sn...@snazy.de> wrote: > I think that the particular S3FileIO config and the credential(s) for it > can differ between tables. > > It is (or can be) possible, although not common, that different tables > use different buckets - which may even be on different object store > systems (AWS, Minio, Ceph) or even different object stores. Maybe even > within the same table (e.g. tiered storage - having "cold" data in > slower, but cheaper storage and "hot" data in faster, but more expensive > storage). Associating the right object-store config and credentials is > doable, although it requires some work. > > What I wanna say: I think we should design it in a way that allows such > use cases. > > > On 12.09.24 23:54, Aniket Kulkarni wrote: > > On Thu, Sep 12, 2024 at 02:38:30PM UTC, Robert Stupp wrote: > >> What would be the difference of the region specified in/for the > credentials > >> and the region that's configured for the S3FileIO (via LoadTableResult)? > > My point is that _region_ configured for the catalog via > storageConfigInfo in > > createCatalog [1] is the one that should be returned in LoadTableResult > > (client.region). And if the user doesn't specify a region in > storageConfigInfo > > then it is infered and then returned in LoadTableResult > > > > -- > > aniket > > > > [1] > https://polaris.io/#tag/polaris-management-service_other/operation/createCatalog > > > >> On 12.09.24 06:05, Aniket Kulkarni wrote: > >>> On Wed, Sep 11, 2024 at 10:39:26AM UTC, Yufei Gu wrote: > >>>>> - the storage setup for S3 should have parameter for the bucket > region > >>>>> (org.apache.polaris.core.storage.aws.AwsStorageConfigurationInfo) > >>>>> - if the parameter is not specified, then Polaris attempts to look up > >>>>> (get-bucket-location) the region. > >>>> I'd always prefer the second one. It provides more flexibility. For > >>>> example, it's likely that the regions are different for two namespaces > >>>> under the same catalog. Polaris may not support namespace-level or > >>>> table-level storage location well at this point, but it makes sense > to do > >>>> so. This seems like a big topic. I will not expand here, we can > discuss > >>>> separately. > >>> Agreed. I am partial to that option as well. I just wanted to have an > option > >>> for the user to specify it in the configuration. > >>> > >>>> Plus, the assume-role configured for catalog should have permission to > >>>> infer the region. > >>> Agree. We should add this into Polaris docs; currently this is implicit > >>> and the docs don't explicitly talk about ensuring this permission has > been > >>> given. I will raise a PR for that as well. > >>> > >>> -- > >>> aniket > >>> > >>>> Yufei > >>>> > >>>> > >>>> On Wed, Sep 11, 2024 at 10:03 AM Yufei Gu <flyrain...@gmail.com> > wrote: > >>>> > >>>>> Sorry, just realized that Edward has a new proposal to standardize > >>>>> credentials, https://github.com/apache/iceberg/pull/10722. In that > case, > >>>>> we should add client.region as a field of AWS credential schema. I > will > >>>>> comment on it in the PR. > >>>>> > >>>>>> Does the Polaris community intend to maintain a separate REST > protocol > >>>>> and only sometimes upstream changes? > >>>>> > >>>>> I'm with JB for not diverging from the Iceberg REST spec. This is > mainly > >>>>> for consistency, and to ensure interoperability across engines. > >>>>> > >>>>> Yufei > >>>>> > >>>>> > >>>>> On Wed, Sep 11, 2024 at 9:00 AM Jean-Baptiste Onofré < > j...@nanthrax.net> > >>>>> wrote: > >>>>> > >>>>>> About your point Ryan, my "view" on that is that Polaris should > follow > >>>>>> the REST spec, and the REST spec is at Iceberg. > >>>>>> > >>>>>> I don't think it would be a good idea to have a separate REST > protocol > >>>>>> in Polaris, especially for engine interoperability about Iceberg > >>>>>> (engines should know only the Iceberg REST Spec/Client). > >>>>>> So, I'm more in favor of proposing directly REST spec changes at > Iceberg. > >>>>>> > >>>>>> Thoughts ? > >>>>>> > >>>>>> Regards > >>>>>> JB > >>>>>> > >>>>>> On Wed, Sep 11, 2024 at 5:29 PM rdb...@gmail.com <rdb...@gmail.com> > >>>>>> wrote: > >>>>>>> Thanks for finding this, Aniket. It sounds like a good thing to > fix in > >>>>>> the > >>>>>>> spec to me. > >>>>>>> > >>>>>>> This also brings up a question for the Polaris community. JB said > he > >>>>>> would > >>>>>>> “draft a proposal to update the Iceberg REST Spec *as well*“. Does > the > >>>>>>> Polaris community intend to maintain a separate REST protocol and > only > >>>>>>> sometimes upstream changes? I think it would be good to have clear > >>>>>> guidance > >>>>>>> on this so that it is clear when an update should be taken upstream > >>>>>> first > >>>>>>> vs decided here and later standardized. > >>>>>>> > >>>>>>> Ryan > >>>>>>> > >>>>>>> On Wed, Sep 11, 2024 at 7:34 AM Jean-Baptiste Onofré < > j...@nanthrax.net> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Hi Aniket > >>>>>>>> > >>>>>>>> It looks good to me. I think AwsStorageConfigurationInfo is not > >>>>>>>> enough, we also need the client.region in the Iceberg REST Spec > (for > >>>>>>>> consistency between engines). > >>>>>>>> > >>>>>>>> If there's no objection, I would draft a proposal to update the > >>>>>>>> Iceberg REST Spec as well. > >>>>>>>> As a workaround, we can store the client.region as property in the > >>>>>>>> Polaris entity (e.g. table). > >>>>>>>> > >>>>>>>> Thoughts ? > >>>>>>>> > >>>>>>>> Regards > >>>>>>>> JB > >>>>>>>> > >>>>>>>> On Wed, Sep 11, 2024 at 4:31 AM Aniket Kulkarni > >>>>>>>> <aniket.kulka...@dremio.com.invalid> wrote: > >>>>>>>>> For iceberg tables stored in AWS S3 buckets, knowing the region > of > >>>>>> the > >>>>>>>> bucket is critical for engines using vended credentials (when > >>>>>> configured) > >>>>>>>> to access a table. > >>>>>>>>> E.g - the vended credentials for AWS look like this > >>>>>>>>> > >>>>>>>>> { "s3.access-key-id": "ASI....”, > >>>>>>>>> "s3.secret-access-key": "gbVT9PpFBY...”, > >>>>>>>>> "s3.session-token": "IQoJb3JpZ2luX2VjEN3//////////...”, > >>>>>>>>> "expiration-time": “1725572949000” } > >>>>>>>>> > >>>>>>>>> An engine consuming this, would need to either infer (s3api > >>>>>>>> get-bucket-location) the region or ask the end user to provide the > >>>>>> region > >>>>>>>> separately which misses the point of vended credentials. > >>>>>>>>> A engine engine cannot use get-bucket-location, because the > >>>>>> credential > >>>>>>>> generation explicitly allows only s3:GetObject, > s3:GetObjectVersion, > >>>>>>>> s3:PutObject, s3:DeletObject, s3:ListBucket for the table location > >>>>>> prefix. > >>>>>>>> Refer - > >>>>>>>> > >>>>>> > org.apache.polaris.core.storage.aws.AwsCredentialsStorageIntegration#policyString > >>>>>>>>> I propose that > >>>>>>>>> > >>>>>>>>> - the storage setup for S3 should have parameter for the bucket > >>>>>> region > >>>>>>>> (org.apache.polaris.core.storage.aws.AwsStorageConfigurationInfo) > >>>>>>>>> - if the parameter is not specified, then Polaris attempts to > look > >>>>>> up > >>>>>>>> (get-bucket-location) the region. > >>>>>>>>> - the information is returned in vended credentials (if enabled) > as > >>>>>>>> "s3.region”:… > >>>>>>>>> Note - another option could be to allow ’s3:GetBucketLocation’ in > >>>>>> the > >>>>>>>> policyString when generating vended credentials’ iam role, but > that > >>>>>> is sub > >>>>>>>> optimal and therefore I am not proposing it. It would engines to > make > >>>>>>>> multiple get-bucket-location calls - one per table being looked > up. > >>>>>>>>> -- > >>>>>>>>> aniket > >>>>>>>>> > >> -- > >> Robert Stupp > >> @snazy > >> > -- > Robert Stupp > @snazy > >