Currently, we can set table defaults at the catalog level - e.g., we do
this to set the s3 endpoint in tests:
https://github.com/apache/polaris/blob/main/polaris-service/src/test/java/org/apache/polaris/service/catalog/PolarisSparkIntegrationTest.java#L151-L158

The plan for StorageConfigurationInfo has always been to support Namespace
and Table overrides, so it _could_ go there. However, we only inject those
properties in the LoadTableResponse#config field when credentials are
requested. If region is being injected into the S3FileIO, I think it ought
to be injected even if credentials aren't requested. From the Iceberg
client code, it looks like _only_ the LoadTableResponse#config field is
added to the FileIO, so this would require a change to business logic in
the construction of the response.


Mike


On Fri, Sep 13, 2024 at 4:17 AM Robert Stupp <sn...@snazy.de> wrote:

> I think that the particular S3FileIO config and the credential(s) for it
> can differ between tables.
>
> It is (or can be) possible, although not common, that different tables
> use different buckets - which may even be on different object store
> systems (AWS, Minio, Ceph) or even different object stores. Maybe even
> within the same table (e.g. tiered storage - having "cold" data in
> slower, but cheaper storage and "hot" data in faster, but more expensive
> storage). Associating the right object-store config and credentials is
> doable, although it requires some work.
>
> What I wanna say: I think we should design it in a way that allows such
> use cases.
>
>
> On 12.09.24 23:54, Aniket Kulkarni wrote:
> > On Thu, Sep 12, 2024 at 02:38:30PM UTC, Robert Stupp wrote:
> >> What would be the difference of the region specified in/for the
> credentials
> >> and the region that's configured for the S3FileIO (via LoadTableResult)?
> > My point is that _region_ configured for the catalog via
> storageConfigInfo in
> > createCatalog [1] is the one that should be returned in LoadTableResult
> > (client.region). And if the user doesn't specify a region in
> storageConfigInfo
> > then it is infered and then returned in LoadTableResult
> >
> > --
> > aniket
> >
> > [1]
> https://polaris.io/#tag/polaris-management-service_other/operation/createCatalog
> >
> >> On 12.09.24 06:05, Aniket Kulkarni wrote:
> >>> On Wed, Sep 11, 2024 at 10:39:26AM UTC, Yufei Gu wrote:
> >>>>> - the storage setup for S3 should have parameter for the bucket
> region
> >>>>> (org.apache.polaris.core.storage.aws.AwsStorageConfigurationInfo)
> >>>>> - if the parameter is not specified, then Polaris attempts to look up
> >>>>> (get-bucket-location) the region.
> >>>> I'd always prefer the second one. It provides more flexibility. For
> >>>> example, it's likely that the regions are different for two namespaces
> >>>> under the same catalog. Polaris may not support namespace-level or
> >>>> table-level storage location well at this point, but it makes sense
> to do
> >>>> so. This seems like a big topic. I will not expand here, we can
> discuss
> >>>> separately.
> >>> Agreed. I am partial to that option as well. I just wanted to have an
> option
> >>> for the user to specify it in the configuration.
> >>>
> >>>> Plus, the assume-role configured for catalog should have permission to
> >>>> infer the region.
> >>> Agree. We should add this into Polaris docs; currently this is implicit
> >>> and the docs don't explicitly talk about ensuring this permission has
> been
> >>> given. I will raise a PR for that as well.
> >>>
> >>> --
> >>> aniket
> >>>
> >>>> Yufei
> >>>>
> >>>>
> >>>> On Wed, Sep 11, 2024 at 10:03 AM Yufei Gu <flyrain...@gmail.com>
> wrote:
> >>>>
> >>>>> Sorry, just realized that Edward has a new proposal to standardize
> >>>>> credentials, https://github.com/apache/iceberg/pull/10722. In that
> case,
> >>>>> we should add client.region as a field of AWS credential schema. I
> will
> >>>>> comment on it in the PR.
> >>>>>
> >>>>>> Does the Polaris community intend to maintain a separate REST
> protocol
> >>>>> and only sometimes upstream changes?
> >>>>>
> >>>>> I'm with JB for not diverging from the Iceberg REST spec. This is
> mainly
> >>>>> for consistency, and to ensure interoperability across engines.
> >>>>>
> >>>>> Yufei
> >>>>>
> >>>>>
> >>>>> On Wed, Sep 11, 2024 at 9:00 AM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> >>>>> wrote:
> >>>>>
> >>>>>> About your point Ryan, my "view" on that is that Polaris should
> follow
> >>>>>> the REST spec, and the REST spec is at Iceberg.
> >>>>>>
> >>>>>> I don't think it would be a good idea to have a separate REST
> protocol
> >>>>>> in Polaris, especially for engine interoperability about Iceberg
> >>>>>> (engines should know only the Iceberg REST Spec/Client).
> >>>>>> So, I'm more in favor of proposing directly REST spec changes at
> Iceberg.
> >>>>>>
> >>>>>> Thoughts ?
> >>>>>>
> >>>>>> Regards
> >>>>>> JB
> >>>>>>
> >>>>>> On Wed, Sep 11, 2024 at 5:29 PM rdb...@gmail.com <rdb...@gmail.com>
> >>>>>> wrote:
> >>>>>>> Thanks for finding this, Aniket. It sounds like a good thing to
> fix in
> >>>>>> the
> >>>>>>> spec to me.
> >>>>>>>
> >>>>>>> This also brings up a question for the Polaris community. JB said
> he
> >>>>>> would
> >>>>>>> “draft a proposal to update the Iceberg REST Spec *as well*“. Does
> the
> >>>>>>> Polaris community intend to maintain a separate REST protocol and
> only
> >>>>>>> sometimes upstream changes? I think it would be good to have clear
> >>>>>> guidance
> >>>>>>> on this so that it is clear when an update should be taken upstream
> >>>>>> first
> >>>>>>> vs decided here and later standardized.
> >>>>>>>
> >>>>>>> Ryan
> >>>>>>>
> >>>>>>> On Wed, Sep 11, 2024 at 7:34 AM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Aniket
> >>>>>>>>
> >>>>>>>> It looks good to me. I think AwsStorageConfigurationInfo is not
> >>>>>>>> enough, we also need the client.region in the Iceberg REST Spec
> (for
> >>>>>>>> consistency between engines).
> >>>>>>>>
> >>>>>>>> If there's no objection, I would draft a proposal to update the
> >>>>>>>> Iceberg REST Spec as well.
> >>>>>>>> As a workaround, we can store the client.region as property in the
> >>>>>>>> Polaris entity (e.g. table).
> >>>>>>>>
> >>>>>>>> Thoughts ?
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> JB
> >>>>>>>>
> >>>>>>>> On Wed, Sep 11, 2024 at 4:31 AM Aniket Kulkarni
> >>>>>>>> <aniket.kulka...@dremio.com.invalid> wrote:
> >>>>>>>>> For iceberg tables stored in AWS S3 buckets, knowing the region
> of
> >>>>>> the
> >>>>>>>> bucket is critical for engines using vended credentials (when
> >>>>>> configured)
> >>>>>>>> to access a table.
> >>>>>>>>> E.g - the vended credentials for AWS look like this
> >>>>>>>>>
> >>>>>>>>> { "s3.access-key-id": "ASI....”,
> >>>>>>>>>     "s3.secret-access-key": "gbVT9PpFBY...”,
> >>>>>>>>>     "s3.session-token": "IQoJb3JpZ2luX2VjEN3//////////...”,
> >>>>>>>>>      "expiration-time": “1725572949000” }
> >>>>>>>>>
> >>>>>>>>> An engine consuming this, would need to either infer (s3api
> >>>>>>>> get-bucket-location) the region or ask the end user to provide the
> >>>>>> region
> >>>>>>>> separately which misses the point of vended credentials.
> >>>>>>>>> A engine engine cannot use get-bucket-location, because the
> >>>>>> credential
> >>>>>>>> generation explicitly allows only s3:GetObject,
> s3:GetObjectVersion,
> >>>>>>>> s3:PutObject, s3:DeletObject, s3:ListBucket for the table location
> >>>>>> prefix.
> >>>>>>>> Refer -
> >>>>>>>>
> >>>>>>
> org.apache.polaris.core.storage.aws.AwsCredentialsStorageIntegration#policyString
> >>>>>>>>> I propose that
> >>>>>>>>>
> >>>>>>>>> - the storage setup for S3 should have parameter for the bucket
> >>>>>> region
> >>>>>>>> (org.apache.polaris.core.storage.aws.AwsStorageConfigurationInfo)
> >>>>>>>>> - if the parameter is not specified, then Polaris attempts to
> look
> >>>>>> up
> >>>>>>>> (get-bucket-location) the region.
> >>>>>>>>> - the information is returned in vended credentials (if enabled)
> as
> >>>>>>>> "s3.region”:…
> >>>>>>>>> Note - another option could be to allow ’s3:GetBucketLocation’ in
> >>>>>> the
> >>>>>>>> policyString when generating vended credentials’ iam role, but
> that
> >>>>>> is sub
> >>>>>>>> optimal and therefore I am not proposing it. It would engines to
> make
> >>>>>>>> multiple get-bucket-location calls - one per table being looked
> up.
> >>>>>>>>> --
> >>>>>>>>> aniket
> >>>>>>>>>
> >> --
> >> Robert Stupp
> >> @snazy
> >>
> --
> Robert Stupp
> @snazy
>
>

Reply via email to