Re: [DISCUSS] REST: Scan Planning mode

Péter Váry Wed, 14 Jan 2026 07:38:23 -0800

Hi Dan,

> While it is possible and may feel like it would prevent interoperability,
> that would be easily circumvented by just copying the entire contents of
> the table through scan/plan.


This enables the user to recreate a snapshot of the table, but it does not
provide the full history or complete table metadata. It is also
significantly more involved than simply calling the register table
operation.

> REST Catalog implementations have always been able to restrict access to
> physical storage regardless of whether a client could load the table
> metadata or not.

Previously, this was primarily a matter of gaining access to the underlying
storage. With the introduction of CATALOG_ONLY tables, storing Iceberg
metadata files is no longer required for any operation.

> there are lots of different ways closed systems can restrict access
> already (e.g. jdbc only or proprietary APIs), so I don't feel like this is
> changing that dynamic.

I’m not sure I understand this. Could you please provide more details?

The goal, as I understand it, is that if a Catalog implements the Iceberg
specification, migration to and from this Catalog should be possible with
any other Catalog that adheres to the same specification. Introducing
CATALOG_ONLY tables, however, feels like another step away from
interoperability.

> I think the motivation behind catalog only mode is more for cases where
> the underlying data is either in a different representation or is being
> adapted on-the-fly.  For example, if you wanted to expose a table from a
> database that can export data to parquet, but doesn't natively support
> Iceberg as a format, you can hide that behind scan plan interfaces.

Using the Scan Planning interface has been optional until now, but with the
introduction of CATALOG_ONLY tables, it becomes mandatory. As a result,
compliant engines will need to implement it.

> There may not be a full representation of the table metadata but using a
> subset of Iceberg primitives, you can still achieve interoperability (at
> least for read).

In earlier discussions, we agreed that tables should not implement only a
subset of the Iceberg specification. This proposal seems to move in a
different direction. While I’m not opposed to the feature and recognize the
benefits of integrating non-Iceberg tables into Iceberg catalogs and making
them queryable by compatible engines, I believe it would be useful to
clarify our current understanding of the boundaries and the level of
feature parity we aim to maintain. Establishing this would provide a
consistent framework for evaluating similar proposals going forward.

This seems like a good candidate for today’s catalog sync discussion.

Thanks,
Peter

Daniel Weeks <[email protected]> wrote (on Wed, Jan 14, 2026, 0:23):

> I don't feel we should be too concerned about catalogs switching to a
> "catalog only" mode and not providing direct access.  While it is possible
> and may feel like it would prevent interoperability, that would be
> easily circumvented by just copying the entire contents of the table
> through scan/plan.  I wouldn't agree there was implied access just by
> having a metadata-location field either.  REST Catalog implementations have
> always been able to restrict access to physical storage regardless of
> whether a client could load the table metadata or not.  I understand the
> concern about lock-in, but there are lots of different ways closed systems
> can restrict access already (e.g. jdbc only or proprietary APIs), so I
> don't feel like this is changing that dynamic.
>
> I think the motivation behind catalog only mode is more for cases where
> the underlying data is either in a different representation or is being
> adapted on-the-fly.  For example, if you wanted to expose a table from a
> database that can export data to parquet, but doesn't natively support
> Iceberg as a format, you can hide that behind scan plan interfaces.  There
> may not be a full representation of the table metadata but using a subset
> of Iceberg primitives, you can still achieve interoperability (at least for
> read).
>
> Introducing modes is just a way to express the intent/availability for the
> scan plan and coordinate between the client and server, but I don't think
> it really affects whether a client could be prevented from reading table
> data directly (a catalog can do that regardless).
>
> I would add that I don't think the spec should include anything about the
> client modes (I added a comment to the PR on this).  The spec should only
> indicate what the server can return and what the expectations should be for
> a client.  What a client implements and what configurations it exposes is
> more of a client-side implementation detail and should not be part of the
> spec.
>
>
> -Dan
>
>
> On Tue, Jan 13, 2026 at 11:07 AM Prashant Singh <[email protected]>
> wrote:
>
>> Hello Peter,
>> Thank you for the feedback.
>>
>> IIUC, you mean that under one interpretation the location could point to a
>> dummy file which, in the worst case, simply does not exist? Sure, I believe
>> we can be explicit there to avoid this.
>> Note: this predates this proposal, though, and I am happy to take a stab at
>> being explicit here.
>>
>> > users were required to have direct read access to the metadata files in
>> order to plan queries on the table. That implied an access requirement,
>> even though it was never explicitly documented
>>
>> While the requirement is true, not every user would get credentials to do
>> so. It was strictly based on whether the user is authorized to read the
>> table according to the privileges defined in the catalog. loadTable's
>> credentials were optional, meaning a catalog could very well not vend any
>> credentials despite the client sending X-Iceberg-Access-Delegation, due to
>> this [1], and hence it can cut off any client if it wants to. I believe the
>> flexibility is there because we don't define authorization in the IRC spec.
>> As I said, the admin is the one who gave the catalog access to storage in
>> the first place, so the admin can very well revoke that access and migrate
>> to a different catalog if the culprit catalog misbehaves by forcing every
>> table to be planned through itself and doesn't fix it.
>>
>> > Maybe we add a sentence in the spec to enforce that there should be
>> some users where the catalog MUST provide access to the metadata files.
>>
>> Regarding the original feedback, there will always be an ADMIN user who
>> configured the catalog in the first place with the storage permission
>> (let's say by providing the IAM and establishing the trust relationship) and
>> who can get hold of the storage and access those metadata files directly.
>> So some such users exist implicitly.
>>
>> I believe that by introducing CATALOG only mode for planning on top of the
>> existing assumptions, we are not introducing new ways to trap end users in
>> vendor lock-in. As has always been the case, a user has a way to walk out
>> of it with the existing constructs.
>>
>> Please let me know what you think, considering the above.
>>
>> [1]
>> https://github.com/apache/iceberg/blob/fc434997fbc63a3f1f47481c0878073b1ccf6359/open-api/rest-catalog-open-api.yaml#L1886-L1887
>>
>> Best,
>> Prashant Singh
>>
>> On Tue, Jan 13, 2026 at 6:11 AM Péter Váry <[email protected]>
>> wrote:
>>
>>> Hi Prashant,
>>>
>>> The specification states:
>>>
>>>> The corresponding file location of table metadata should be returned in
>>>> the `metadata-location` field
>>>
>>>
>>> However, it does not specify that this location must be readable by any
>>> users. (Perhaps this is something we should revisit and clarify going
>>> forward.)
>>>
>>> Before the introduction of CATALOG_ONLY tables, users were required to
>>> have direct read access to the metadata files in order to plan queries on
>>> the table. That implied an access requirement, even though it was never
>>> explicitly documented. With the introduction of CATALOG_ONLY, this implicit
>>> requirement no longer applies, and we currently do not have an explicit
>>> requirement defined in the specification either.
>>>
>>> Prashant Singh <[email protected]> wrote (on Mon, Jan 12, 2026, 23:33):
>>>
>>>> Thank you for the feedback everyone !
>>>>
>>>> Eduard: I am open to naming it _ENFORCED, or even to not having _ONLY or
>>>> _ENFORCED in the first place, as Dan suggested here [1]. Please let me
>>>> know if you are OK with that.
>>>>
>>>> Amogh: Thank you for the feedback on the _preference mode. I tried to
>>>> document some concrete use cases that could benefit from it [2], as I
>>>> believe it gives the catalog and client some room to negotiate when they
>>>> are open to it. Please let me know what you think.
>>>>
>>>> Peter: I believe this kind of vendor lock-in would not be possible with the
>>>> model we are going after, i.e. in loadTable itself we get back the metadata
>>>> pointer, which is self-describing and can be used to register this table in
>>>> a new catalog. Also, the way the catalog (IRC) in particular has been laid
>>>> out, it decouples compute from storage. In the end it is the Admin user of
>>>> the catalog who has given the catalog the admin credentials, which get
>>>> scoped down based on the grants defined in the catalog, and the ADMIN can
>>>> simply revoke the catalog's access or configure a new catalog with
>>>> different admin storage credentials.
>>>> I tried elaborating more on this in the PR feedback too [3]. Please let me
>>>> know what you think.
>>>>
>>>> I will be on top of both the PR and thread moving forward ! Appreciate
>>>> all your feedback.
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/14867#discussion_r2673087002
>>>> [2] https://github.com/apache/iceberg/pull/14867#discussion_r2678941794
>>>> [3] https://github.com/apache/iceberg/pull/14867#discussion_r2678376025
>>>>
>>>> Best,
>>>> Prashant Singh
>>>>
>>>> On Fri, Jan 9, 2026 at 10:34 PM Péter Váry <[email protected]>
>>>> wrote:
>>>>
>>>>> I have a concern about some catalogs starting to make every table
>>>>> `CATALOG_ONLY`, which would essentially lock users to the catalog without
>>>>> providing a way to migrate the data to another catalog.
>>>>> Maybe we add a sentence in the spec to enforce that there should be
>>>>> some users where the catalog MUST provide access to the metadata files.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> On Thu, Jan 8, 2026, 18:38 Amogh Jahagirdar <[email protected]> wrote:
>>>>>
>>>>>> I did a pass over the PR, but I guess I'm a little skeptical about what
>>>>>> the notion of "preferences" truly gets us in the protocol. In case the
>>>>>> endpoint is available but not enforced, my mental model is to just let
>>>>>> the client make whatever choice it wants. If a server really thinks it's
>>>>>> advantageous to use remote planning, I'd think it would just say
>>>>>> server-side planning is enforced. For the "momentary load" case, all a
>>>>>> client would need to do is handle the server throttling and fall back to
>>>>>> client-side planning (I don't think the protocol needs to expand just
>>>>>> for that).
>>>>>>
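
A minimal Python sketch of the fallback described above, for the case where the plan endpoint is available but not enforced. The exception type `ServerThrottled`, the function `plan_with_fallback`, and the placeholder file path are all hypothetical names for illustration; none of them come from the Iceberg client or the REST spec.

```python
from typing import Callable, List


class ServerThrottled(Exception):
    """Hypothetical error for a throttled plan request (for example HTTP 429 or 503)."""


def plan_with_fallback(
    plan_remote: Callable[[], List[str]],
    plan_local: Callable[[], List[str]],
) -> List[str]:
    """Try server-side planning first; plan locally if the server sheds load."""
    try:
        return plan_remote()
    except ServerThrottled:
        # The endpoint is available but not enforced, so the client may
        # plan on its own instead of waiting and retrying.
        return plan_local()


def throttled_remote() -> List[str]:
    # Stand-in for a plan request that the catalog rejects under load.
    raise ServerThrottled("catalog is under load")


def local_plan() -> List[str]:
    # Stand-in for client-side planning; the path is a placeholder.
    return ["s3://bucket/data/file-a.parquet"]


tasks = plan_with_fallback(throttled_remote, local_plan)
```

In this sketch the fallback needs no extra field in the protocol, only an error the client can recognize. It would not apply to a table where server-side planning is enforced.
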
>>>>>> On Wed, Jan 7, 2026 at 11:28 AM Russell Spitzer <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I'm in agreement with Prashant's current plan. I have no preference
>>>>>>> on the naming of "Only" vs "Enforced".
>>>>>>>
>>>>>>> On Wed, Jan 7, 2026 at 4:42 AM Eduard Tudenhöfner <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Instead of calling it "ONLY", maybe "ENFORCED" would be a better
>>>>>>>> term? I think that would more naturally express the behavior without
>>>>>>>> having to define what "ONLY" really means.
>>>>>>>>
>>>>>>>> On Wed, Dec 24, 2025 at 12:05 AM Prashant Singh <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> *Hi everyone,*
>>>>>>>>>
>>>>>>>>> *JB:* Mostly yes, but it's more about what the server wants the
>>>>>>>>> client to do. The server can indicate if it supports a mode or not
>>>>>>>>> via the /v1/config endpoint at this point.
>>>>>>>>>
>>>>>>>>> *Russell:* Thank you for the thorough feedback! I think it is a
>>>>>>>>> great idea to break the optional mode into *Prefer Client |
>>>>>>>>> Prefer Catalog*—it really opens up a lot of interesting use cases.
>>>>>>>>>
>>>>>>>>> For example, the server might support planning but, due to
>>>>>>>>> momentary load, wants the client to see if it's open to planning on
>>>>>>>>> the client side. Similarly, an argument can be made that if the
>>>>>>>>> server has a table cached in memory, it would prefer the client
>>>>>>>>> comes to the server. Earlier, with just the optional value, we were
>>>>>>>>> simply falling back to server or client side planning based on
>>>>>>>>> whether the server supported scan planning. Now, the client can
>>>>>>>>> express its own overrides via catalog configs as well.
>>>>>>>>>
>>>>>>>>> Based on our offline discussion, I have incorporated the feedback
>>>>>>>>> into the updated matrix [1] to document what the planning modes
>>>>>>>>> would be based on the server response and client overrides:
>>>>>>>>>
>>>>>>>>>    - *CLIENT_ONLY + CATALOG_ONLY* = FAIL
>>>>>>>>>    - *One "ONLY" + opposite "PREFERRED"* = ONLY wins
>>>>>>>>>    - *Both "PREFERRED"* = Client config wins
>>>>>>>>>    - *Client not configured* = Use server config or default
>>>>>>>>>
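
A minimal Python sketch of how a client might apply the matrix above. Only CLIENT_ONLY, CATALOG_ONLY, and the four rules come from the thread. The member names `CLIENT_PREFERRED` and `CATALOG_PREFERRED`, the function `resolve_planning_mode`, and the fallback default are assumptions for illustration. The matrix does not say how an ONLY value combines with a PREFERRED value on the same side, or what happens when the server sends nothing; the sketch assumes ONLY still wins and that the client config is used, respectively.

```python
from enum import Enum, auto
from typing import Optional


class PlanningMode(Enum):
    CLIENT_ONLY = auto()
    CLIENT_PREFERRED = auto()   # assumed name for "prefer client"
    CATALOG_PREFERRED = auto()  # assumed name for "prefer catalog"
    CATALOG_ONLY = auto()


ONLY_MODES = {PlanningMode.CLIENT_ONLY, PlanningMode.CATALOG_ONLY}


def resolve_planning_mode(
    client: Optional[PlanningMode],
    server: Optional[PlanningMode],
    # Assumption: the thread does not name the default.
    default: PlanningMode = PlanningMode.CLIENT_PREFERRED,
) -> PlanningMode:
    # Client not configured: use the server config, or the default.
    if client is None:
        return server if server is not None else default
    # Server sent nothing: only the client config is left (assumption).
    if server is None:
        return client
    client_only = client in ONLY_MODES
    server_only = server in ONLY_MODES
    # CLIENT_ONLY + CATALOG_ONLY cannot both be honored.
    if client_only and server_only and client is not server:
        raise ValueError(
            f"conflicting planning modes: {client.name} vs {server.name}"
        )
    # A server-side ONLY wins over any client PREFERRED value.
    if server_only:
        return server
    # Covers a client-side ONLY, and both PREFERRED (client config wins).
    return client
```

For example, `resolve_planning_mode(PlanningMode.CLIENT_PREFERRED, PlanningMode.CATALOG_ONLY)` resolves to `CATALOG_ONLY`, while two opposing ONLY values raise an error.
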
>>>>>>>>> I will update the reference implementation soon based on this. I
>>>>>>>>> would love to know what other folks think!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Prashant Singh
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/iceberg/pull/14867#issuecomment-3683989832
>>>>>>>>>
>>>>>>>>> On Sat, Dec 20, 2025 at 1:26 PM Russell Spitzer <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I can imagine one more
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (None - I would rename this) ClientOnly - Client can use Catalog
>>>>>>>>>> Planning or Local Planning
>>>>>>>>>>
>>>>>>>>>> PreferClient - Client should use local planning, but the plan API
>>>>>>>>>> is available for this table. I can only imagine this would be
>>>>>>>>>> useful for a scenario where most clients are heavy and have the
>>>>>>>>>> resources to do local planning (or engine distributed planning) but
>>>>>>>>>> you still want to support lightweight clients which can’t really do
>>>>>>>>>> planning themselves.
>>>>>>>>>>
>>>>>>>>>> PreferCatalog - Client should use the plan API, but credentials
>>>>>>>>>> have been provided to enable local planning. This is probably a
>>>>>>>>>> transitional state as we move from clients that only support local
>>>>>>>>>> planning to those which can use the plan API.
>>>>>>>>>>
>>>>>>>>>> CatalogOnly - Clients are not provided with the credentials
>>>>>>>>>> required to read the table from the Metadata.json alone. If they do
>>>>>>>>>> not implement the scan plan API they should fail fast, otherwise
>>>>>>>>>> they will fail when they attempt to load a manifest_list file. This
>>>>>>>>>> is used in circumstances where the catalog is giving either
>>>>>>>>>> file-specific credentials or is protecting the delivered files in
>>>>>>>>>> some way such that their contents have been specially redacted or
>>>>>>>>>> something like that.
>>>>>>>>>>
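
A minimal Python sketch of how a client might act on the four modes listed above, including the fail-fast check for CatalogOnly. The function `choose_planner` and its two capability flags are hypothetical. The sketch reads ClientOnly as "local planning only", following its name; that reading is an assumption.

```python
def choose_planner(
    mode: str,
    has_plan_api: bool,
    can_plan_locally: bool = True,
) -> str:
    """Return "catalog" or "local" for one of the four proposed modes."""
    if mode == "CatalogOnly":
        if not has_plan_api:
            # Fail fast instead of failing later on the manifest list read.
            raise RuntimeError(
                "table requires catalog planning, which this client "
                "does not implement"
            )
        return "catalog"
    if mode == "PreferCatalog":
        # Credentials for local planning are still vended, so clients
        # without the plan API keep working during the transition.
        if has_plan_api:
            return "catalog"
        if can_plan_locally:
            return "local"
    elif mode == "PreferClient":
        # Heavy clients plan locally; lightweight ones use the plan API.
        if can_plan_locally:
            return "local"
        if has_plan_api:
            return "catalog"
    elif mode == "ClientOnly":
        # Assumption: read as "local planning only".
        if can_plan_locally:
            return "local"
    else:
        raise ValueError(f"unknown planning mode: {mode}")
    raise RuntimeError(f"client cannot plan a table in mode {mode}")
```

A client without the plan API gets an immediate error for a CatalogOnly table, rather than a storage access failure partway through planning.
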
>>>>>>>>>>
>>>>>>>>>> I assume most catalogs will start with “ClientOnly” or “None”
>>>>>>>>>>
>>>>>>>>>> Then, as catalogs begin to support the planning API, we will see
>>>>>>>>>> most tables move to PreferCatalog, with some perhaps extremely heavy
>>>>>>>>>> or large tables staying as PreferClient or ClientOnly.
>>>>>>>>>>
>>>>>>>>>> Then catalogs with special protections may have some tables return
>>>>>>>>>> CatalogOnly so they can either scope credentials more tightly or
>>>>>>>>>> manipulate the files that the client actually has access to in some
>>>>>>>>>> way.
>>>>>>>>>>
>>>>>>>>>> On Sat, Dec 20, 2025 at 1:09 AM Jean-Baptiste Onofré <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Prashant
>>>>>>>>>>>
>>>>>>>>>>> It makes sense to me. I guess we are using Catalog properties to
>>>>>>>>>>> indicate what the REST server supports to the client, right ?
>>>>>>>>>>> I will take a look at the PR, but I like the idea.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> JB
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Dec 20, 2025 at 12:53 AM Prashant Singh <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey All,
>>>>>>>>>>>>
>>>>>>>>>>>> I wanted to bring up the discussion of introducing a concept of
>>>>>>>>>>>> REST scan planning mode, which would help the server instruct the
>>>>>>>>>>>> client on how to plan the table via loadTableResponse or a
>>>>>>>>>>>> table-level config override.
>>>>>>>>>>>> There are three possible values one could think of:
>>>>>>>>>>>> 1. *None*: plan it on the client side, e.g. because the table is
>>>>>>>>>>>> too small and the additional REST request would add more overhead
>>>>>>>>>>>> than benefit.
>>>>>>>>>>>> 2. *Optional*: the client can choose to plan either locally or
>>>>>>>>>>>> via server-side planning.
>>>>>>>>>>>> 3. *Required*: the client MUST do server-side planning. The server
>>>>>>>>>>>> could suggest this if it has better indexed the Iceberg metadata,
>>>>>>>>>>>> the client is running on low resources, or the table is protected.
>>>>>>>>>>>> The server MAY choose whatever way is required to ensure the
>>>>>>>>>>>> client can't bypass this, for example by not vending credentials
>>>>>>>>>>>> as part of loadTable and only minting them as part of planning
>>>>>>>>>>>> completion; this would mean if the client doesn't call plan
>>>>>>>>>>>> table .
>>>>>>>>>>>>
>>>>>>>>>>>> I proactively have created a pull request [1], would love to
>>>>>>>>>>>> know all your feedback either here or in the PR directly !
>>>>>>>>>>>>
>>>>>>>>>>>> Wish you all a very happy Holidays, it has been great working
>>>>>>>>>>>> with you all.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/iceberg/pull/14867
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Prashant Singh
>>>>>>>>>>>>
>>>>>>>>>>>
