Re: [DISCUSS] Table Identifiers in Iceberg View Spec

Benny Chow Fri, 25 Apr 2025 17:22:47 -0700

I'd like to contribute my opinions on this:

- I don't particularly like the current behavior of "default to the view's
catalog when default-catalog is not set".  Fundamentally, I believe the
intent of default-catalog and default-namespace is there to help users
write more concise SQL.
- The Spark session catalog is engine-specific, and I don't think we should
design something that says first use this catalog, then that catalog... or
that catalog.  For example, resolving identifiers using default-catalog ->
view's catalog -> session catalog is not good.
- We gotta support non-Iceberg tables; otherwise I see no value in putting
views in the catalog to share with other engines.
- Interoperability between different engine types is very hard due to
dialect issues... so I think we should focus on supporting different
clusters of the same engine type on a shared catalog.  For example, AI and
BI clusters on Spark sharing the same views in a REST catalog.
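For reference, the rule in the first bullet ("default to the view's catalog
when default-catalog is not set") can be sketched roughly as below. This is
only an illustration in Python: `resolve_identifier` and its parameters are
hypothetical names, and treating any identifier with three or more parts as
already catalog-qualified is a simplifying assumption (a real engine cannot
tell a catalog from a nested namespace that easily).

```python
from typing import Optional


def resolve_identifier(identifier: str,
                       view_catalog: str,
                       default_catalog: Optional[str] = None) -> str:
    """Qualify a table identifier found in a view's SQL text.

    Assumption: three or more dot-separated parts means the catalog is
    already explicit; fewer parts means the catalog is missing.
    """
    parts = identifier.split(".")
    if len(parts) >= 3:
        # Catalog is spelled out in the view text; nothing to resolve.
        return identifier
    # Use default-catalog when set, otherwise fall back to the view's catalog.
    catalog = default_catalog if default_catalog is not None else view_catalog
    return f"{catalog}.{identifier}"


# View created as catalogA.sales.monthly_orders, referencing "sales.orders".
no_default = resolve_identifier("sales.orders", view_catalog="catalogA")
with_default = resolve_identifier("sales.orders", view_catalog="catalogA",
                                  default_catalog="catalogB")
```

Note that the session catalog never appears as an input here, which is the
"closure" argument made further down the thread.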


Coincidentally, I think the ultimate solution is along the lines of
something Russell proposed last year:

https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7

We've been looking at this interoperable identifier problem through the
lens of catalog resolution but maybe the right approach is really about
templating.

I would extend Russell's idea to allow identifiers in a view to span
catalogs to support non-Iceberg tables.  Also, the default-catalog
property could be templated as well.
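As a rough illustration of what templating could look like (an assumption
for discussion, not something the linked thread or the view spec defines):
the `${...}` placeholder syntax, the placeholder names `lake` and `crm`, the
catalog names, and the `render_view` helper are all hypothetical. The sketch
uses Python's standard `string.Template` so that each cluster supplies its
own catalog names for the same stored view text.

```python
from string import Template

# Hypothetical stored view text: catalog parts are placeholders, not names.
VIEW_SQL = Template(
    "SELECT o.id, a.name "
    "FROM ${lake}.sales.orders o "
    "JOIN ${crm}.public.accounts a ON o.account_id = a.id"
)
# Hypothetical templated default-catalog property.
DEFAULT_CATALOG = Template("${lake}")


def render_view(bindings: dict[str, str]) -> tuple[str, str]:
    """Return (sql, default_catalog) with placeholders bound for one cluster.

    Raises KeyError if the cluster has no binding for a placeholder, so a
    missing catalog mapping fails loudly instead of resolving elsewhere.
    """
    return VIEW_SQL.substitute(bindings), DEFAULT_CATALOG.substitute(bindings)


# Two clusters that registered the same catalogs under different names.
bi_sql, bi_default = render_view({"lake": "prod_iceberg", "crm": "jdbc_crm"})
ai_sql, ai_default = render_view({"lake": "iceberg", "crm": "crm_pg"})
```

The second placeholder stands in for a non-Iceberg catalog, which is the
cross-catalog case mentioned above.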

Thoughts?
Benny



On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Thanks Steven! How do you recommend making Spark implementation conform to
> the spec? Do we need Spark SQL extensions and/or Spark catalog APIs for
> that?
>
> How do you recommend reconciling the inconsistencies I shared regarding
> many resolution methods not consistently being followed in different
> scenarios (view vs child table resolution, query vs view resolution)? Note
> these occur when the default catalog is set to a non-null value. If it
> helps, I can share concrete examples.
>
> Thanks,
> Walaa.
>
> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu <stevenz...@gmail.com> wrote:
>
>> The core issue is the fallback behavior when `default-catalog` is
>> not defined. Current view spec says the fallback should be the catalog
>> where the view is defined. It doesn't really matter what the catalog
>> is named (catalogX) by the read engine.
>> - If a view refers to the tables in the same catalog, this is a
>> non-ambiguous and reasonable fallback behavior.
>> - If a view refers to tables from another catalog, catalog names
>> should be included in the reference name already. So no ambiguity
>> there either.
>>
>> Potential inconsistent naming of catalog is a separate problem, which
>> Iceberg view spec probably cannot solve. We can only recommend that
>> catalog should be named consistently across usage for better
>> interoperability on name references.
>>
>> This proposal is to change the fallback behavior to the engine's session
>> default catalog. I am not sure it is better than the current fallback
>> behavior.
>>
>> > Today’s Spark behavior explicitly differs from this idea. Spark
>> resolves table identifiers during view creation using the session’s default
>> catalog, not a supplied `default-catalog`.
>>
>> I would argue that is a Spark implementation issue for not conforming
>> to the spec.
>>
>>
>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa
>> <wa.moust...@gmail.com> wrote:
>> >
>> > Hi Jan,
>> >
>> > Thanks again for continuing the discussion. I want to highlight a few
>> fundamental issues around the interpretation of default-catalog:
>> >
>> > Here is the real catch:
>> >
>> > * default-catalog cannot logically be defined at view creation time. It
>> would be circular: the view needs to exist before its metadata (and hence
>> default-catalog) can exist. This is visible in Spark’s implementation,
>> where `default-catalog` is not used.
>> >
>> > * Introducing a creation-time default-catalog setting would require
>> extending SQL syntax and engine APIs to promote it to a first-class view
>> concept. This would be intrusive, non-intuitive, and realistically very
>> difficult to standardize across engines.
>> >
>> > * Today’s Spark behavior explicitly differs from this idea. Spark
>> resolves table identifiers during view creation using the session’s default
>> catalog, not a supplied `default-catalog`.
>> >
>> > * Hypothetically even if we patched in a creation-time default-catalog,
>> it would create an inconsistent binding model between tables vs views
>> (early vs late), and between tables in views and in queries (again early vs
>> late). For example, views and tables in queries can withstand default
>> catalog renames, but tables cannot when they are used inside views -- it
>> even applies to views inside views, which makes this very hard to reason
>> about considering nesting.
>> >
>> > Thanks,
>> > Walaa
>> >
>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul <jank...@mailbox.org.invalid>
>> wrote:
>> >>
>> >> @Walaa:
>> >>
>> >> I would argue that when you run a CREATE VIEW statement the query
>> engine knows which catalog the view is being created in. So even though we
>> typically use late binding to resolve the view catalog at query time, it
>> can also be used at creation time.
>> >>
>> >> The query engine would need to keep track of the "view catalog" where
>> the view is going to be created. It can use that catalog to resolve
>> partial table identifiers if "default-catalog" is not set.
>> >>
>> >> It can lead to some unintuitive behavior, where partial identifiers in
>> the view query resolve to a different catalog compared to using them
>> outside of a view.
>> >>
>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * FROM sales.orders;
>> >>
>> >> If the session default catalog is not "catalogA", the "sales.orders"
>> in the view query would not be the same as just referencing "sales.orders"
>> in a normal SQL statement. This is because without a "default-catalog", the
>> catalog name of "sales.orders" would default to "catalogA", which is the
>> view's catalog.
>> >>
>> >> Thanks,
>> >>
>> >> Jan
>> >>
>> >> On 4/25/25 04:05, Manu Zhang wrote:
>> >>>
>> >>> For example, if we want to validate that the tables referenced in the
>> view exist, how can we do that when default-catalog isn't defined, since
>> the view hasn't been created or loaded yet?
>> >>
>> >> I don't think this is related to view spec. How do we validate that a
>> table exists without a default catalog, or do we always use the current
>> session catalog?
>> >>
>> >> Thanks,
>> >> Manu
>> >>
>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>
>> >>> Hi Jan,
>> >>>
>> >>> I think we still share the same understanding. Just to clarify: when
>> I referred to late binding as “similar” to the proposal, I was
>> acknowledging the distinction between view-level and table-level
>> resolution. But as you noted, both follow a late binding model.
>> >>>
>> >>> That said, this still raises an interesting question and a potential
>> gap: if default-catalog is only defined at query time, how should
>> resolution work during view creation? For example, if we want to validate
>> that the tables referenced in the view exist, how can we do that when
>> default-catalog isn't defined, since the view hasn't been created or loaded
>> yet?
>> >>>
>> >>> Thanks,
>> >>> Walaa.
>> >>>
>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul <jank...@mailbox.org.invalid>
>> wrote:
>> >>>>
>> >>>> Yes, I have the same understanding. The view catalog is resolved at
>> query time.
>> >>>>
>> >>>> As you mentioned before, it's good to distinguish between the
>> physical catalog and its reference used in SQL statements. The important
>> part is that the physical catalog of the view and the tables referenced in
>> its definition stay consistent. You could create a view in a given
>> physical catalog by referring to it as "catalogA", as in your first point.
>> If you then, given a different setup, refer to the same physical catalog as
>> "catalogB" in another session/environment, the behavior should still work.
>> >>>>
>> >>>> I would however rephrase your last point. Late binding applies to
>> the view catalog name and by extension to all partial table references when
>> no "default-catalog" is present. Resolving the view catalog name at query
>> time is not opposed to storing the view metadata in a catalog.
>> >>>>
>> >>>> Or maybe I don't entirely understand what you mean.
>> >>>>
>> >>>> Thanks
>> >>>>
>> >>>> Jan
>> >>>>
>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
>> >>>>
>> >>>> Hi Jan,
>> >>>>
>> >>>> > The view is executed when it's being referenced in a SQL
>> statement. That statement contains the information for the query engine to
>> resolve the catalog of the view.
>> >>>>
>> >>>> If I’m understanding correctly, that means:
>> >>>>
>> >>>> * If the view is queried as SELECT * FROM catalogA.namespace.view,
>> then catalogA is considered the view’s catalog.
>> >>>>
>> >>>> * If the same view is later queried as SELECT * FROM
>> catalogB.namespace.view (after renaming catalogA to catalogB, and keeping
>> everything else the same), then catalogB becomes the view’s catalog.
>> >>>>
>> >>>> Is that interpretation correct? If so, it sounds to me like the
>> catalog is resolved at query time, based on how the view is referenced, not
>> from any stored metadata. That would imply some sort of a late binding
>> behavior (similar to the proposal), as opposed to using some catalog that
>> "stores" the view definition.
>> >>>>
>> >>>> Thanks,
>> >>>> Walaa
>> >>>>
>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul
>> <jank...@mailbox.org.invalid> wrote:
>> >>>>>
>> >>>>> Hi Walaa,
>> >>>>>
>> >>>>> Thanks for clarifying the aspects of non-determinism. Let me try to
>> address your questions.
>> >>>>>
>> >>>>> 1. This is my interpretation of the current spec: The view is
>> executed when it's being referenced in a SQL statement. That statement
>> contains the information for the query engine to resolve the catalog of the
>> view. The query engine then uses that information to fetch the view
>> metadata from the catalog. It also needs to temporarily keep track of which
>> catalog it used to fetch the view metadata. It can then use that
>> information to resolve the table references in the view's SQL definition in
>> case no default catalog is specified.
>> >>>>>
>> >>>>> 2. The important part is that the catalog can be referenced at
>> execution time. As long as that's the case I would assume the view can be
>> created in any catalog.
>> >>>>>
>> >>>>>
>> >>>>> I think your point is really valuable because the current
>> specification can lead to some unintuitive behavior. For example for the
>> following statement:
>> >>>>>
>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * FROM sales.orders;
>> >>>>>
>> >>>>> If the session default catalog is not "catalogA", the
>> "sales.orders" in the view query would not be the same as just referencing
>> "sales.orders" in a normal SQL statement. This is because without a
>> "default-catalog", the catalog name of "sales.orders" would default to
>> "catalogA".
>> >>>>>
>> >>>>>
>> >>>>> However, I like the current design of the view spec, because it has
>> the "closure" property. Because of the fact that the "view catalog" has to
>> be known when executing a view, all the information required to resolve the
>> table identifiers is contained in the view metadata (and the "view
>> catalog"). I think that if you make the identifier resolution dependent on
>> external parameters, it hinders portability.
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> Jan
>> >>>>>
>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
>> >>>>>
>> >>>>> Hi Jan,
>> >>>>>
>> >>>>> Thanks for the thoughtful feedback.
>> >>>>>
>> >>>>> I think it’s important we clarify a key point before going deeper:
>> >>>>>
>> >>>>> Non-determinism is not caused by session fallback behavior—it’s a
>> fundamental limitation of using table identifiers alone, regardless of
>> whether we use the current rule, the proposed fallback to the session’s
>> default catalog, or even early vs. late binding.
>> >>>>>
>> >>>>> The same fully qualified identifier (e.g.,
>> catalogA.namespace.table) can resolve to different objects depending solely
>> on engine-specific routing logic or catalog aliases. So determinism isn’t
>> guaranteed just because an identifier is "fully qualified." The only
>> reliable anchor for identity is the UUID. That’s why the proposed use of
>> UUIDs is not just a hardening strategy. It’s the actual fix for correctness.
>> >>>>>
>> >>>>> To move the conversation forward, could you help clarify two things
>> in the context of the current spec:
>> >>>>>
>> >>>>> * Where in the metadata is the “view catalog” stored, so that an
>> engine knows to fall back to it if default-catalog is null?
>> >>>>>
>> >>>>> * Are we even allowed to create views in the session's default
>> catalog (i.e., without specifying a catalog) in the current Iceberg spec?
>> >>>>>
>> >>>>> These questions are important because if we can’t unambiguously
>> recover the "view catalog" from metadata, then defaulting to it is
>> problematic. And if views can't be created in the default catalog, then the
>> fallback rule doesn’t generalize.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Walaa.
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul
>> <jank...@mailbox.org.invalid> wrote:
>> >>>>>>
>> >>>>>> Hi Walaa,
>> >>>>>>
>> >>>>>> Thank you for your proposal. If I understood correctly, your
>> proposal is composed of three parts:
>> >>>>>>
>> >>>>>> - session default catalog as fallback for "default-catalog"
>> >>>>>>
>> >>>>>> - session default namespace as fallback for "default-namespace"
>> >>>>>>
>> >>>>>> - Late binding + UUID validation
>> >>>>>>
>> >>>>>> I have some comments regarding these points.
>> >>>>>>
>> >>>>>>
>> >>>>>> 1. Session default catalog as fallback for "default-catalog"
>> >>>>>>
>> >>>>>> Introducing a behavior that depends on the current session setup
>> is in my opinion the definition of "non-determinism". You could be running
>> the same query-engine and catalog-setup on different days, with different
>> default session catalogs (which is rather common), and would be getting
>> different results.
>> >>>>>>
>> >>>>>> Whereas with the current behavior, the view always produces the
>> same results. The current behavior has some rough edges in very niche use
>> cases, but I think it is solid for most use cases.
>> >>>>>>
>> >>>>>> 2. Session default namespace as fallback for "default-namespace"
>> >>>>>>
>> >>>>>> Similar to the above.
>> >>>>>>
>> >>>>>> 3. Late binding + UUID validation
>> >>>>>>
>> >>>>>> If I understand it correctly, the current implementation already
>> uses late binding.
>> >>>>>>
>> >>>>>> Generally, having UUID validation makes the setup more robust.
>> Which is great. However, having UUID validation still requires us to have a
>> portable table identifier specification. Even if we have the UUIDs of the
>> referenced tables from the view, there simply isn't an interface that lets
>> us use those UUIDs. The catalog interface is defined in terms of table
>> identifiers.
>> >>>>>>
>> >>>>>> So we always require a working catalog setup and suitable table
>> identifiers to obtain the table metadata. We can use the UUIDs to verify if
>> we loaded the correct table. But this can only be done after we used some
>> identifier. Which means there is no way of using UUIDs without a
>> functioning catalog/identifier setup.
>> >>>>>>
>> >>>>>>
>> >>>>>> In conclusion, I prefer the current behavior for "default-catalog"
>> because it is more deterministic in my opinion. And I think the current
>> spec does a good job for multi-engine table identifier resolution. I see
>> the UUID validation as more of an additional hardening strategy.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>>
>> >>>>>> Jan
>> >>>>>>
>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
>> >>>>>>
>> >>>>>> Thanks Renjie!
>> >>>>>>
>> >>>>>> The existing spec has some guidance on resolving catalogs on the
>> fly already (to address the case of view text with table identifiers
>> missing the catalog part). The guidance is to use the catalog where the
>> view is stored. But I find this rule hard to interpret or use. The catalog
>> itself is a logical construct—such as a federated catalog that delegates to
>> multiple physical backends (e.g., HMS and REST). In such cases, the catalog
>> (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t physically
>> store the tables; it only routes requests to underlying stores. Therefore,
>> defaulting identifier resolution based on the catalog where the view is
>> "stored" doesn’t align with how catalogs actually behave in practice.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Walaa.
>> >>>>>>
>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <
>> liurenjie2...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Hi, Walaa:
>> >>>>>>>
>> >>>>>>> Thanks for the proposal.
>> >>>>>>>
>> >>>>>>> I've reviewed the doc, but in general I have some concerns with
>> resolving catalog names on the fly with query engine defined catalog names.
>> This introduces some flexibility at first glance, but also makes
>> misconfiguration difficult to explain.
>> >>>>>>>
>> >>>>>>> But I agree with one part that we should store resolved table
>> UUIDs in view metadata, as table/view renaming may introduce errors that are
>> difficult for users to understand.
>> >>>>>>>
>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Everyone,
>> >>>>>>>>
>> >>>>>>>> Looking forward to keeping up the momentum and closing out the
>> MV spec as well. I’m hoping we can proceed to a vote next week.
>> >>>>>>>>
>> >>>>>>>> Here is a summary in case that helps. The proposal outlines a
>> strategy for handling table identifiers in Iceberg view metadata, with the
>> goal of ensuring correctness, portability, and engine compatibility. It
>> recommends resolving table identifiers at read time (late binding) rather
>> than creation time, and introduces UUID-based validation to maintain
>> identity guarantees across engines or sessions. It also revises how
>> default-catalog and default-namespace are handled (defaulting both to the
>> session context if not explicitly set) to better align with engine behavior
>> and improve cross-engine interoperability.
>> >>>>>>>>
>> >>>>>>>> Please let me know your thoughts.
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Walaa.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
>> >>>>>>>>>
>> >>>>>>>>> One key point to keep in mind is that catalog names in the spec
>> refer to logical catalogs—i.e., the first part of a three-part identifier.
>> These correspond to Spark's DataSourceV2 catalogs, Trino connectors, and
>> similar constructs. This is a level of abstraction above physical catalogs,
>> which are not referenced or used in the view spec. The reason is that table
>> identifiers in the view definition/text itself refer to logical catalogs,
>> not physical ones (since they interface directly with the engine and not a
>> specific metastore).
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Walaa.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <sungwy...@gmail.com>
>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Thank you Walaa for the proposal. I think view portability is
>> a very important topic for us to continue discussing as it relies on many
>> assumptions within the data ecosystem for it to function like you've
>> highlighted well in the document.
>> >>>>>>>>>>
>> >>>>>>>>>> I've added a few comments around how this may impact the
>> permission questions the engines will be asking, and whether that is the
>> desired behavior.
>> >>>>>>>>>>
>> >>>>>>>>>> Sung
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've added a few
>> comments to get a better understanding of what this will look like in the
>> actual implementation.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Eduard
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Hi Everyone,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Starting this thread to resume our discussion on how to
>> reference table identifiers from Iceberg metadata, a key aspect of the view
>> specification, particularly in relation to the MV (materialized view)
>> extensions.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I had the chance to speak offline with a few community
>> members to better understand how the current spec is being interpreted.
>> Those conversations served as inputs to a new proposal on how table
>> identifier references could be represented in metadata.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> You can find the proposal here [1]. I look forward to your
>> feedback and working together to move this forward so we can finalize the
>> MV spec as well.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> [1]
>> https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>> Walaa.
>>
>
