Re: [DISCUSS] Table Identifiers in Iceberg View Spec

Walaa Eldin Moustafa Fri, 25 Apr 2025 16:02:52 -0700

Thanks Steven! How do you recommend making the Spark implementation
conform to the spec? Do we need Spark SQL extensions and/or Spark catalog
APIs for that?


How do you recommend reconciling the inconsistencies I shared, where the
resolution rules are not followed consistently across scenarios (view vs.
child table resolution, query vs. view resolution)? Note that these occur
when the default catalog is set to a non-null value. If it helps, I can
share concrete examples.

Thanks,
Walaa.

On Fri, Apr 25, 2025 at 3:52 PM Steven Wu <stevenz...@gmail.com> wrote:

> The core issue is the fallback behavior when `default-catalog` is
> not defined. Current view spec says the fallback should be the catalog
> where the view is defined. It doesn't really matter what the catalog
> is named (catalogX) by the read engine.
> - If a view refers to the tables in the same catalog, this is a
> non-ambiguous and reasonable fallback behavior.
> - If a view refers to tables from another catalog, catalog names
> should be included in the reference name already. So no ambiguity
> there either.
>
> Potential inconsistent naming of catalog is a separate problem, which
> Iceberg view spec probably cannot solve. We can only recommend that
> catalog should be named consistently across usage for better
> interoperability on name references.
>
> This proposal is to change the fallback behavior to engine's session
> default catalog. I am not sure it is better than the current fallback
> behavior.
>
> > Today’s Spark behavior explicitly differs from this idea. Spark resolves
> table identifiers during view creation using the session’s default catalog,
> not a supplied `default-catalog`.
>
> I would argue that is a Spark implementation issue for not conforming
> to the spec.
>
>
> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa
> <wa.moust...@gmail.com> wrote:
> >
> > Hi Jan,
> >
> > Thanks again for continuing the discussion. I want to highlight a few
> fundamental issues around the interpretation of default-catalog:
> >
> > Here is the real catch:
> >
> > * default-catalog cannot logically be defined at view creation time. It
> would be circular: the view needs to exist before its metadata (and hence
> default-catalog) can exist. This is visible in Spark’s implementation,
> where `default-catalog` is not used.
> >
> > * Introducing a creation-time default-catalog setting would require
> extending SQL syntax and engine APIs to promote it to a first-class view
> concept. This would be intrusive, non-intuitive, and realistically very
> difficult to standardize across engines.
> >
> > * Today’s Spark behavior explicitly differs from this idea. Spark
> resolves table identifiers during view creation using the session’s default
> catalog, not a supplied `default-catalog`.
> >
> > * Hypothetically even if we patched in a creation-time default-catalog,
> it would create an inconsistent binding model between tables vs views
> (early vs late), and between tables in views and in queries (again early vs
> late). For example, views and tables in queries can withstand default
> catalog renames, but tables cannot when they are used inside views -- it
> even applies to views inside views, which makes this very hard to reason
> about considering nesting.
> >
> > Thanks,
> > Walaa
> >
> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul <jank...@mailbox.org.invalid>
> wrote:
> >>
> >> @Walaa:
> >>
> >> I would argue that when you run a CREATE VIEW statement the query
> engine knows which catalog the view is being created in. So even though we
> typically use late binding to resolve the view catalog at query time, it
> can also be used at creation time.
> >>
> >> The query engine would need to keep track of the "view catalog" in which
> the view is going to be created. It can use that catalog to resolve
> partial table identifiers if "default-catalog" is not set.
> >>
> >> It can lead to some unintuitive behavior, where partial identifiers in
> the view query resolve to a different catalog compared to using them
> outside of a view.
> >>
> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
> >>
> >> If the session default catalog is not "catalogA", the "sales.orders" in
> the view query would not be the same as just referencing "sales.orders" in
> a normal SQL statement. This is because without a "default-catalog", the
> catalog name of "sales.orders" would default to "catalogA", which is the
> view's catalog.
> >>
> >> Thanks,
> >>
> >> Jan
> >>
> >> On 4/25/25 04:05, Manu Zhang wrote:
> >>>
> >>> For example, if we want to validate that the tables referenced in the
> view exist, how can we do that when default-catalog isn't defined, since
> the view hasn't been created or loaded yet?
> >>
> >> I don't think this is related to view spec. How do we validate that a
> table exists without a default catalog, or do we always use the current
> session catalog?
> >>
> >> Thanks,
> >> Manu
> >>
> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
> >>>
> >>> Hi Jan,
> >>>
> >>> I think we still share the same understanding. Just to clarify: when I
> referred to late binding as “similar” to the proposal, I was acknowledging
> the distinction between view-level and table-level resolution. But as you
> noted, both follow a late binding model.
> >>>
> >>> That said, this still raises an interesting question and a potential
> gap: if default-catalog is only defined at query time, how should
> resolution work during view creation? For example, if we want to validate
> that the tables referenced in the view exist, how can we do that when
> default-catalog isn't defined, since the view hasn't been created or loaded
> yet?
> >>>
> >>> Thanks,
> >>> Walaa.
> >>>
> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul <jank...@mailbox.org.invalid>
> wrote:
> >>>>
> >>>> Yes, I have the same understanding. The view catalog is resolved at
> query time.
> >>>>
> >>>> As you mentioned before, it's good to distinguish between the
> physical catalog and its reference used in SQL statements. The important
> part is that the physical catalog of the view and the tables referenced in
> its definition stay consistent. You could create a view in a given
> physical catalog by referring to it as "catalogA", as in your first point.
> If you then, given a different setup, refer to the same physical catalog as
> "catalogB" in another session/environment, the behavior should still work.
> >>>>
> >>>> I would however rephrase your last point. Late binding applies to the
> view catalog name and by extension to all partial table references when no
> "default-catalog" is present. Resolving the view catalog name at query time
> is not opposed to storing the view metadata in a catalog.
> >>>>
> >>>> Or maybe I don't entirely understand what you mean.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Jan
> >>>>
> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
> >>>>
> >>>> Hi Jan,
> >>>>
> >>>> > The view is executed when it's being referenced in a SQL statement.
> That statement contains the information for the query engine to resolve the
> catalog of the view.
> >>>>
> >>>> If I’m understanding correctly, that means:
> >>>>
> >>>> * If the view is queried as SELECT * FROM catalogA.namespace.view,
> then catalogA is considered the view’s catalog.
> >>>>
> >>>> * If the same view is later queried as SELECT * FROM
> catalogB.namespace.view (after renaming catalogA to catalogB, and keeping
> everything else the same), then catalogB becomes the view’s catalog.
> >>>>
> >>>> Is that interpretation correct? If so, it sounds to me like the
> catalog is resolved at query time, based on how the view is referenced, not
> from any stored metadata. That would imply some sort of a late binding
> behavior (similar to the proposal), as opposed to using some catalog that
> "stores" the view definition.
> >>>>
> >>>> Thanks,
> >>>> Walaa
> >>>>
> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul <jank...@mailbox.org.invalid>
> wrote:
> >>>>>
> >>>>> Hi Walaa,
> >>>>>
> >>>>> Thanks for clarifying the aspects of non-determinism. Let me try to
> address your questions.
> >>>>>
> >>>>> 1. This is my interpretation of the current spec: The view is
> executed when it's being referenced in a SQL statement. That statement
> contains the information for the query engine to resolve the catalog of the
> view. The query engine then uses that information to fetch the view
> metadata from the catalog. It also needs to temporarily keep track of which
> catalog it used to fetch the view metadata. It can then use that
> information to resolve the table references in the view's SQL definition in
> case no default catalog is specified.
> >>>>>
> >>>>> 2. The important part is that the catalog can be referenced at
> execution time. As long as that's the case I would assume the view can be
> created in any catalog.
> >>>>>
> >>>>>
> >>>>> I think your point is really valuable because the current
> specification can lead to some unintuitive behavior. For example for the
> following statement:
> >>>>>
> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from
> sales.orders;
> >>>>>
> >>>>> If the session default catalog is not "catalogA", the "sales.orders"
> in the view query would not be the same as just referencing "sales.orders"
> in a normal SQL statement. This is because without a "default-catalog", the
> catalog name of "sales.orders" would default to "catalogA".
> >>>>>
> >>>>>
> >>>>> However, I like the current design of the view spec, because it has
> the "closure" property. Because of the fact that the "view catalog" has to
> be known when executing a view, all the information required to resolve the
> table identifiers is contained in the view metadata (and the "view
> catalog"). I think that if you make the identifier resolution dependent on
> external parameters, it hinders portability.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jan
> >>>>>
> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
> >>>>>
> >>>>> Hi Jan,
> >>>>>
> >>>>> Thanks for the thoughtful feedback.
> >>>>>
> >>>>> I think it’s important we clarify a key point before going deeper:
> >>>>>
> >>>>> Non-determinism is not caused by session fallback behavior—it’s a
> fundamental limitation of using table identifiers alone, regardless of
> whether we use the current rule, the proposed fallback to the session’s
> default catalog, or even early vs. late binding.
> >>>>>
> >>>>> The same fully qualified identifier (e.g., catalogA.namespace.table)
> can resolve to different objects depending solely on engine-specific
> routing logic or catalog aliases. So determinism isn’t guaranteed just
> because an identifier is "fully qualified." The only reliable anchor for
> identity is the UUID. That’s why the proposed use of UUIDs is not just a
> hardening strategy. It’s the actual fix for correctness.
> >>>>>
> >>>>> To move the conversation forward, could you help clarify two things
> in the context of the current spec:
> >>>>>
> >>>>> * Where in the metadata is the “view catalog” stored, so that an
> engine knows to fall back to it if default-catalog is null?
> >>>>>
> >>>>> * Are we even allowed to create views in the session's default
> catalog (i.e., without specifying a catalog) in the current Iceberg spec?
> >>>>>
> >>>>> These questions are important because if we can’t unambiguously
> recover the "view catalog" from metadata, then defaulting to it is
> problematic. And if views can't be created in the default catalog, then the
> fallback rule doesn’t generalize.
> >>>>>
> >>>>> Thanks,
> >>>>> Walaa.
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul <jank...@mailbox.org.invalid>
> wrote:
> >>>>>>
> >>>>>> Hi Walaa,
> >>>>>>
> >>>>>> Thank you for your proposal. If I understood correctly, your
> proposal is composed of three parts:
> >>>>>>
> >>>>>> - session default catalog as fallback for "default-catalog"
> >>>>>>
> >>>>>> - session default namespace as fallback for "default-namespace"
> >>>>>>
> >>>>>> - Late binding + UUID validation
> >>>>>>
> >>>>>> I have some comments regarding these points.
> >>>>>>
> >>>>>>
> >>>>>> 1. Session default catalog as fallback for "default-catalog"
> >>>>>>
> >>>>>> Introducing a behavior that depends on the current session setup is
> in my opinion the definition of "non-determinism". You could be running the
> same query-engine and catalog-setup on different days, with different
> default session catalogs (which is rather common), and would be getting
> different results.
> >>>>>>
> >>>>>> Whereas with the current behavior, the view always produces the
> same results. The current behavior has some rough edges in very niche use
> cases but I think it is solid for most use cases.
> >>>>>>
> >>>>>> 2. Session default namespace as fallback for "default-namespace"
> >>>>>>
> >>>>>> Similar to the above.
> >>>>>>
> >>>>>> 3. Late binding + UUID validation
> >>>>>>
> >>>>>> If I understand it correctly, the current implementation already
> uses late binding.
> >>>>>>
> >>>>>> Generally, having UUID validation makes the setup more robust.
> Which is great. However, having UUID validation still requires us to have a
> portable table identifier specification. Even if we have the UUIDs of the
> referenced tables from the view, there simply isn't an interface that lets
> us use those UUIDs. The catalog interface is defined in terms of table
> identifiers.
> >>>>>>
> >>>>>> So we always require a working catalog setup and suitable table
> identifiers to obtain the table metadata. We can use the UUIDs to verify if
> we loaded the correct table. But this can only be done after we used some
> identifier. Which means there is no way of using UUIDs without a
> functioning catalog/identifier setup.
> >>>>>>
> >>>>>>
> >>>>>> In conclusion, I prefer the current behavior for "default-catalog"
> because it is more deterministic in my opinion. And I think the current
> spec does a good job for multi-engine table identifier resolution. I see
> the UUID validation more of an additional hardening strategy.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Jan
> >>>>>>
> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
> >>>>>>
> >>>>>> Thanks Renjie!
> >>>>>>
> >>>>>> The existing spec has some guidance on resolving catalogs on the
> fly already (to address the case of view text with table identifiers
> missing the catalog part). The guidance is to use the catalog where the
> view is stored. But I find this rule hard to interpret or use. The catalog
> itself is a logical construct—such as a federated catalog that delegates to
> multiple physical backends (e.g., HMS and REST). In such cases, the catalog
> (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t physically
> store the tables; it only routes requests to underlying stores. Therefore,
> defaulting identifier resolution based on the catalog where the view is
> "stored" doesn’t align with how catalogs actually behave in practice.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Walaa.
> >>>>>>
> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <
> liurenjie2...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi, Walaa:
> >>>>>>>
> >>>>>>> Thanks for the proposal.
> >>>>>>>
> >>>>>>> I've reviewed the doc, but in general I have some concerns with
> resolving catalog names on the fly with query engine defined catalog names.
> This introduces some flexibility at first glance, but also makes
> misconfiguration difficult to explain.
> >>>>>>>
> >>>>>>> But I agree with one part: we should store resolved table UUIDs
> in view metadata, as table/view renaming may introduce errors that are
> difficult for users to understand.
> >>>>>>>
> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Everyone,
> >>>>>>>>
> >>>>>>>> Looking forward to keeping up the momentum and closing out the MV
> spec as well. I’m hoping we can proceed to a vote next week.
> >>>>>>>>
> >>>>>>>> Here is a summary in case that helps. The proposal outlines a
> strategy for handling table identifiers in Iceberg view metadata, with the
> goal of ensuring correctness, portability, and engine compatibility. It
> recommends resolving table identifiers at read time (late binding) rather
> than creation time, and introduces UUID-based validation to maintain
> identity guarantees across engines or sessions. It also revises how
> default-catalog and default-namespace are handled (defaulting both to the
> session context if not explicitly set) to better align with engine behavior
> and improve cross-engine interoperability.
> >>>>>>>>
> >>>>>>>> Please let me know your thoughts.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Walaa.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
> >>>>>>>>>
> >>>>>>>>> One key point to keep in mind is that catalog names in the spec
> refer to logical catalogs—i.e., the first part of a three-part identifier.
> These correspond to Spark's DataSourceV2 catalogs, Trino connectors, and
> similar constructs. This is a level of abstraction above physical catalogs,
> which are not referenced or used in the view spec. The reason is that table
> identifiers in the view definition/text itself refer to logical catalogs,
> not physical ones (since they interface directly with the engine and not a
> specific metastore).
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Walaa.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <sungwy...@gmail.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Thank you Walaa for the proposal. I think view portability is a
> very important topic for us to continue discussing as it relies on many
> assumptions within the data ecosystem for it to function like you've
> highlighted well in the document.
> >>>>>>>>>>
> >>>>>>>>>> I've added a few comments around how this may impact the
> permission questions the engines will be asking, and whether that is the
> desired behavior.
> >>>>>>>>>>
> >>>>>>>>>> Sung
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've added a few
> comments to get a better understanding of what this will look like in the
> actual implementation.
> >>>>>>>>>>>
> >>>>>>>>>>> Eduard
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Everyone,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Starting this thread to resume our discussion on how to
> reference table identifiers from Iceberg metadata, a key aspect of the view
> specification, particularly in relation to the MV (materialized view)
> extensions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I had the chance to speak offline with a few community
> members to better understand how the current spec is being interpreted.
> Those conversations served as inputs to a new proposal on how table
> identifier references could be represented in metadata.
> >>>>>>>>>>>>
> >>>>>>>>>>>> You can find the proposal here [1]. I look forward to your
> feedback and working together to move this forward so we can finalize the
> MV spec as well.
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1]
> https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Walaa.
>
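The two fallback rules debated in this thread can be sketched in a few
lines of Python. This is only an illustration under stated assumptions, not
the Iceberg or Spark implementation: `ViewMetadata`, `resolve`, and the
`rule` values are hypothetical names, and namespaces are assumed to be
single-level so that a three-part identifier always starts with a catalog.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ViewMetadata:
    """Hypothetical, trimmed-down stand-in for the fields discussed above."""
    default_catalog: Optional[str]  # "default-catalog"; may be unset
    default_namespace: str          # "default-namespace" (single level assumed)


def resolve(identifier: str, view: ViewMetadata, view_catalog: str,
            session_catalog: str, rule: str = "view-catalog") -> Tuple[str, str, str]:
    """Expand a possibly partial identifier to (catalog, namespace, table).

    rule="view-catalog": fall back to the catalog the view was loaded from
                         (the current behavior as described in the thread).
    rule="session":      fall back to the session default catalog
                         (the proposal under discussion).
    """
    parts = identifier.split(".")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"unsupported identifier: {identifier!r}")
    if rule == "view-catalog":
        fallback = view_catalog
    elif rule == "session":
        fallback = session_catalog
    else:
        raise ValueError(f"unknown rule: {rule!r}")

    if len(parts) == 3:
        # Fully qualified: neither default applies.
        return (parts[0], parts[1], parts[2])

    # An explicit "default-catalog" always wins over either fallback.
    catalog = view.default_catalog or fallback
    namespace = parts[0] if len(parts) == 2 else view.default_namespace
    return (catalog, namespace, parts[-1])


# Jan's example: the view lives in catalogA, the session default is catalogB,
# and the view text refers to the partial identifier "sales.orders".
view = ViewMetadata(default_catalog=None, default_namespace="sales")
current = resolve("sales.orders", view, "catalogA", "catalogB")
proposed = resolve("sales.orders", view, "catalogA", "catalogB", rule="session")
print(current)   # ('catalogA', 'sales', 'orders')
print(proposed)  # ('catalogB', 'sales', 'orders')
```

With `default-catalog` unset, the same view text binds to a different
catalog under each rule, which is the behavior difference the thread is
weighing. Setting `default_catalog` explicitly makes both rules agree.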
