Re: [DISCUSS] Table Identifiers in Iceberg View Spec

Walaa Eldin Moustafa Fri, 25 Apr 2025 17:19:42 -0700

Thanks for the contribution Benny! +1 to the confusion the fallback
creates. Also just to be clear, at this point and after clarifying the
current spec intentions, I am convinced that we should remove the default
catalog and default namespace fields altogether.


Thanks,
Walaa.

On Fri, Apr 25, 2025 at 5:13 PM Benny Chow <[email protected]> wrote:

> I'd like to contribute my opinions on this:
>
> - I don't particularly like the current behavior of "default to the view's
> catalog when default-catalog is not set".  Fundamentally, I believe the
> intent of default-catalog and default-namespace is there to help users
> write more concise SQL.
> - Spark session catalog is engine specific and I don't think we should
> design something that says first use this catalog, then that catalog... or
> that catalog.  For example, resolving identifiers using default-catalog ->
> view's catalog -> session catalog is not good.
> - We gotta support non-Iceberg tables otherwise I see no value in putting
> views in the catalog to share with other engines
> - Interoperability between different engine types is very hard due to
> dialect issues... so I think we should focus on supporting different
> clusters of the same engine type on a shared catalog.  For example, AI and
> BI clusters on Spark sharing the same views in a REST catalog.
>
> Coincidentally, I think the ultimate solution is along the lines of
> something Russell proposed last year:
>
> https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7
>
> We've been looking at this interoperable identifier problem through the
> lens of catalog resolution but maybe the right approach is really about
> templating.
>
> I would extend Russell's idea to allow identifiers in a view to span
> catalogs to support non-Iceberg tables.   Also, the default-catalog
> property could be templated as well.
>
> Thoughts?
> Benny
>
>
>
> On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa <
> [email protected]> wrote:
>
>> Thanks Steven! How do you recommend making Spark implementation conform
>> to the spec? Do we need Spark SQL extensions and/or Spark catalog APIs for
>> that?
>>
>> How do you recommend reconciling the inconsistencies I shared regarding
>> many resolution methods not consistently being followed in different
>> scenarios (view vs child table resolution, query vs view resolution)? Note
>> these occur when the default catalog is set to a non-null value. If it
>> helps, I can share concrete examples.
>>
>> Thanks,
>> Walaa.
>>
>> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu <[email protected]> wrote:
>>
>>> The core issue is on the fall back behavior when `default-catalog` is
>>> not defined. Current view spec says the fallback should be the catalog
>>> where the view is defined. It doesn't really matter what the catalog
>>> is named (catalogX) by the read engine.
>>> - If a view refers to the tables in the same catalog, this is a
>>> non-ambiguous and reasonable fallback behavior.
>>> - If a view refers to tables from another catalog, catalog names
>>> should be included in the reference name already. So no ambiguity
>>> there either.
>>>
>>> Potential inconsistent naming of catalog is a separate problem, which
>>> Iceberg view spec probably cannot solve. We can only recommend that
>>> catalog should be named consistently across usage for better
>>> interoperability on name references.
>>>
>>> This proposal is to change the fallback behavior to engine's session
>>> default catalog. I am not sure it is better than the current fallback
>>> behavior.
>>>
>>> > Today’s Spark behavior explicitly differs from this idea. Spark
>>> resolves table identifiers during view creation using the session’s default
>>> catalog, not a supplied `default-catalog`.
>>>
>>> I would argue that is a Spark implementation issue for not conforming
>>> to the spec.
>>>
>>>
>>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa
>>> <[email protected]> wrote:
>>> >
>>> > Hi Jan,
>>> >
>>> > Thanks again for continuing the discussion. I want to highlight a few
>>> fundamental issues around the interpretation of default-catalog:
>>> >
>>> > Here is the real catch:
>>> >
>>> > * default-catalog cannot logically be defined at view creation time.
>>> It would be circular: the view needs to exist before its metadata (and
>>> hence default-catalog) can exist. This is visible in Spark’s
>>> implementation, where `default-catalog` is not used.
>>> >
>>> > * Introducing a creation-time default-catalog setting would require
>>> extending SQL syntax and engine APIs to promote it to a first-class view
>>> concept. This would be intrusive, non-intuitive, and realistically very
>>> difficult to standardize across engines.
>>> >
>>> > * Today’s Spark behavior explicitly differs from this idea. Spark
>>> resolves table identifiers during view creation using the session’s default
>>> catalog, not a supplied `default-catalog`.
>>> >
>>> > * Hypothetically even if we patched in a creation-time
>>> default-catalog, it would create an inconsistent binding model between
>>> tables vs views (early vs late), and between tables in views and in queries
>>> (again early vs late). For example, views and tables in queries can
>>> withstand default catalog renames, but tables cannot when they are used
>>> inside views -- it even applies to views inside views, which makes this
>>> very hard to reason about considering nesting.
>>> >
>>> > Thanks,
>>> > Walaa
>>> >
>>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul <[email protected]>
>>> wrote:
>>> >>
>>> >> @Walaa:
>>> >>
>>> >> I would argue that when you run a CREATE VIEW statement the query
>>> engine knows which catalog the view is being created in. So even though we
>>> typically use late binding to resolve the view catalog at query time, it
>>> can also be used at creation time.
>>> >>
>>> >> The query engine would need to keep track of the "view catalog" where
>>> the view is going to be created in. It can use that catalog to resolve
>>> partial table identifiers if "default-catalog" is not set.
>>> >>
>>> >> It can lead to some unintuitive behavior, where partial identifiers
>>> in the view query resolve to a different catalog compared to using them
>>> outside of a view.
>>> >>
>>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from
>>> sales.orders;
>>> >>
>>> >> If the session default catalog is not "catalogA", the "sales.orders"
>>> in the view query would not be the same as just referencing "sales.orders"
>>> in a normal SQL statement. This is because without a "default-catalog", the
>>> catalog name of "sales.orders" would default to "catalogA", which is the
>>> view's catalog.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Jan
>>> >>
>>> >> On 4/25/25 04:05, Manu Zhang wrote:
>>> >>>
>>> >>> For example, if we want to validate that the tables referenced in
>>> the view exist, how can we do that when default-catalog isn't defined,
>>> since the view hasn't been created or loaded yet?
>>> >>
>>> >> I don't think this is related to view spec. How do we validate that a
>>> table exists without a default catalog, or do we always use the current
>>> session catalog?
>>> >>
>>> >> Thanks,
>>> >> Manu
>>> >>
>>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>> >>>
>>> >>> Hi Jan,
>>> >>>
>>> >>> I think we still share the same understanding. Just to clarify: when
>>> I referred to late binding as “similar” to the proposal, I was
>>> acknowledging the distinction between view-level and table-level
>>> resolution. But as you noted, both follow a late binding model.
>>> >>>
>>> >>> That said, this still raises an interesting question and a potential
>>> gap: if default-catalog is only defined at query time, how should
>>> resolution work during view creation? For example, if we want to validate
>>> that the tables referenced in the view exist, how can we do that when
>>> default-catalog isn't defined, since the view hasn't been created or loaded
>>> yet?
>>> >>>
>>> >>> Thanks,
>>> >>> Walaa.
>>> >>>
>>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul <[email protected]>
>>> wrote:
>>> >>>>
>>> >>>> Yes, I have the same understanding. The view catalog is resolved at
>>> query time.
>>> >>>>
>>> >>>> As you mentioned before, it's good to distinguish between the
>>> physical catalog and its reference used in SQL statements. The important
>>> part is that the physical catalog of the view and the tables referenced in
>>> its definition stay consistent. You could create a view in a given
>>> physical catalog by referring to it as "catalogA", as in your first point.
>>> If you then, given a different setup, refer to the same physical catalog as
>>> "catalogB" in another session/environment, the behavior should still work.
>>> >>>>
>>> >>>> I would however rephrase your last point. Late binding applies to
>>> the view catalog name and by extension to all partial table references when
>>> no "default-catalog" is present. Resolving the view catalog name at query
>>> time is not opposed to storing the view metadata in a catalog.
>>> >>>>
>>> >>>> Or maybe I don't entirely understand what you mean.
>>> >>>>
>>> >>>> Thanks
>>> >>>>
>>> >>>> Jan
>>> >>>>
>>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
>>> >>>>
>>> >>>> Hi Jan,
>>> >>>>
>>> >>>> > The view is executed when it's being referenced in a SQL
>>> statement. That statement contains the information for the query engine to
>>> resolve the catalog of the view.
>>> >>>>
>>> >>>> If I’m understanding correctly, that means:
>>> >>>>
>>> >>>> * If the view is queried as SELECT * FROM catalogA.namespace.view,
>>> then catalogA is considered the view’s catalog.
>>> >>>>
>>> >>>> * If the same view is later queried as SELECT * FROM
>>> catalogB.namespace.view (after renaming catalogA to catalogB, and keeping
>>> everything else the same), then catalogB becomes the view’s catalog.
>>> >>>>
>>> >>>> Is that interpretation correct? If so, it sounds to me like the
>>> catalog is resolved at query time, based on how the view is referenced, not
>>> from any stored metadata. That would imply some sort of a late binding
>>> behavior (similar to the proposal), as opposed to using some catalog that
>>> "stores" the view definition.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Walaa
>>> >>>>
>>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul
>>> <[email protected]> wrote:
>>> >>>>>
>>> >>>>> Hi Walaa,
>>> >>>>>
>>> >>>>> Thanks for clarifying the aspects of non-determinism. Let me try
>>> to address your questions.
>>> >>>>>
>>> >>>>> 1. This is my interpretation of the current spec: The view is
>>> executed when it's being referenced in a SQL statement. That statement
>>> contains the information for the query engine to resolve the catalog of the
>>> view. The query engine then uses that information to fetch the view
>>> metadata from the catalog. It also needs to temporarily keep track of which
>>> catalog it used to fetch the view metadata. It can then use that
>>> information to resolve the table references in the view's SQL definition in
>>> case no default catalog is specified.
>>> >>>>>
>>> >>>>> 2. The important part is that the catalog can be referenced at
>>> execution time. As long as that's the case I would assume the view can be
>>> created in any catalog.
>>> >>>>>
>>> >>>>>
>>> >>>>> I think your point is really valuable because the current
>>> specification can lead to some unintuitive behavior. For example for the
>>> following statement:
>>> >>>>>
>>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from
>>> sales.orders;
>>> >>>>>
>>> >>>>> If the session default catalog is not "catalogA", the
>>> "sales.orders" in the view query would not be the same as just referencing
>>> "sales.orders" in a normal SQL statement. This is because without a
>>> "default-catalog", the catalog name of "sales.orders" would default to
>>> "catalogA".
>>> >>>>>
>>> >>>>>
>>> >>>>> However, I like the current design of the view spec, because it
>>> has the "closure" property. Because of the fact that the "view catalog" has
>>> to be known when executing a view, all the information required to resolve
>>> the table identifiers is contained in the view metadata (and the "view
>>> catalog"). I think that if you make the identifier resolution dependent on
>>> external parameters, it hinders portability.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>>
>>> >>>>> Jan
>>> >>>>>
>>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
>>> >>>>>
>>> >>>>> Hi Jan,
>>> >>>>>
>>> >>>>> Thanks for the thoughtful feedback.
>>> >>>>>
>>> >>>>> I think it’s important we clarify a key point before going deeper:
>>> >>>>>
>>> >>>>> Non-determinism is not caused by session fallback behavior—it’s a
>>> fundamental limitation of using table identifiers alone, regardless of
>>> whether we use the current rule, the proposed fallback to the session’s
>>> default catalog, or even early vs. late binding.
>>> >>>>>
>>> >>>>> The same fully qualified identifier (e.g.,
>>> catalogA.namespace.table) can resolve to different objects depending solely
>>> on engine-specific routing logic or catalog aliases. So determinism isn’t
>>> guaranteed just because an identifier is "fully qualified." The only
>>> reliable anchor for identity is the UUID. That’s why the proposed use of
>>> UUIDs is not just a hardening strategy. It’s the actual fix for correctness.
>>> >>>>>
>>> >>>>> To move the conversation forward, could you help clarify two
>>> things in the context of the current spec:
>>> >>>>>
>>> >>>>> * Where in the metadata is the “view catalog” stored, so that an
>>> engine knows to fall back to it if default-catalog is null?
>>> >>>>>
>>> >>>>> * Are we even allowed to create views in the session's default
>>> catalog (i.e., without specifying a catalog) in the current Iceberg spec?
>>> >>>>>
>>> >>>>> These questions are important because if we can’t unambiguously
>>> recover the "view catalog" from metadata, then defaulting to it is
>>> problematic. And if views can't be created in the default catalog, then the
>>> fallback rule doesn’t generalize.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Walaa.
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul
>>> <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>> Hi Walaa,
>>> >>>>>>
>>> >>>>>> thank you for your proposal. If I understood correctly, your
>>> proposal is composed of three parts:
>>> >>>>>>
>>> >>>>>> - session default catalog as fallback for "default-catalog"
>>> >>>>>>
>>> >>>>>> - session default namespace as fallback for "default-namespace"
>>> >>>>>>
>>> >>>>>> - Late binding + UUID validation
>>> >>>>>>
>>> >>>>>> I have some comments regarding these points.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> 1. Session default catalog as fallback for "default-catalog"
>>> >>>>>>
>>> >>>>>> Introducing a behavior that depends on the current session setup
>>> is in my opinion the definition of "non-determinism". You could be running
>>> the same query-engine and catalog-setup on different days, with different
>>> default session catalogs (which is rather common), and would be getting
>>> different results.
>>> >>>>>>
>>> >>>>>> Whereas with the current behavior, the view always produces the
>>> same results. The current behavior has some rough edges in very niche use
>>> cases but I think is solid for most use cases.
>>> >>>>>>
>>> >>>>>> 2. Session default namespace as fallback for "default-namespace"
>>> >>>>>>
>>> >>>>>> Similar to the above.
>>> >>>>>>
>>> >>>>>> 3. Late binding + UUID validation
>>> >>>>>>
>>> >>>>>> If I understand it correctly, the current implementation already
>>> uses late binding.
>>> >>>>>>
>>> >>>>>> Generally, having UUID validation makes the setup more robust.
>>> Which is great. However, having UUID validation still requires us to have a
>>> portable table identifier specification. Even if we have the UUIDs of the
>>> referenced tables from the view, there simply isn't an interface that lets
>>> us use those UUIDs. The catalog interface is defined in terms of table
>>> identifiers.
>>> >>>>>>
>>> >>>>>> So we always require a working catalog setup and suitable table
>>> identifiers to obtain the table metadata. We can use the UUIDs to verify if
>>> we loaded the correct table. But this can only be done after we used some
>>> identifier. Which means there is no way of using UUIDs without a
>>> functioning catalog/identifier setup.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> In conclusion, I prefer the current behavior for
>>> "default-catalog" because it is more deterministic in my opinion. And I
>>> think the current spec does a good job for multi-engine table identifier
>>> resolution. I see the UUID validation more of an additional hardening
>>> strategy.
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>>
>>> >>>>>> Jan
>>> >>>>>>
>>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
>>> >>>>>>
>>> >>>>>> Thanks Renjie!
>>> >>>>>>
>>> >>>>>> The existing spec has some guidance on resolving catalogs on the
>>> fly already (to address the case of view text with table identifiers
>>> missing the catalog part). The guidance is to use the catalog where the
>>> view is stored. But I find this rule hard to interpret or use. The catalog
>>> itself is a logical construct—such as a federated catalog that delegates to
>>> multiple physical backends (e.g., HMS and REST). In such cases, the catalog
>>> (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t physically
>>> store the tables; it only routes requests to underlying stores. Therefore,
>>> defaulting identifier resolution based on the catalog where the view is
>>> "stored" doesn’t align with how catalogs actually behave in practice.
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> Walaa.
>>> >>>>>>
>>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <
>>> [email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi, Walaa:
>>> >>>>>>>
>>> >>>>>>> Thanks for the proposal.
>>> >>>>>>>
>>> >>>>>>> I've reviewed the doc, but in general I have some concerns with
>>> resolving catalog names on the fly with query engine defined catalog names.
>>> This introduces some flexibility at first glance, but also makes
>>> misconfiguration difficult to explain.
>>> >>>>>>>
>>> >>>>>>> But I agree with one part that we should store resolved table
>>> uuid in view metadata, as table/view renaming may introduce errors that are
>>> difficult for users to understand.
>>> >>>>>>>
>>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Hi Everyone,
>>> >>>>>>>>
>>> >>>>>>>> Looking forward to keeping up the momentum and closing out the
>>> MV spec as well. I’m hoping we can proceed to a vote next week.
>>> >>>>>>>>
>>> >>>>>>>> Here is a summary in case that helps. The proposal outlines a
>>> strategy for handling table identifiers in Iceberg view metadata, with the
>>> goal of ensuring correctness, portability, and engine compatibility. It
>>> recommends resolving table identifiers at read time (late binding) rather
>>> than creation time, and introduces UUID-based validation to maintain
>>> identity guarantees across engines or sessions. It also revises how
>>> default-catalog and default-namespace are handled (defaulting both to the
>>> session context if not explicitly set) to better align with engine behavior
>>> and improve cross-engine interoperability.
>>> >>>>>>>>
>>> >>>>>>>> Please let me know your thoughts.
>>> >>>>>>>>
>>> >>>>>>>> Thanks,
>>> >>>>>>>> Walaa.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
>>> >>>>>>>>>
>>> >>>>>>>>> One key point to keep in mind is that catalog names in the
>>> spec refer to logical catalogs—i.e., the first part of a three-part
>>> identifier. These correspond to Spark's DataSourceV2 catalogs, Trino
>>> connectors, and similar constructs. This is a level of abstraction above
>>> physical catalogs, which are not referenced or used in the view spec. The
>>> reason is that table identifiers in the view definition/text itself refer
>>> to logical catalogs, not physical ones (since they interface directly with
>>> the engine and not a specific metastore).
>>> >>>>>>>>>
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> Walaa.
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <[email protected]>
>>> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Thank you Walaa for the proposal. I think view portability is
>>> a very important topic for us to continue discussing as it relies on many
>>> assumptions within the data ecosystem for it to function like you've
>>> highlighted well in the document.
>>> >>>>>>>>>>
>>> >>>>>>>>>> I've added a few comments around how this may impact the
>>> permission questions the engines will be asking, and whether that is the
>>> desired behavior.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Sung
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <
>>> [email protected]> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've added a few
>>> comments to get a better understanding of what this will look like in the
>>> actual implementation.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Eduard
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Hi Everyone,
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Starting this thread to resume our discussion on how to
>>> reference table identifiers from Iceberg metadata, a key aspect of the view
>>> specification, particularly in relation to the MV (materialized view)
>>> extensions.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> I had the chance to speak offline with a few community
>>> members to better understand how the current spec is being interpreted.
>>> Those conversations served as inputs to a new proposal on how table
>>> identifier references could be represented in metadata.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> You can find the proposal here [1]. I look forward to your
>>> feedback and working together to move this forward so we can finalize the
>>> MV spec as well.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> [1]
>>> https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>> Walaa.
>>>
>>
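
For readers following the thread, the fallback rule under debate (use `default-catalog` when it is set, otherwise the catalog the view was loaded from) and the UUID check that can only run after an identifier-based load can be sketched roughly as below. This is an illustrative sketch, not Iceberg or Spark code: `ViewVersion`, `resolve`, `load_and_validate`, and the `load_table_uuid` callback are hypothetical names, and the sketch assumes single-level namespaces so that the number of identifier parts alone tells which components are missing. Storing table UUIDs with the view is part of the proposal, not the current spec.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Identifier = Tuple[str, ...]


@dataclass(frozen=True)
class ViewVersion:
    # Hypothetical stand-in for the two view-version fields discussed in the thread.
    default_namespace: Tuple[str, ...]
    default_catalog: Optional[str] = None


def resolve(ref: Identifier, version: ViewVersion, view_catalog: str) -> Identifier:
    """Expand a partial reference to (catalog, namespace, table).

    Simplifying assumption: namespaces have exactly one level, so 1 part is a
    bare table, 2 parts are namespace.table, 3 parts are fully qualified.
    """
    if len(ref) == 3:
        return ref  # already carries a catalog name; no fallback involved
    # Current-spec fallback: an unset default-catalog means "the view's catalog".
    catalog = version.default_catalog
    if catalog is None:
        catalog = view_catalog
    if len(ref) == 2:
        return (catalog, ref[0], ref[1])
    if len(ref) == 1:
        if len(version.default_namespace) != 1:
            raise ValueError("sketch only handles single-level namespaces")
        return (catalog, version.default_namespace[0], ref[0])
    raise ValueError(f"unsupported identifier: {ref!r}")


def load_and_validate(
    ident: Identifier,
    expected_uuid: str,
    load_table_uuid: Callable[[Identifier], str],
) -> str:
    """Load by identifier first, then compare against the UUID recorded earlier.

    The UUID cannot replace the identifier: the (hypothetical) catalog callback
    is keyed by identifier, so validation happens only after the load.
    """
    actual = load_table_uuid(ident)
    if actual != expected_uuid:
        raise ValueError(f"{'.'.join(ident)} resolved to {actual}, expected {expected_uuid}")
    return actual


# Jan's example: the view lives in catalogA and has no default-catalog.
version = ViewVersion(default_namespace=("sales",))
print(resolve(("sales", "orders"), version, view_catalog="catalogA"))
# prints ('catalogA', 'sales', 'orders')
```

Passing a different `view_catalog` (for example after the engine renames catalogA to catalogB) changes only the first component of the result, which is the late-binding behavior the thread describes.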
