To help folks catch up on the latest discussions and interpretation of the spec, I have summarized everything we discussed so far at the top of the proposal document (here <https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0>). I have slightly updated the proposal to be in sync with the new interpretation to avoid confusion. In summary:
* Remove default-catalog and default-namespace fields from the view spec completely.
  * Hence, we do not attempt to define separate view-level default catalogs or namespaces. Instead:
    * If a table identifier inside a view lacks a catalog qualifier, engines should resolve it using the current engine catalog at query time.
* Reference table identifiers in the metadata exactly as they appear in the view SQL text.
  * If an identifier lacks the catalog part at creation, it should still lack a catalog in the stored metadata.
* Store UUIDs alongside table identifiers whenever possible.

Thanks,
Walaa.

On Fri, Apr 25, 2025 at 5:18 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Thanks for the contribution Benny! +1 to the confusion the fallback
> creates. Also just to be clear, at this point and after clarifying the
> current spec intentions, I am convinced that we should remove the default
> catalog and default namespace fields altogether.
>
> Thanks,
> Walaa.
>
> On Fri, Apr 25, 2025 at 5:13 PM Benny Chow <btc...@gmail.com> wrote:
>
>> I'd like to contribute my opinions on this:
>>
>> - I don't particularly like the current behavior of "default to the
>> view's catalog when default-catalog is not set". Fundamentally, I believe
>> the intent of default-catalog and default-namespace is there to help users
>> write more concise SQL.
>> - spark session catalog is engine specific and I don't think we should
>> design something that says first use this catalog, then that catalog.. or
>> that catalog. For example, resolving identifiers using default-catalog ->
>> view's catalog -> session catalog is not good.
>> - We gotta support non-Iceberg tables otherwise I see no value in putting
>> views in the catalog to share with other engines
>> - Interoperability between different engine types is very hard due to
>> dialect issues... so I think we should focus on supporting different
>> clusters of the same engine type on a shared catalog.
>> For example, AI and BI clusters on Spark sharing the same views in a REST catalog.
>>
>> Coincidentally, I think the ultimate solution is along the lines of
>> something Russell proposed last year:
>>
>> https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7
>>
>> We've been looking at this interoperable identifier problem through the
>> lens of catalog resolution but maybe the right approach is really about
>> templating.
>>
>> I would extend Russell's idea to allow identifiers in a view to span
>> catalogs to support non-Iceberg tables. Also, the default-catalog
>> property could be templated as well.
>>
>> Thoughts?
>> Benny
>>
>> On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>
>>> Thanks Steven! How do you recommend making Spark implementation conform
>>> to the spec? Do we need Spark SQL extensions and/or Spark catalog APIs for
>>> that?
>>>
>>> How do you recommend reconciling the inconsistencies I shared regarding
>>> many resolution methods not consistently being followed in different
>>> scenarios (view vs child table resolution, query vs view resolution)? Note
>>> these occur when the default catalog is set to a non-null value. If it
>>> helps, I can share concrete examples.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> The core issue is on the fall back behavior when `default-catalog` is
>>>> not defined. Current view spec says the fallback should be the catalog
>>>> where the view is defined. It doesn't really matter what the catalog
>>>> is named (catalogX) by the read engine.
>>>> - If a view refers to the tables in the same catalog, this is a
>>>> non-ambiguous and reasonable fallback behavior.
>>>> - If a view refers to tables from another catalog, catalog names
>>>> should be included in the reference name already. So no ambiguity
>>>> there either.
>>>>
>>>> Potential inconsistent naming of catalog is a separate problem, which
>>>> Iceberg view spec probably cannot solve. We can only recommend that
>>>> catalog should be named consistently across usage for better
>>>> interoperability on name references.
>>>>
>>>> This proposal is to change the fallback behavior to engine's session
>>>> default catalog. I am not sure it is better than the current fallback
>>>> behavior.
>>>>
>>>> > Today’s Spark behavior explicitly differs from this idea. Spark resolves table identifiers during view creation using the session’s default catalog, not a supplied `default-catalog`.
>>>>
>>>> I would argue that is a Spark implementation issue for not conforming
>>>> to the spec.
>>>>
>>>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>> >
>>>> > Hi Jan,
>>>> >
>>>> > Thanks again for continuing the discussion. I want to highlight a few fundamental issues around the interpretation of default-catalog:
>>>> >
>>>> > Here is the real catch:
>>>> >
>>>> > * default-catalog cannot logically be defined at view creation time. It would be circular: the view needs to exist before its metadata (and hence default-catalog) can exist. This is visible in Spark’s implementation, where `default-catalog` is not used.
>>>> >
>>>> > * Introducing a creation-time default-catalog setting would require extending SQL syntax and engine APIs to promote it to a first-class view concept. This would be intrusive, non-intuitive, and realistically very difficult to standardize across engines.
>>>> >
>>>> > * Today’s Spark behavior explicitly differs from this idea. Spark resolves table identifiers during view creation using the session’s default catalog, not a supplied `default-catalog`.
>>>> >
>>>> > * Hypothetically even if we patched in a creation-time default-catalog, it would create an inconsistent binding model between tables vs views (early vs late), and between tables in views and in queries (again early vs late). For example, views and tables in queries can withstand default catalog renames, but tables cannot when they are used inside views -- it even applies to views inside views, which makes this very hard to reason about considering nesting.
>>>> >
>>>> > Thanks,
>>>> > Walaa
>>>> >
>>>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>> >>
>>>> >> @Walaa:
>>>> >>
>>>> >> I would argue that when you run a CREATE VIEW statement the query engine knows which catalog the view is being created in. So even though we typically use late binding to resolve the view catalog at query time, it can also be used at creation time.
>>>> >>
>>>> >> The query engine would need to keep track of the "view catalog" in which the view is going to be created. It can use that catalog to resolve partial table identifiers if "default-catalog" is not set.
>>>> >>
>>>> >> It can lead to some unintuitive behavior, where partial identifiers in the view query resolve to a different catalog compared to using them outside of a view.
>>>> >>
>>>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
>>>> >>
>>>> >> If the session default catalog is not "catalogA", the "sales.orders" in the view query would not be the same as just referencing "sales.orders" in a normal SQL statement. This is because without a "default-catalog", the catalog name of "sales.orders" would default to "catalogA", which is the view's catalog.
>>>> >>
>>>> >> Thanks,
>>>> >>
>>>> >> Jan
>>>> >>
>>>> >> On 4/25/25 04:05, Manu Zhang wrote:
>>>> >>>
>>>> >>> For example, if we want to validate that the tables referenced in the view exist, how can we do that when default-catalog isn't defined, since the view hasn't been created or loaded yet?
>>>> >>
>>>> >> I don't think this is related to view spec. How do we validate that a table exists without a default catalog, or do we always use the current session catalog?
>>>> >>
>>>> >> Thanks,
>>>> >> Manu
>>>> >>
>>>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi Jan,
>>>> >>>
>>>> >>> I think we still share the same understanding. Just to clarify: when I referred to late binding as “similar” to the proposal, I was acknowledging the distinction between view-level and table-level resolution. But as you noted, both follow a late binding model.
>>>> >>>
>>>> >>> That said, this still raises an interesting question and a potential gap: if default-catalog is only defined at query time, how should resolution work during view creation? For example, if we want to validate that the tables referenced in the view exist, how can we do that when default-catalog isn't defined, since the view hasn't been created or loaded yet?
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Walaa.
>>>> >>>
>>>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>> >>>>
>>>> >>>> Yes, I have the same understanding. The view catalog is resolved at query time.
>>>> >>>>
>>>> >>>> As you mentioned before, it's good to distinguish between the physical catalog and its reference used in SQL statements. The important part is that the physical catalog of the view and the tables referenced in its definition stay consistent.
>>>> >>>> You could create a view in a given physical catalog by referring to it as "catalogA", as in your first point. If you then, given a different setup, refer to the same physical catalog as "catalogB" in another session/environment, the behavior should still work.
>>>> >>>>
>>>> >>>> I would however rephrase your last point. Late binding applies to the view catalog name and by extension to all partial table references when no "default-catalog" is present. Resolving the view catalog name at query time is not opposed to storing the view metadata in a catalog.
>>>> >>>>
>>>> >>>> Or maybe I don't entirely understand what you mean.
>>>> >>>>
>>>> >>>> Thanks
>>>> >>>>
>>>> >>>> Jan
>>>> >>>>
>>>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
>>>> >>>>
>>>> >>>> Hi Jan,
>>>> >>>>
>>>> >>>> > The view is executed when it's being referenced in a SQL statement. That statement contains the information for the query engine to resolve the catalog of the view.
>>>> >>>>
>>>> >>>> If I’m understanding correctly, that means:
>>>> >>>>
>>>> >>>> * If the view is queried as SELECT * FROM catalogA.namespace.view, then catalogA is considered the view’s catalog.
>>>> >>>>
>>>> >>>> * If the same view is later queried as SELECT * FROM catalogB.namespace.view (after renaming catalogA to catalogB, and keeping everything else the same), then catalogB becomes the view’s catalog.
>>>> >>>>
>>>> >>>> Is that interpretation correct? If so, it sounds to me like the catalog is resolved at query time, based on how the view is referenced, not from any stored metadata. That would imply some sort of a late binding behavior (similar to the proposal), as opposed to using some catalog that "stores" the view definition.
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Walaa
>>>> >>>>
>>>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>> >>>>>
>>>> >>>>> Hi Walaa,
>>>> >>>>>
>>>> >>>>> Thanks for clarifying the aspects of non-determinism. Let me try to address your questions.
>>>> >>>>>
>>>> >>>>> 1. This is my interpretation of the current spec: The view is executed when it's being referenced in a SQL statement. That statement contains the information for the query engine to resolve the catalog of the view. The query engine then uses that information to fetch the view metadata from the catalog. It also needs to temporarily keep track of which catalog it used to fetch the view metadata. It can then use that information to resolve the table references in the view's SQL definition in case no default catalog is specified.
>>>> >>>>>
>>>> >>>>> 2. The important part is that the catalog can be referenced at execution time. As long as that's the case I would assume the view can be created in any catalog.
>>>> >>>>>
>>>> >>>>> I think your point is really valuable because the current specification can lead to some unintuitive behavior. For example, take the following statement:
>>>> >>>>>
>>>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
>>>> >>>>>
>>>> >>>>> If the session default catalog is not "catalogA", the "sales.orders" in the view query would not be the same as just referencing "sales.orders" in a normal SQL statement. This is because without a "default-catalog", the catalog name of "sales.orders" would default to "catalogA".
>>>> >>>>>
>>>> >>>>> However, I like the current design of the view spec, because it has the "closure" property.
>>>> >>>>> Because the "view catalog" has to be known when executing a view, all the information required to resolve the table identifiers is contained in the view metadata (and the "view catalog"). I think that if you make the identifier resolution dependent on external parameters, it hinders portability.
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>>
>>>> >>>>> Jan
>>>> >>>>>
>>>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
>>>> >>>>>
>>>> >>>>> Hi Jan,
>>>> >>>>>
>>>> >>>>> Thanks for the thoughtful feedback.
>>>> >>>>>
>>>> >>>>> I think it’s important we clarify a key point before going deeper:
>>>> >>>>>
>>>> >>>>> Non-determinism is not caused by session fallback behavior; it’s a fundamental limitation of using table identifiers alone, regardless of whether we use the current rule, the proposed fallback to the session’s default catalog, or even early vs. late binding.
>>>> >>>>>
>>>> >>>>> The same fully qualified identifier (e.g., catalogA.namespace.table) can resolve to different objects depending solely on engine-specific routing logic or catalog aliases. So determinism isn’t guaranteed just because an identifier is "fully qualified." The only reliable anchor for identity is the UUID. That’s why the proposed use of UUIDs is not just a hardening strategy. It’s the actual fix for correctness.
>>>> >>>>>
>>>> >>>>> To move the conversation forward, could you help clarify two things in the context of the current spec:
>>>> >>>>>
>>>> >>>>> * Where in the metadata is the “view catalog” stored, so that an engine knows to fall back to it if default-catalog is null?
>>>> >>>>>
>>>> >>>>> * Are we even allowed to create views in the session's default catalog (i.e., without specifying a catalog) in the current Iceberg spec?
>>>> >>>>>
>>>> >>>>> These questions are important because if we can’t unambiguously recover the "view catalog" from metadata, then defaulting to it is problematic. And if views can't be created in the default catalog, then the fallback rule doesn’t generalize.
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>> Walaa.
>>>> >>>>>
>>>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi Walaa,
>>>> >>>>>>
>>>> >>>>>> thank you for your proposal. If I understood correctly, your proposal is composed of three parts:
>>>> >>>>>>
>>>> >>>>>> - session default catalog as fallback for "default-catalog"
>>>> >>>>>>
>>>> >>>>>> - session default namespace as fallback for "default-namespace"
>>>> >>>>>>
>>>> >>>>>> - Late binding + UUID validation
>>>> >>>>>>
>>>> >>>>>> I have some comments regarding these points.
>>>> >>>>>>
>>>> >>>>>> 1. Session default catalog as fallback for "default-catalog"
>>>> >>>>>>
>>>> >>>>>> Introducing a behavior that depends on the current session setup is in my opinion the definition of "non-determinism". You could be running the same query-engine and catalog-setup on different days, with different default session catalogs (which is rather common), and would be getting different results.
>>>> >>>>>>
>>>> >>>>>> Whereas with the current behavior, the view always produces the same results. The current behavior has some rough edges in very niche use cases but I think is solid for most use cases.
>>>> >>>>>>
>>>> >>>>>> 2. Session default namespace as fallback for "default-namespace"
>>>> >>>>>>
>>>> >>>>>> Similar to the above.
>>>> >>>>>>
>>>> >>>>>> 3. Late binding + UUID validation
>>>> >>>>>>
>>>> >>>>>> If I understand it correctly, the current implementation already uses late binding.
>>>> >>>>>>
>>>> >>>>>> Generally, having UUID validation makes the setup more robust. Which is great.
>>>> >>>>>> However, having UUID validation still requires us to have a portable table identifier specification. Even if we have the UUIDs of the referenced tables from the view, there simply isn't an interface that lets us use those UUIDs. The catalog interface is defined in terms of table identifiers.
>>>> >>>>>>
>>>> >>>>>> So we always require a working catalog setup and suitable table identifiers to obtain the table metadata. We can use the UUIDs to verify if we loaded the correct table. But this can only be done after we used some identifier. Which means there is no way of using UUIDs without a functioning catalog/identifier setup.
>>>> >>>>>>
>>>> >>>>>> In conclusion, I prefer the current behavior for "default-catalog" because it is more deterministic in my opinion. And I think the current spec does a good job for multi-engine table identifier resolution. I see the UUID validation as more of an additional hardening strategy.
>>>> >>>>>>
>>>> >>>>>> Thanks
>>>> >>>>>>
>>>> >>>>>> Jan
>>>> >>>>>>
>>>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
>>>> >>>>>>
>>>> >>>>>> Thanks Renjie!
>>>> >>>>>>
>>>> >>>>>> The existing spec has some guidance on resolving catalogs on the fly already (to address the case of view text with table identifiers missing the catalog part). The guidance is to use the catalog where the view is stored. But I find this rule hard to interpret or use. The catalog itself is a logical construct, such as a federated catalog that delegates to multiple physical backends (e.g., HMS and REST). In such cases, the catalog (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t physically store the tables; it only routes requests to underlying stores. Therefore, defaulting identifier resolution based on the catalog where the view is "stored" doesn’t align with how catalogs actually behave in practice.
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Walaa.
>>>> >>>>>>
>>>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Hi, Walaa:
>>>> >>>>>>>
>>>> >>>>>>> Thanks for the proposal.
>>>> >>>>>>>
>>>> >>>>>>> I've reviewed the doc, but in general I have some concerns with resolving catalog names on the fly with query engine defined catalog names. This introduces some flexibility at first glance, but also makes misconfiguration difficult to explain.
>>>> >>>>>>>
>>>> >>>>>>> But I agree with one part: that we should store the resolved table UUID in view metadata, as table/view renaming may introduce errors that are difficult for users to understand.
>>>> >>>>>>>
>>>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hi Everyone,
>>>> >>>>>>>>
>>>> >>>>>>>> Looking forward to keeping up the momentum and closing out the MV spec as well. I’m hoping we can proceed to a vote next week.
>>>> >>>>>>>>
>>>> >>>>>>>> Here is a summary in case that helps. The proposal outlines a strategy for handling table identifiers in Iceberg view metadata, with the goal of ensuring correctness, portability, and engine compatibility. It recommends resolving table identifiers at read time (late binding) rather than creation time, and introduces UUID-based validation to maintain identity guarantees across engines or sessions. It also revises how default-catalog and default-namespace are handled (defaulting both to the session context if not explicitly set) to better align with engine behavior and improve cross-engine interoperability.
>>>> >>>>>>>>
>>>> >>>>>>>> Please let me know your thoughts.
>>>> >>>>>>>>
>>>> >>>>>>>> Thanks,
>>>> >>>>>>>> Walaa.
>>>> >>>>>>>>
>>>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
>>>> >>>>>>>>>
>>>> >>>>>>>>> One key point to keep in mind is that catalog names in the spec refer to logical catalogs, i.e., the first part of a three-part identifier. These correspond to Spark's DataSourceV2 catalogs, Trino connectors, and similar constructs. This is a level of abstraction above physical catalogs, which are not referenced or used in the view spec. The reason is that table identifiers in the view definition/text itself refer to logical catalogs, not physical ones (since they interface directly with the engine and not a specific metastore).
>>>> >>>>>>>>>
>>>> >>>>>>>>> Thanks,
>>>> >>>>>>>>> Walaa.
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <sungwy...@gmail.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Thank you Walaa for the proposal. I think view portability is a very important topic for us to continue discussing, as it relies on many assumptions within the data ecosystem for it to function, as you've highlighted well in the document.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> I've added a few comments around how this may impact the permission questions the engines will be asking, and whether that is the desired behavior.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Sung
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've added a few comments to get a better understanding of what this will look like in the actual implementation.
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Eduard
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Hi Everyone,
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Starting this thread to resume our discussion on how to reference table identifiers from Iceberg metadata, a key aspect of the view specification, particularly in relation to the MV (materialized view) extensions.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> I had the chance to speak offline with a few community members to better understand how the current spec is being interpreted. Those conversations served as inputs to a new proposal on how table identifier references could be represented in metadata.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> You can find the proposal here [1]. I look forward to your feedback and working together to move this forward so we can finalize the MV spec as well.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> [1] https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> Thanks,
>>>> >>>>>>>>>>>> Walaa.
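
For readers who prefer code to prose, the rules summarized at the top of this thread can be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the proposal or of any Iceberg API: `TableRef`, `Table`, `qualify`, and `resolve` are hypothetical names, the `tables` dictionary stands in for whatever lookup an engine performs, and identifiers are assumed to be either `namespace.table` or `catalog.namespace.table` (multi-level namespaces are ignored). It also reflects the point raised in the discussion that a UUID can only be checked after a table has been loaded by identifier.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TableRef:
    """A table reference as stored in view metadata (hypothetical shape)."""
    identifier: str              # stored exactly as written in the view SQL text
    uuid: Optional[str] = None   # table UUID recorded alongside it, if known


@dataclass(frozen=True)
class Table:
    """Stand-in for a loaded table; only the UUID matters in this sketch."""
    uuid: str


def qualify(identifier: str, current_catalog: str) -> str:
    """Prefix the engine's current catalog when the identifier has none.

    Assumption: three or more dot-separated parts means the catalog
    is already present.
    """
    if len(identifier.split(".")) >= 3:
        return identifier
    return f"{current_catalog}.{identifier}"


def resolve(ref: TableRef, current_catalog: str, tables: Dict[str, Table]) -> Table:
    """Late-bind the reference at query time, then verify the stored UUID."""
    qualified = qualify(ref.identifier, current_catalog)
    table = tables[qualified]  # stands in for an engine/catalog lookup
    if ref.uuid is not None and table.uuid != ref.uuid:
        raise ValueError(
            f"{qualified} resolved to table {table.uuid}, expected {ref.uuid}"
        )
    return table
```

With a reference stored as `TableRef("sales.orders", uuid="uuid-1")`, resolving under a current catalog whose `sales.orders` carries a different UUID raises an error rather than silently reading a different table, which is the role the UUID plays in the summary.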