I'd like to contribute my opinions on this:

- I don't particularly like the current behavior of "default to the view's catalog when default-catalog is not set". Fundamentally, I believe default-catalog and default-namespace are there to help users write more concise SQL.
- The Spark session catalog is engine-specific, and I don't think we should design something that says "first use this catalog, then that catalog, or that other catalog". For example, resolving identifiers using default-catalog -> view's catalog -> session catalog is not good.
- We have to support non-Iceberg tables; otherwise I see no value in putting views in the catalog to share with other engines.
- Interoperability between different engine types is very hard due to dialect issues, so I think we should focus on supporting different clusters of the same engine type on a shared catalog. For example, AI and BI clusters on Spark sharing the same views in a REST catalog.
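To make the fallback rule under discussion concrete, here is a minimal Python sketch of the current behavior as it is described in this thread: use default-catalog when it is set, otherwise fall back to the catalog the view was loaded from. The function name, its parameters, and the assumption of single-level namespaces are illustrative only; they do not come from the Iceberg spec or from any engine's API.

```python
from typing import Optional


def resolve_identifier(
    identifier: str,
    view_catalog: str,
    default_namespace: str,
    default_catalog: Optional[str] = None,
) -> str:
    """Fully qualify a table identifier found in a view's SQL text.

    Hypothetical helper. Assumes single-level namespaces, so an
    identifier has at most three parts: catalog.namespace.table.
    """
    parts = identifier.split(".")
    if len(parts) > 3 or any(not p for p in parts):
        raise ValueError(f"unsupported identifier: {identifier!r}")

    # Already fully qualified: nothing to fill in.
    if len(parts) == 3:
        return identifier

    # Current rule per this thread: default-catalog wins when set,
    # otherwise fall back to the catalog the view was loaded from.
    catalog = default_catalog if default_catalog is not None else view_catalog

    if len(parts) == 2:
        return f"{catalog}.{identifier}"
    return f"{catalog}.{default_namespace}.{identifier}"


# Jan's example: a view created as catalogA.sales.monthly_orders that
# selects from sales.orders, with no default-catalog in the metadata.
print(resolve_identifier("sales.orders", view_catalog="catalogA", default_namespace="sales"))
```

Note that the session's default catalog never appears as an input here, which is the property Jan calls "closure" and which the proposal under discussion would change.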
Coincidentally, I think the ultimate solution is along the lines of something Russell proposed last year: https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7

We've been looking at this interoperable identifier problem through the lens of catalog resolution, but maybe the right approach is really about templating. I would extend Russell's idea to allow identifiers in a view to span catalogs in order to support non-Iceberg tables. The default-catalog property could be templated as well.

Thoughts?

Benny

On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Thanks Steven! How do you recommend making the Spark implementation conform to the spec? Do we need Spark SQL extensions and/or Spark catalog APIs for that?
>
> How do you recommend reconciling the inconsistencies I shared regarding many resolution methods not consistently being followed in different scenarios (view vs child table resolution, query vs view resolution)? Note these occur when the default catalog is set to a non-null value. If it helps, I can share concrete examples.
>
> Thanks,
> Walaa.
>
> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu <stevenz...@gmail.com> wrote:
>
>> The core issue is the fallback behavior when `default-catalog` is not defined. The current view spec says the fallback should be the catalog where the view is defined. It doesn't really matter what the catalog is named (catalogX) by the read engine.
>> - If a view refers to tables in the same catalog, this is a non-ambiguous and reasonable fallback behavior.
>> - If a view refers to tables from another catalog, catalog names should be included in the reference name already. So no ambiguity there either.
>>
>> Potential inconsistent naming of catalogs is a separate problem, which the Iceberg view spec probably cannot solve. We can only recommend that catalogs be named consistently across usage for better interoperability on name references.
>>
>> This proposal is to change the fallback behavior to the engine's session default catalog. I am not sure it is better than the current fallback behavior.
>>
>> > Today’s Spark behavior explicitly differs from this idea. Spark resolves table identifiers during view creation using the session’s default catalog, not a supplied `default-catalog`.
>>
>> I would argue that is a Spark implementation issue for not conforming to the spec.
>>
>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >
>> > Hi Jan,
>> >
>> > Thanks again for continuing the discussion. I want to highlight a few fundamental issues around the interpretation of default-catalog:
>> >
>> > Here is the real catch:
>> >
>> > * default-catalog cannot logically be defined at view creation time. It would be circular: the view needs to exist before its metadata (and hence default-catalog) can exist. This is visible in Spark’s implementation, where `default-catalog` is not used.
>> >
>> > * Introducing a creation-time default-catalog setting would require extending SQL syntax and engine APIs to promote it to a first-class view concept. This would be intrusive, non-intuitive, and realistically very difficult to standardize across engines.
>> >
>> > * Today’s Spark behavior explicitly differs from this idea. Spark resolves table identifiers during view creation using the session’s default catalog, not a supplied `default-catalog`.
>> >
>> > * Hypothetically, even if we patched in a creation-time default-catalog, it would create an inconsistent binding model between tables vs views (early vs late), and between tables in views and in queries (again early vs late). For example, views and tables in queries can withstand default catalog renames, but tables cannot when they are used inside views. This even applies to views inside views, which makes it very hard to reason about once nesting is considered.
>> >
>> > Thanks,
>> > Walaa
>> >
>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>> >>
>> >> @Walaa:
>> >>
>> >> I would argue that when you run a CREATE VIEW statement the query engine knows which catalog the view is being created in. So even though we typically use late binding to resolve the view catalog at query time, it can also be used at creation time.
>> >>
>> >> The query engine would need to keep track of the "view catalog" where the view is going to be created. It can use that catalog to resolve partial table identifiers if "default-catalog" is not set.
>> >>
>> >> It can lead to some unintuitive behavior, where partial identifiers in the view query resolve to a different catalog compared to using them outside of a view.
>> >>
>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
>> >>
>> >> If the session default catalog is not "catalogA", the "sales.orders" in the view query would not be the same as just referencing "sales.orders" in a normal SQL statement. This is because without a "default-catalog", the catalog name of "sales.orders" would default to "catalogA", which is the view's catalog.
>> >>
>> >> Thanks,
>> >>
>> >> Jan
>> >>
>> >> On 4/25/25 04:05, Manu Zhang wrote:
>> >>>
>> >>> For example, if we want to validate that the tables referenced in the view exist, how can we do that when default-catalog isn't defined, since the view hasn't been created or loaded yet?
>> >>
>> >> I don't think this is related to the view spec. How do we validate that a table exists without a default catalog, or do we always use the current session catalog?
>> >>
>> >> Thanks,
>> >> Manu
>> >>
>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >>>
>> >>> Hi Jan,
>> >>>
>> >>> I think we still share the same understanding. Just to clarify: when I referred to late binding as “similar” to the proposal, I was acknowledging the distinction between view-level and table-level resolution. But as you noted, both follow a late binding model.
>> >>>
>> >>> That said, this still raises an interesting question and a potential gap: if default-catalog is only defined at query time, how should resolution work during view creation? For example, if we want to validate that the tables referenced in the view exist, how can we do that when default-catalog isn't defined, since the view hasn't been created or loaded yet?
>> >>>
>> >>> Thanks,
>> >>> Walaa.
>> >>>
>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>> >>>>
>> >>>> Yes, I have the same understanding. The view catalog is resolved at query time.
>> >>>>
>> >>>> As you mentioned before, it's good to distinguish between the physical catalog and its reference used in SQL statements. The important part is that the physical catalog of the view and the tables referenced in its definition stay consistent. You could create a view in a given physical catalog by referring to it as "catalogA", as in your first point. If you then, given a different setup, refer to the same physical catalog as "catalogB" in another session/environment, the behavior should still work.
>> >>>>
>> >>>> I would however rephrase your last point. Late binding applies to the view catalog name and by extension to all partial table references when no "default-catalog" is present. Resolving the view catalog name at query time is not opposed to storing the view metadata in a catalog.
>> >>>>
>> >>>> Or maybe I don't entirely understand what you mean.
>> >>>>
>> >>>> Thanks
>> >>>>
>> >>>> Jan
>> >>>>
>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
>> >>>>
>> >>>> Hi Jan,
>> >>>>
>> >>>> > The view is executed when it's being referenced in a SQL statement. That statement contains the information for the query engine to resolve the catalog of the view.
>> >>>>
>> >>>> If I’m understanding correctly, that means:
>> >>>>
>> >>>> * If the view is queried as SELECT * FROM catalogA.namespace.view, then catalogA is considered the view’s catalog.
>> >>>>
>> >>>> * If the same view is later queried as SELECT * FROM catalogB.namespace.view (after renaming catalogA to catalogB, and keeping everything else the same), then catalogB becomes the view’s catalog.
>> >>>>
>> >>>> Is that interpretation correct? If so, it sounds to me like the catalog is resolved at query time, based on how the view is referenced, not from any stored metadata. That would imply some sort of late binding behavior (similar to the proposal), as opposed to using some catalog that "stores" the view definition.
>> >>>>
>> >>>> Thanks,
>> >>>> Walaa
>> >>>>
>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>> >>>>>
>> >>>>> Hi Walaa,
>> >>>>>
>> >>>>> Thanks for clarifying the aspects of non-determinism. Let me try to address your questions.
>> >>>>>
>> >>>>> 1. This is my interpretation of the current spec: The view is executed when it's being referenced in a SQL statement. That statement contains the information for the query engine to resolve the catalog of the view. The query engine then uses that information to fetch the view metadata from the catalog. It also needs to temporarily keep track of which catalog it used to fetch the view metadata. It can then use that information to resolve the table references in the view's SQL definition in case no default catalog is specified.
>> >>>>>
>> >>>>> 2. The important part is that the catalog can be referenced at execution time. As long as that's the case I would assume the view can be created in any catalog.
>> >>>>>
>> >>>>> I think your point is really valuable because the current specification can lead to some unintuitive behavior. For example, for the following statement:
>> >>>>>
>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
>> >>>>>
>> >>>>> If the session default catalog is not "catalogA", the "sales.orders" in the view query would not be the same as just referencing "sales.orders" in a normal SQL statement. This is because without a "default-catalog", the catalog name of "sales.orders" would default to "catalogA".
>> >>>>>
>> >>>>> However, I like the current design of the view spec, because it has the "closure" property. Because the "view catalog" has to be known when executing a view, all the information required to resolve the table identifiers is contained in the view metadata (and the "view catalog"). I think that if you make the identifier resolution dependent on external parameters, it hinders portability.
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> Jan
>> >>>>>
>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
>> >>>>>
>> >>>>> Hi Jan,
>> >>>>>
>> >>>>> Thanks for the thoughtful feedback.
>> >>>>>
>> >>>>> I think it’s important we clarify a key point before going deeper:
>> >>>>>
>> >>>>> Non-determinism is not caused by session fallback behavior; it’s a fundamental limitation of using table identifiers alone, regardless of whether we use the current rule, the proposed fallback to the session’s default catalog, or even early vs. late binding.
>> >>>>>
>> >>>>> The same fully qualified identifier (e.g., catalogA.namespace.table) can resolve to different objects depending solely on engine-specific routing logic or catalog aliases. So determinism isn’t guaranteed just because an identifier is "fully qualified." The only reliable anchor for identity is the UUID. That’s why the proposed use of UUIDs is not just a hardening strategy. It’s the actual fix for correctness.
>> >>>>>
>> >>>>> To move the conversation forward, could you help clarify two things in the context of the current spec:
>> >>>>>
>> >>>>> * Where in the metadata is the “view catalog” stored, so that an engine knows to fall back to it if default-catalog is null?
>> >>>>>
>> >>>>> * Are we even allowed to create views in the session's default catalog (i.e., without specifying a catalog) in the current Iceberg spec?
>> >>>>>
>> >>>>> These questions are important because if we can’t unambiguously recover the "view catalog" from metadata, then defaulting to it is problematic. And if views can't be created in the default catalog, then the fallback rule doesn’t generalize.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Walaa.
>> >>>>>
>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>> >>>>>>
>> >>>>>> Hi Walaa,
>> >>>>>>
>> >>>>>> Thank you for your proposal. If I understood correctly, your proposal is composed of three parts:
>> >>>>>>
>> >>>>>> - session default catalog as fallback for "default-catalog"
>> >>>>>>
>> >>>>>> - session default namespace as fallback for "default-namespace"
>> >>>>>>
>> >>>>>> - Late binding + UUID validation
>> >>>>>>
>> >>>>>> I have some comments regarding these points.
>> >>>>>>
>> >>>>>> 1. Session default catalog as fallback for "default-catalog"
>> >>>>>>
>> >>>>>> Introducing a behavior that depends on the current session setup is in my opinion the definition of "non-determinism". You could be running the same query-engine and catalog-setup on different days, with different default session catalogs (which is rather common), and would be getting different results.
>> >>>>>>
>> >>>>>> Whereas with the current behavior, the view always produces the same results. The current behavior has some rough edges in very niche use cases, but I think it is solid for most use cases.
>> >>>>>>
>> >>>>>> 2. Session default namespace as fallback for "default-namespace"
>> >>>>>>
>> >>>>>> Similar to the above.
>> >>>>>>
>> >>>>>> 3. Late binding + UUID validation
>> >>>>>>
>> >>>>>> If I understand it correctly, the current implementation already uses late binding.
>> >>>>>>
>> >>>>>> Generally, having UUID validation makes the setup more robust, which is great. However, having UUID validation still requires us to have a portable table identifier specification. Even if we have the UUIDs of the referenced tables from the view, there simply isn't an interface that lets us use those UUIDs. The catalog interface is defined in terms of table identifiers.
>> >>>>>>
>> >>>>>> So we always require a working catalog setup and suitable table identifiers to obtain the table metadata. We can use the UUIDs to verify that we loaded the correct table, but this can only be done after we have used some identifier. This means there is no way of using UUIDs without a functioning catalog/identifier setup.
>> >>>>>>
>> >>>>>> In conclusion, I prefer the current behavior for "default-catalog" because it is more deterministic in my opinion. And I think the current spec does a good job for multi-engine table identifier resolution. I see the UUID validation more as an additional hardening strategy.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>>
>> >>>>>> Jan
>> >>>>>>
>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
>> >>>>>>
>> >>>>>> Thanks Renjie!
>> >>>>>>
>> >>>>>> The existing spec has some guidance on resolving catalogs on the fly already (to address the case of view text with table identifiers missing the catalog part). The guidance is to use the catalog where the view is stored. But I find this rule hard to interpret or use. The catalog itself is a logical construct, such as a federated catalog that delegates to multiple physical backends (e.g., HMS and REST). In such cases, the catalog (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t physically store the tables; it only routes requests to underlying stores. Therefore, defaulting identifier resolution based on the catalog where the view is "stored" doesn’t align with how catalogs actually behave in practice.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Walaa.
>> >>>>>>
>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Hi, Walaa:
>> >>>>>>>
>> >>>>>>> Thanks for the proposal.
>> >>>>>>>
>> >>>>>>> I've reviewed the doc, but in general I have some concerns with resolving catalog names on the fly with query engine defined catalog names. This introduces some flexibility at first glance, but also makes misconfiguration difficult to explain.
>> >>>>>>>
>> >>>>>>> But I agree with one part: we should store the resolved table UUID in view metadata, as table/view renaming may introduce errors that are difficult for users to understand.
>> >>>>>>>
>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Everyone,
>> >>>>>>>>
>> >>>>>>>> Looking forward to keeping up the momentum and closing out the MV spec as well. I’m hoping we can proceed to a vote next week.
>> >>>>>>>>
>> >>>>>>>> Here is a summary in case that helps. The proposal outlines a strategy for handling table identifiers in Iceberg view metadata, with the goal of ensuring correctness, portability, and engine compatibility. It recommends resolving table identifiers at read time (late binding) rather than creation time, and introduces UUID-based validation to maintain identity guarantees across engines or sessions. It also revises how default-catalog and default-namespace are handled (defaulting both to the session context if not explicitly set) to better align with engine behavior and improve cross-engine interoperability.
>> >>>>>>>>
>> >>>>>>>> Please let me know your thoughts.
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Walaa.
>> >>>>>>>>
>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
>> >>>>>>>>>
>> >>>>>>>>> One key point to keep in mind is that catalog names in the spec refer to logical catalogs, i.e., the first part of a three-part identifier. These correspond to Spark's DataSourceV2 catalogs, Trino connectors, and similar constructs. This is a level of abstraction above physical catalogs, which are not referenced or used in the view spec. The reason is that table identifiers in the view definition/text itself refer to logical catalogs, not physical ones (since they interface directly with the engine and not a specific metastore).
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Walaa.
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <sungwy...@gmail.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Thank you Walaa for the proposal. I think view portability is a very important topic for us to continue discussing, as it relies on many assumptions within the data ecosystem for it to function, as you've highlighted well in the document.
>> >>>>>>>>>>
>> >>>>>>>>>> I've added a few comments around how this may impact the permission questions the engines will be asking, and whether that is the desired behavior.
>> >>>>>>>>>>
>> >>>>>>>>>> Sung
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've added a few comments to get a better understanding of what this will look like in the actual implementation.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Eduard
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Hi Everyone,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Starting this thread to resume our discussion on how to reference table identifiers from Iceberg metadata, a key aspect of the view specification, particularly in relation to the MV (materialized view) extensions.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I had the chance to speak offline with a few community members to better understand how the current spec is being interpreted. Those conversations served as inputs to a new proposal on how table identifier references could be represented in metadata.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> You can find the proposal here [1]. I look forward to your feedback and working together to move this forward so we can finalize the MV spec as well.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> [1] https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>> Walaa.