Thanks for the contribution Benny! +1 to the confusion the fallback creates. Also just to be clear, at this point and after clarifying the current spec intentions, I am convinced that we should remove the default catalog and default namespace fields altogether.
Thanks,
Walaa.

On Fri, Apr 25, 2025 at 5:13 PM Benny Chow <btc...@gmail.com> wrote:

> I'd like to contribute my opinions on this:
>
> - I don't particularly like the current behavior of "default to the view's catalog when default-catalog is not set". Fundamentally, I believe the intent of default-catalog and default-namespace is there to help users write more concise SQL.
> - The Spark session catalog is engine specific and I don't think we should design something that says first use this catalog, then that catalog, or that catalog. For example, resolving identifiers using default-catalog -> view's catalog -> session catalog is not good.
> - We gotta support non-Iceberg tables, otherwise I see no value in putting views in the catalog to share with other engines.
> - Interoperability between different engine types is very hard due to dialect issues, so I think we should focus on supporting different clusters of the same engine type on a shared catalog. For example, AI and BI clusters on Spark sharing the same views in a REST catalog.
>
> Coincidentally, I think the ultimate solution is along the lines of something Russell proposed last year:
>
> https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7
>
> We've been looking at this interoperable identifier problem through the lens of catalog resolution but maybe the right approach is really about templating.
>
> I would extend Russell's idea to allow identifiers in a view to span catalogs to support non-Iceberg tables. Also, the default-catalog property could be templated as well.
>
> Thoughts?
> Benny
>
> On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>
>> Thanks Steven! How do you recommend making the Spark implementation conform to the spec? Do we need Spark SQL extensions and/or Spark catalog APIs for that?
>>
>> How do you recommend reconciling the inconsistencies I shared regarding many resolution methods not consistently being followed in different scenarios (view vs child table resolution, query vs view resolution)? Note these occur when the default catalog is set to a non-null value. If it helps, I can share concrete examples.
>>
>> Thanks,
>> Walaa.
>>
>> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> The core issue is the fallback behavior when `default-catalog` is not defined. The current view spec says the fallback should be the catalog where the view is defined. It doesn't really matter what the catalog is named (catalogX) by the read engine.
>>> - If a view refers to the tables in the same catalog, this is a non-ambiguous and reasonable fallback behavior.
>>> - If a view refers to tables from another catalog, catalog names should be included in the reference name already. So no ambiguity there either.
>>>
>>> Potential inconsistent naming of catalogs is a separate problem, which the Iceberg view spec probably cannot solve. We can only recommend that catalogs should be named consistently across usage for better interoperability on name references.
>>>
>>> This proposal is to change the fallback behavior to the engine's session default catalog. I am not sure it is better than the current fallback behavior.
>>>
>>> > Today’s Spark behavior explicitly differs from this idea. Spark resolves table identifiers during view creation using the session’s default catalog, not a supplied `default-catalog`.
>>>
>>> I would argue that is a Spark implementation issue for not conforming to the spec.
>>>
>>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>> >
>>> > Hi Jan,
>>> >
>>> > Thanks again for continuing the discussion.
>>> > I want to highlight a few fundamental issues around the interpretation of default-catalog:
>>> >
>>> > Here is the real catch:
>>> >
>>> > * default-catalog cannot logically be defined at view creation time. It would be circular: the view needs to exist before its metadata (and hence default-catalog) can exist. This is visible in Spark’s implementation, where `default-catalog` is not used.
>>> >
>>> > * Introducing a creation-time default-catalog setting would require extending SQL syntax and engine APIs to promote it to a first-class view concept. This would be intrusive, non-intuitive, and realistically very difficult to standardize across engines.
>>> >
>>> > * Today’s Spark behavior explicitly differs from this idea. Spark resolves table identifiers during view creation using the session’s default catalog, not a supplied `default-catalog`.
>>> >
>>> > * Hypothetically, even if we patched in a creation-time default-catalog, it would create an inconsistent binding model between tables vs views (early vs late), and between tables in views and in queries (again early vs late). For example, views and tables in queries can withstand default catalog renames, but tables cannot when they are used inside views -- it even applies to views inside views, which makes this very hard to reason about considering nesting.
>>> >
>>> > Thanks,
>>> > Walaa
>>> >
>>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>> >>
>>> >> @Walaa:
>>> >>
>>> >> I would argue that when you run a CREATE VIEW statement the query engine knows which catalog the view is being created in. So even though we typically use late binding to resolve the view catalog at query time, it can also be used at creation time.
>>> >>
>>> >> The query engine would need to keep track of the "view catalog" in which the view is going to be created.
>>> >> It can use that catalog to resolve partial table identifiers if "default-catalog" is not set.
>>> >>
>>> >> It can lead to some unintuitive behavior, where partial identifiers in the view query resolve to a different catalog compared to using them outside of a view.
>>> >>
>>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
>>> >>
>>> >> If the session default catalog is not "catalogA", the "sales.orders" in the view query would not be the same as just referencing "sales.orders" in a normal SQL statement. This is because without a "default-catalog", the catalog name of "sales.orders" would default to "catalogA", which is the view's catalog.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Jan
>>> >>
>>> >> On 4/25/25 04:05, Manu Zhang wrote:
>>> >>>
>>> >>> For example, if we want to validate that the tables referenced in the view exist, how can we do that when default-catalog isn't defined, since the view hasn't been created or loaded yet?
>>> >>
>>> >> I don't think this is related to view spec. How do we validate that a table exists without a default catalog, or do we always use the current session catalog?
>>> >>
>>> >> Thanks,
>>> >> Manu
>>> >>
>>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>> >>>
>>> >>> Hi Jan,
>>> >>>
>>> >>> I think we still share the same understanding. Just to clarify: when I referred to late binding as “similar” to the proposal, I was acknowledging the distinction between view-level and table-level resolution. But as you noted, both follow a late binding model.
>>> >>>
>>> >>> That said, this still raises an interesting question and a potential gap: if default-catalog is only defined at query time, how should resolution work during view creation?
>>> >>> For example, if we want to validate that the tables referenced in the view exist, how can we do that when default-catalog isn't defined, since the view hasn't been created or loaded yet?
>>> >>>
>>> >>> Thanks,
>>> >>> Walaa.
>>> >>>
>>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>> >>>>
>>> >>>> Yes, I have the same understanding. The view catalog is resolved at query time.
>>> >>>>
>>> >>>> As you mentioned before, it's good to distinguish between the physical catalog and its reference used in SQL statements. The important part is that the physical catalog of the view and the tables referenced in its definition stay consistent. You could create a view in a given physical catalog by referring to it as "catalogA", as in your first point. If you then, given a different setup, refer to the same physical catalog as "catalogB" in another session/environment, the behavior should still work.
>>> >>>>
>>> >>>> I would however rephrase your last point. Late binding applies to the view catalog name and by extension to all partial table references when no "default-catalog" is present. Resolving the view catalog name at query time is not opposed to storing the view metadata in a catalog.
>>> >>>>
>>> >>>> Or maybe I don't entirely understand what you mean.
>>> >>>>
>>> >>>> Thanks
>>> >>>>
>>> >>>> Jan
>>> >>>>
>>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
>>> >>>>
>>> >>>> Hi Jan,
>>> >>>>
>>> >>>> > The view is executed when it's being referenced in a SQL statement. That statement contains the information for the query engine to resolve the catalog of the view.
>>> >>>>
>>> >>>> If I’m understanding correctly, that means:
>>> >>>>
>>> >>>> * If the view is queried as SELECT * FROM catalogA.namespace.view, then catalogA is considered the view’s catalog.
>>> >>>>
>>> >>>> * If the same view is later queried as SELECT * FROM catalogB.namespace.view (after renaming catalogA to catalogB, and keeping everything else the same), then catalogB becomes the view’s catalog.
>>> >>>>
>>> >>>> Is that interpretation correct? If so, it sounds to me like the catalog is resolved at query time, based on how the view is referenced, not from any stored metadata. That would imply some sort of a late binding behavior (similar to the proposal), as opposed to using some catalog that "stores" the view definition.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Walaa
>>> >>>>
>>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>> >>>>>
>>> >>>>> Hi Walaa,
>>> >>>>>
>>> >>>>> Thanks for clarifying the aspects of non-determinism. Let me try to address your questions.
>>> >>>>>
>>> >>>>> 1. This is my interpretation of the current spec: The view is executed when it's being referenced in a SQL statement. That statement contains the information for the query engine to resolve the catalog of the view. The query engine then uses that information to fetch the view metadata from the catalog. It also needs to temporarily keep track of which catalog it used to fetch the view metadata. It can then use that information to resolve the table references in the view's SQL definition in case no default catalog is specified.
>>> >>>>>
>>> >>>>> 2. The important part is that the catalog can be referenced at execution time. As long as that's the case I would assume the view can be created in any catalog.
>>> >>>>>
>>> >>>>> I think your point is really valuable because the current specification can lead to some unintuitive behavior.
>>> >>>>> For example, for the following statement:
>>> >>>>>
>>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
>>> >>>>>
>>> >>>>> If the session default catalog is not "catalogA", the "sales.orders" in the view query would not be the same as just referencing "sales.orders" in a normal SQL statement. This is because without a "default-catalog", the catalog name of "sales.orders" would default to "catalogA".
>>> >>>>>
>>> >>>>> However, I like the current design of the view spec, because it has the "closure" property. Because of the fact that the "view catalog" has to be known when executing a view, all the information required to resolve the table identifiers is contained in the view metadata (and the "view catalog"). I think that if you make the identifier resolution dependent on external parameters, it hinders portability.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>>
>>> >>>>> Jan
>>> >>>>>
>>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
>>> >>>>>
>>> >>>>> Hi Jan,
>>> >>>>>
>>> >>>>> Thanks for the thoughtful feedback.
>>> >>>>>
>>> >>>>> I think it’s important we clarify a key point before going deeper:
>>> >>>>>
>>> >>>>> Non-determinism is not caused by session fallback behavior; it’s a fundamental limitation of using table identifiers alone, regardless of whether we use the current rule, the proposed fallback to the session’s default catalog, or even early vs. late binding.
>>> >>>>>
>>> >>>>> The same fully qualified identifier (e.g., catalogA.namespace.table) can resolve to different objects depending solely on engine-specific routing logic or catalog aliases. So determinism isn’t guaranteed just because an identifier is "fully qualified." The only reliable anchor for identity is the UUID. That’s why the proposed use of UUIDs is not just a hardening strategy. It’s the actual fix for correctness.
>>> >>>>>
>>> >>>>> To move the conversation forward, could you help clarify two things in the context of the current spec:
>>> >>>>>
>>> >>>>> * Where in the metadata is the “view catalog” stored, so that an engine knows to fall back to it if default-catalog is null?
>>> >>>>>
>>> >>>>> * Are we even allowed to create views in the session's default catalog (i.e., without specifying a catalog) in the current Iceberg spec?
>>> >>>>>
>>> >>>>> These questions are important because if we can’t unambiguously recover the "view catalog" from metadata, then defaulting to it is problematic. And if views can't be created in the default catalog, then the fallback rule doesn’t generalize.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Walaa.
>>> >>>>>
>>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>> >>>>>>
>>> >>>>>> Hi Walaa,
>>> >>>>>>
>>> >>>>>> Thank you for your proposal. If I understood correctly, your proposal is composed of three parts:
>>> >>>>>>
>>> >>>>>> - Session default catalog as fallback for "default-catalog"
>>> >>>>>>
>>> >>>>>> - Session default namespace as fallback for "default-namespace"
>>> >>>>>>
>>> >>>>>> - Late binding + UUID validation
>>> >>>>>>
>>> >>>>>> I have some comments regarding these points.
>>> >>>>>>
>>> >>>>>> 1. Session default catalog as fallback for "default-catalog"
>>> >>>>>>
>>> >>>>>> Introducing a behavior that depends on the current session setup is in my opinion the definition of "non-determinism". You could be running the same query-engine and catalog-setup on different days, with different default session catalogs (which is rather common), and would be getting different results.
>>> >>>>>>
>>> >>>>>> Whereas with the current behavior, the view always produces the same results. The current behavior has some rough edges in very niche use cases but I think is solid for most use cases.
>>> >>>>>>
>>> >>>>>> 2. Session default namespace as fallback for "default-namespace"
>>> >>>>>>
>>> >>>>>> Similar to the above.
>>> >>>>>>
>>> >>>>>> 3. Late binding + UUID validation
>>> >>>>>>
>>> >>>>>> If I understand it correctly, the current implementation already uses late binding.
>>> >>>>>>
>>> >>>>>> Generally, having UUID validation makes the setup more robust, which is great. However, having UUID validation still requires us to have a portable table identifier specification. Even if we have the UUIDs of the referenced tables from the view, there simply isn't an interface that lets us use those UUIDs. The catalog interface is defined in terms of table identifiers.
>>> >>>>>>
>>> >>>>>> So we always require a working catalog setup and suitable table identifiers to obtain the table metadata. We can use the UUIDs to verify if we loaded the correct table. But this can only be done after we used some identifier, which means there is no way of using UUIDs without a functioning catalog/identifier setup.
>>> >>>>>>
>>> >>>>>> In conclusion, I prefer the current behavior for "default-catalog" because it is more deterministic in my opinion. And I think the current spec does a good job for multi-engine table identifier resolution. I see the UUID validation more as an additional hardening strategy.
>>> >>>>>>
>>> >>>>>> Thanks
>>> >>>>>>
>>> >>>>>> Jan
>>> >>>>>>
>>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
>>> >>>>>>
>>> >>>>>> Thanks Renjie!
>>> >>>>>>
>>> >>>>>> The existing spec has some guidance on resolving catalogs on the fly already (to address the case of view text with table identifiers missing the catalog part). The guidance is to use the catalog where the view is stored. But I find this rule hard to interpret or use. The catalog itself is a logical construct, such as a federated catalog that delegates to multiple physical backends (e.g., HMS and REST).
>>> >>>>>> In such cases, the catalog (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t physically store the tables; it only routes requests to underlying stores. Therefore, defaulting identifier resolution based on the catalog where the view is "stored" doesn’t align with how catalogs actually behave in practice.
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> Walaa.
>>> >>>>>>
>>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi, Walaa:
>>> >>>>>>>
>>> >>>>>>> Thanks for the proposal.
>>> >>>>>>>
>>> >>>>>>> I've reviewed the doc, but in general I have some concerns with resolving catalog names on the fly with query engine defined catalog names. This introduces some flexibility at first glance, but also makes misconfiguration difficult to explain.
>>> >>>>>>>
>>> >>>>>>> But I agree with one part: we should store the resolved table UUID in view metadata, as table/view renaming may introduce errors that are difficult for users to understand.
>>> >>>>>>>
>>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Hi Everyone,
>>> >>>>>>>>
>>> >>>>>>>> Looking forward to keeping up the momentum and closing out the MV spec as well. I’m hoping we can proceed to a vote next week.
>>> >>>>>>>>
>>> >>>>>>>> Here is a summary in case that helps. The proposal outlines a strategy for handling table identifiers in Iceberg view metadata, with the goal of ensuring correctness, portability, and engine compatibility. It recommends resolving table identifiers at read time (late binding) rather than creation time, and introduces UUID-based validation to maintain identity guarantees across engines or sessions.
>>> >>>>>>>> It also revises how default-catalog and default-namespace are handled (defaulting both to the session context if not explicitly set) to better align with engine behavior and improve cross-engine interoperability.
>>> >>>>>>>>
>>> >>>>>>>> Please let me know your thoughts.
>>> >>>>>>>>
>>> >>>>>>>> Thanks,
>>> >>>>>>>> Walaa.
>>> >>>>>>>>
>>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
>>> >>>>>>>>>
>>> >>>>>>>>> One key point to keep in mind is that catalog names in the spec refer to logical catalogs, i.e., the first part of a three-part identifier. These correspond to Spark's DataSourceV2 catalogs, Trino connectors, and similar constructs. This is a level of abstraction above physical catalogs, which are not referenced or used in the view spec. The reason is that table identifiers in the view definition/text itself refer to logical catalogs, not physical ones (since they interface directly with the engine and not a specific metastore).
>>> >>>>>>>>>
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> Walaa.
>>> >>>>>>>>>
>>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <sungwy...@gmail.com> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Thank you Walaa for the proposal. I think view portability is a very important topic for us to continue discussing as it relies on many assumptions within the data ecosystem for it to function like you've highlighted well in the document.
>>> >>>>>>>>>>
>>> >>>>>>>>>> I've added a few comments around how this may impact the permission questions the engines will be asking, and whether that is the desired behavior.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Sung
>>> >>>>>>>>>>
>>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've added a few comments to get a better understanding of what this will look like in the actual implementation.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Eduard
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Hi Everyone,
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Starting this thread to resume our discussion on how to reference table identifiers from Iceberg metadata, a key aspect of the view specification, particularly in relation to the MV (materialized view) extensions.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> I had the chance to speak offline with a few community members to better understand how the current spec is being interpreted. Those conversations served as inputs to a new proposal on how table identifier references could be represented in metadata.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> You can find the proposal here [1]. I look forward to your feedback and working together to move this forward so we can finalize the MV spec as well.
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> [1] https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>> Walaa.
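
For readers following the thread, the two catalog fallback rules under discussion (the current spec's fallback to the view's catalog, and the proposed fallback to the session's default catalog) can be sketched roughly as below. This is only an illustration under simplifying assumptions, not Iceberg's or any engine's actual API: `ViewMetadata`, `resolve_identifier`, and the `fallback` parameter are hypothetical names, namespaces are assumed to be single-level so that a three-part name is always catalog.namespace.table, only the catalog fallback is modeled (not the namespace one), and `catalogB` is an invented session catalog name.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ViewMetadata:
    """Hypothetical, simplified stand-in for the two view fields discussed."""
    default_catalog: Optional[str] = None    # "default-catalog"; None means not set
    default_namespace: Tuple[str, ...] = ()  # "default-namespace"


def resolve_identifier(identifier: str,
                       view: ViewMetadata,
                       view_catalog: str,
                       session_catalog: str,
                       fallback: str = "view-catalog") -> Tuple[str, ...]:
    """Expand an identifier from the view's SQL text to (catalog, namespace, table).

    Assumption for this sketch: namespaces are single-level, so 3 parts means
    catalog.namespace.table, 2 parts namespace.table, and 1 part a bare table.
    """
    parts = tuple(identifier.split("."))
    if len(parts) == 3:
        return parts  # fully qualified: no fallback is consulted
    if len(parts) == 1:
        parts = view.default_namespace + parts  # fill in the namespace first
    if len(parts) != 2:
        raise ValueError(f"cannot resolve identifier: {identifier!r}")

    if view.default_catalog is not None:
        catalog = view.default_catalog      # an explicit default-catalog wins
    elif fallback == "view-catalog":
        catalog = view_catalog              # current spec: the view's own catalog
    elif fallback == "session":
        catalog = session_catalog           # proposal: the session's default catalog
    else:
        raise ValueError(f"unknown fallback rule: {fallback!r}")
    return (catalog,) + parts


# Jan's example: CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
# queried from a session whose default catalog is (hypothetically) catalogB.
view = ViewMetadata(default_catalog=None, default_namespace=("sales",))
current = resolve_identifier("sales.orders", view, "catalogA", "catalogB")
proposed = resolve_identifier("sales.orders", view, "catalogA", "catalogB",
                              fallback="session")
print(current)   # ('catalogA', 'sales', 'orders')
print(proposed)  # ('catalogB', 'sales', 'orders')
```

In this sketch the view-catalog rule depends only on how the view itself was referenced, while the session rule depends on session state; that is the trade-off between "closure" and alignment with engine behavior argued over in the messages above. Neither rule says anything about which physical table the resulting name routes to, which is the gap the UUID validation is meant to cover.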