> If the current model is considered deterministic, do you think `default-catalog` and `default-namespace` fields provide enough determinism to eliminate the need for UUIDs when storing table identifiers?
I am fine with storing UUIDs for table identifiers in the view. Basically, view creation resolves all referenced tables/views and records their UUIDs. View consumers can then validate the resolved tables/views against the stored UUIDs and fail the query on a mismatch. The UUID change doesn't really change the table identifier resolution rule, though; it is more of a safety check.

On Wed, May 7, 2025 at 10:02 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > Hi Steven, > > Thanks for the reply. > > > I agree with Dan that we shouldn't solve catalog naming in the Iceberg > view spec. > > To clarify, I don't believe the proposal is trying to solve catalog > naming. What it’s doing is simply this: > > * Proposing that table names inside views resolve the same way as they do > elsewhere (e.g., queries). > * Adopting a model that is already widely used and supported in the > existing ecosystem, which allows for: > -- Renaming catalog aliases > -- Swapping catalog implementations behind consistent names > -- Having different default catalog names across engines that still > point to the same underlying tables > > These are common patterns in production data lakes. Saying Iceberg views > cannot operate in those environments feels unrealistic. In practice, it > means the spec breaks down in situations that users encounter regularly. > > > The recommendation of using engines’ current catalog and database can > cause context-dependent resolution results. > > * As noted in the doc and earlier replies, fixing a catalog name doesn’t > actually guarantee determinism either. All the failure scenarios above > still apply even when a default-catalog is stored. > * The current spec also allows default-catalog to be null, in which case > it falls back to the view’s catalog, yet that catalog is determined based > on how the view is referenced in the query, which would be considered > non-deterministic based on the same criteria you shared.
> * The only true form of determinism here is UUID-based validation, which > protects against silent drift in any resolution model. > > If the current model is considered deterministic, do you think > `default-catalog` and `default-namespace` fields provide enough determinism > to eliminate the need for UUIDs when storing table identifiers? > Or put another way: Would you be comfortable relying solely on > default-catalog + default-namespace + table name to re-identify the correct > table, without UUID validation? > > +1 on involving other communities. I’m happy to help facilitate a > cross-community discussion if we aren’t able to reach a resolution here. > > Thanks, > Walaa. > > > > On Wed, May 7, 2025 at 9:20 PM Steven Wu <stevenz...@gmail.com> wrote: > >> I agree with Dan that we shouldn't solve catalog naming in the Iceberg >> view spec. I am not convinced that the proposed change will make the table >> identifier resolution more clear and portable. The recommendation of using >> engines' current catalog and database can cause context dependent >> resolution results, which seems non-deterministic to me. >> >> Walaa, you raised a point in the doc that the current catalog resolution >> logic (default-catalog field, then view catalog) is challenging and >> unrealistic for engines (like Spark and Trino). It will be great to get >> more inputs from the broader community on this part. >> >> >> On Tue, May 6, 2025 at 9:21 AM Benny Chow <btc...@gmail.com> wrote: >> >>> In Spark, I believe that the USE commands sets the current catalog and >>> namespace. This affects both where the view is created and how unqualified >>> table identifiers are resolved. I also don't see an issue with saving the >>> current catalog and namespace into the view metadata's default-catalog and >>> default-namespace fields. 
>>> >>> On Wed, Apr 30, 2025 at 5:12 PM Walaa Eldin Moustafa < >>> wa.moust...@gmail.com> wrote: >>> >>>> > I think that's the lesser evil compared to Iceberg specifying how >>>> engines should resolve identifiers >>>> >>>> I think this is also similar to the previous point. It is the other way >>>> around. Right now the spec dictates how to resolve (through employing a >>>> view-specific `default-catalog` field). The proposal is suggesting to get >>>> out of this space and let engines handle it similar to how they handle all >>>> identifiers. >>>> >>>> On Wed, Apr 30, 2025 at 5:07 PM Walaa Eldin Moustafa < >>>> wa.moust...@gmail.com> wrote: >>>> >>>>> > I thought "default-catalog" could be set via the USE command. >>>>> >>>>> Benny, I think this is a misconception or miscommunication. The USE >>>>> command has no impact on the `default-catalog` field. In fact, the >>>>> proposal's direction is exactly to establish that USE command should >>>>> influence how tables are resolved, same like everywhere else. Right now it >>>>> is not the case under the current spec. >>>>> >>>>> >>>>> On Wed, Apr 30, 2025 at 3:17 PM Benny Chow <btc...@gmail.com> wrote: >>>>> >>>>>> > there is no SQL construct today to explicitly set default-catalog >>>>>> >>>>>> I thought "default-catalog" could be set via the USE command. >>>>>> >>>>>> I generally agree with Dan about requiring consistent catalog names. >>>>>> I think that's the lesser evil compared to Iceberg specifying how engines >>>>>> should resolve identifiers. Another thing to consider is that identifier >>>>>> resolution can be very expensive at query validation time if identifiers >>>>>> need to be looked up from a bunch of places. Hopefully, it should be >>>>>> possible to define a view in such a way that identifiers can be resolved >>>>>> on >>>>>> the first try. 
>>>>>> >>>>>> Benny >>>>>> >>>>>> On Tue, Apr 29, 2025 at 10:29 PM Walaa Eldin Moustafa < >>>>>> wa.moust...@gmail.com> wrote: >>>>>> >>>>>>> Hi Rishabh, >>>>>>> >>>>>>> You're right that the proposal touches on two aspects, and >>>>>>> resolution rules are one of them. The other aspect is the proposal's >>>>>>> position that table identifiers should be stored in metadata exactly as >>>>>>> they appear in the view text (e.g., even if they're two-part or >>>>>>> partially >>>>>>> qualified), along with their corresponding UUIDs for validation. This >>>>>>> applies both to referenced input tables and the storage table >>>>>>> identifier in >>>>>>> materialized views. >>>>>>> >>>>>>> We may be able to converge on this storage format even if we haven't >>>>>>> yet converged on the resolution fallback rules. I believe both >>>>>>> resolution >>>>>>> strategies currently being discussed would still lead to storing >>>>>>> identifiers in this way. >>>>>>> >>>>>>> I'm supportive of moving forward with consensus on the identifier >>>>>>> storage format. That said, we may continue to run into questions >>>>>>> related to >>>>>>> resolution during implementation. For example: Should the storage table >>>>>>> identifier follow the same default-catalog and default-namespace >>>>>>> resolution >>>>>>> behavior as other table references? >>>>>>> >>>>>>> Thanks, >>>>>>> Walaa. >>>>>>> >>>>>>> On Tue, Apr 29, 2025 at 10:07 PM Rishabh Bhatia < >>>>>>> bhatiarishab...@gmail.com> wrote: >>>>>>> >>>>>>>> Hello Walaa, >>>>>>>> >>>>>>>> Thanks for starting this discussion. >>>>>>>> >>>>>>>> I think we should decouple at least the MV Spec from the proposal >>>>>>>> to change the current behavior of view resolution. >>>>>>>> >>>>>>>> We can continue having the discussion if the current view spec >>>>>>>> needs to be changed or not. Based on the decision at a later point if >>>>>>>> required we can update the view resolution rule. 
>>>>>>>> >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Rishabh >>>>>>>> >>>>>>>> On Mon, Apr 28, 2025 at 3:22 PM Walaa Eldin Moustafa < >>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Correction of typo: both engines seem to set default-catalog to >>>>>>>>> the view catalog if it is defined, or to null if the view catalog is >>>>>>>>> not >>>>>>>>> defined. >>>>>>>>> >>>>>>>>> On Mon, Apr 28, 2025 at 3:06 PM Walaa Eldin Moustafa < >>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Dan, >>>>>>>>>> >>>>>>>>>> Thanks again for your response. >>>>>>>>>> >>>>>>>>>> I agree that catalog renaming is an environmental event, but it's >>>>>>>>>> a real one that happens frequently in practice. >>>>>>>>>> Saying that the Iceberg spec cannot accommodate something as >>>>>>>>>> common as catalog renaming feels very restrictive, and could make >>>>>>>>>> the spec >>>>>>>>>> less practical, even unusable, for real-world deployments. >>>>>>>>>> I’m sharing this from the perspective of a large data lake >>>>>>>>>> environment where views are heavily deployed and operationalized. >>>>>>>>>> >>>>>>>>>> Further, it's worth noting that the table spec is resilient to >>>>>>>>>> catalog renaming, but the view spec is not. If we have an >>>>>>>>>> opportunity to >>>>>>>>>> make the view spec similarly resilient, I wonder why not? >>>>>>>>>> Both specifications are deterministic in their definition, but >>>>>>>>>> one is more fragile to environmental changes than the other. >>>>>>>>>> Improving >>>>>>>>>> resilience does not sacrifice determinism. It simply makes views >>>>>>>>>> safer and >>>>>>>>>> more portable over time. >>>>>>>>>> >>>>>>>>>> Separately, given that there is no SQL construct today to >>>>>>>>>> explicitly set default-catalog at creation time, what is the >>>>>>>>>> intuition >>>>>>>>>> behind how engines like Spark and Trino currently assign >>>>>>>>>> default-catalog? 
>>>>>>>>>> Today, both engines seem to set default-catalog to null if the >>>>>>>>>> view catalog is defined, or to the view catalog if not. >>>>>>>>>> What was the intended thought process behind this behavior? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Walaa >>>>>>>>>> >>>>>>>>>> On Mon, Apr 28, 2025 at 1:33 PM Daniel Weeks <dwe...@apache.org> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Walaa, >>>>>>>>>>> >>>>>>>>>>> > tables inside views remain reachable after a catalog rename >>>>>>>>>>> >>>>>>>>>>> This problem stems from the exact environmental/configuration >>>>>>>>>>> issue that we should not be trying to address. I don't think we >>>>>>>>>>> would >>>>>>>>>>> expect references to survive a catalog rename. That's not something >>>>>>>>>>> covered by the spec and needs to be handled separately as a >>>>>>>>>>> platform-level >>>>>>>>>>> migration specific to the affected environment. >>>>>>>>>>> >>>>>>>>>>> The identifier resolution logic is clear and deterministic. It >>>>>>>>>>> should not matter whether an engine resolves and encodes the >>>>>>>>>>> default-catalog or leaves it to the resolution rules. >>>>>>>>>>> >>>>>>>>>>> The issue isn't with how the spec is defined, but rather view >>>>>>>>>>> behavior when you start altering the environment around it, which >>>>>>>>>>> isn't >>>>>>>>>>> something we should be trying to define here. >>>>>>>>>>> >>>>>>>>>>> -Dan >>>>>>>>>>> >>>>>>>>>>> On Mon, Apr 28, 2025 at 12:17 PM Walaa Eldin Moustafa < >>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Dan, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for chiming in. >>>>>>>>>>>> >>>>>>>>>>>> I believe the issues we’re seeing now go beyond just catalog >>>>>>>>>>>> naming consistency. The behavior around default-catalog itself >>>>>>>>>>>> introduces >>>>>>>>>>>> resolution inconsistencies even when catalog names are consistent. 
>>>>>>>>>>>> For example: >>>>>>>>>>>> >>>>>>>>>>>> * When default-catalog is set to null, tables inside views >>>>>>>>>>>> remain reachable after a catalog rename. But if it is set to a >>>>>>>>>>>> non-null >>>>>>>>>>>> value, table references will break. >>>>>>>>>>>> >>>>>>>>>>>> * default-catalog causes table references inside views to be >>>>>>>>>>>> early bound (i.e., bound at view creation time, especially when >>>>>>>>>>>> using a >>>>>>>>>>>> non-null value), while table references inside standalone queries >>>>>>>>>>>> are late >>>>>>>>>>>> bound (bound at query time). This creates inconsistencies when >>>>>>>>>>>> resolving >>>>>>>>>>>> the same table name inside and outside views, even within the same >>>>>>>>>>>> job. >>>>>>>>>>>> >>>>>>>>>>>> * It causes Spark's and Trino behavior to drift from the spec. >>>>>>>>>>>> There is no way to fully align Spark's behavior without making >>>>>>>>>>>> invasive >>>>>>>>>>>> changes to the Spark SQL grammar and the View DataSource API >>>>>>>>>>>> (specifically >>>>>>>>>>>> on the CREATE side). This challenge would extend to other engines >>>>>>>>>>>> too. Both >>>>>>>>>>>> Spark and Trino set this field based on a heuristic in today's >>>>>>>>>>>> implementation. >>>>>>>>>>>> >>>>>>>>>>>> * With view nesting (views depending on views), these >>>>>>>>>>>> inconsistencies amplify further, forcing users and engines to >>>>>>>>>>>> reason about >>>>>>>>>>>> catalog resolution at every level in the view tree. >>>>>>>>>>>> >>>>>>>>>>>> * It will be difficult to migrate Hive views to Iceberg with >>>>>>>>>>>> that model. Migrated Hive views will have to unfollow that spec. >>>>>>>>>>>> >>>>>>>>>>>> How would you suggest approaching the engine-level changes >>>>>>>>>>>> required to support the current default-catalog field? 
>>>>>>>>>>>> Also, do you believe the Spark and Trino communities would >>>>>>>>>>>> align around having table resolution behave inconsistently between >>>>>>>>>>>> queries >>>>>>>>>>>> and views, or inconsistency between Iceberg and other types of >>>>>>>>>>>> views? >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Walaa >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Apr 28, 2025 at 11:34 AM Daniel Weeks < >>>>>>>>>>>> dwe...@apache.org> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I would agree with Jan's summary of why 'default-catalog' was >>>>>>>>>>>>> introduced, but I think we need to step back and align on what we >>>>>>>>>>>>> are >>>>>>>>>>>>> really attempting to support in the spec. >>>>>>>>>>>>> >>>>>>>>>>>>> The issues we're discussing largely stem from using multiple >>>>>>>>>>>>> engines with cross catalog references and configurations where >>>>>>>>>>>>> catalog >>>>>>>>>>>>> names are not aligned. If we have multiple engines that all have >>>>>>>>>>>>> the same >>>>>>>>>>>>> catalog names/configurations, the current spec implementation is >>>>>>>>>>>>> well >>>>>>>>>>>>> defined for table resolution even across catalogs. The >>>>>>>>>>>>> 'default-catalog' >>>>>>>>>>>>> (and namespace equivalent) was intended to address the resolution >>>>>>>>>>>>> within >>>>>>>>>>>>> the context of the sql text, not to address catalog/naming >>>>>>>>>>>>> inconsistencies. >>>>>>>>>>>>> >>>>>>>>>>>>> I feel like we're trying to adapt the original intent to >>>>>>>>>>>>> address the catalog naming/configuration and would argue that we >>>>>>>>>>>>> shouldn't >>>>>>>>>>>>> attempt to do that as part of the spec. Inconsistently named >>>>>>>>>>>>> catalogs are >>>>>>>>>>>>> a reality, but we should consider that a >>>>>>>>>>>>> configuration/environmental issue, >>>>>>>>>>>>> not something to solve for in the spec. >>>>>>>>>>>>> >>>>>>>>>>>>> We should support and advocate for consistency in catalog >>>>>>>>>>>>> naming and define the spec along those lines. 
The fact is that >>>>>>>>>>>>> with all of >>>>>>>>>>>>> the recent work that's gone into making catalogs pluggable, it >>>>>>>>>>>>> makes more >>>>>>>>>>>>> sense to just register catalog configuration with consistent >>>>>>>>>>>>> names (even if >>>>>>>>>>>>> you have to duplicate the configuration for supporting existing >>>>>>>>>>>>> readers/writers). I think it's better to provide a path toward >>>>>>>>>>>>> consistency >>>>>>>>>>>>> than to normalize complicated schemes to workaround the issues >>>>>>>>>>>>> caused by >>>>>>>>>>>>> environmental/configuration inconsistencies. >>>>>>>>>>>>> >>>>>>>>>>>>> If the goal is to create clever ways to hack the late binding >>>>>>>>>>>>> resolution to swap in different catalogs or make references >>>>>>>>>>>>> contextual, I >>>>>>>>>>>>> feel like that is something we should strongly discourage as it >>>>>>>>>>>>> leads to >>>>>>>>>>>>> confusion about what is resolved as part of the query. >>>>>>>>>>>>> >>>>>>>>>>>>> At this point, I don't see a good argument to add >>>>>>>>>>>>> additional configuration or change the resolution behaviors. >>>>>>>>>>>>> >>>>>>>>>>>>> -Dan >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Apr 28, 2025 at 12:40 AM Jan Kaul >>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I think the intention with the "default-catalog" was that >>>>>>>>>>>>>> every query engine uses it to store its session default catalog >>>>>>>>>>>>>> at the time >>>>>>>>>>>>>> of creating the view. This way the view could be reused in >>>>>>>>>>>>>> another session. >>>>>>>>>>>>>> The idea was not to introduce an additional SQL syntax to set the >>>>>>>>>>>>>> default-catalog. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Generally we have different environments we want to support >>>>>>>>>>>>>> with the view spec: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1. 
Consistent catalog naming >>>>>>>>>>>>>> >>>>>>>>>>>>>> When the environment supports it, using consistent catalog >>>>>>>>>>>>>> names can have a great benefit for multi-catalog, multi-engine >>>>>>>>>>>>>> setups. With >>>>>>>>>>>>>> consistent catalog names, using the "default-catalog" field >>>>>>>>>>>>>> works without >>>>>>>>>>>>>> any issues. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2. Inconsistent catalog naming >>>>>>>>>>>>>> >>>>>>>>>>>>>> This can be the case when different query engines refer to >>>>>>>>>>>>>> the same physical catalog by different names. This often happens >>>>>>>>>>>>>> because >>>>>>>>>>>>>> different query engines use different strategies to setup the >>>>>>>>>>>>>> catalogs. If >>>>>>>>>>>>>> catalogs have inconsistent naming, using the "default-catalog" >>>>>>>>>>>>>> field does >>>>>>>>>>>>>> not work because it is not guaranteed that the catalog name can >>>>>>>>>>>>>> be resolved >>>>>>>>>>>>>> with another engine. Using the "view catalog" as a fallback is a >>>>>>>>>>>>>> better >>>>>>>>>>>>>> solution for this use case, as it avoids catalog names >>>>>>>>>>>>>> altogether. It is >>>>>>>>>>>>>> however limited to table references in the same catalog. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> What do you think of introducing a view property that >>>>>>>>>>>>>> specifies if the "default-catalog" or the "view catalog" should >>>>>>>>>>>>>> be used? >>>>>>>>>>>>>> This way, you could use the "default-catalog" in environments >>>>>>>>>>>>>> where you can >>>>>>>>>>>>>> guarantee consistent naming, but you would be able to directly >>>>>>>>>>>>>> fallback to >>>>>>>>>>>>>> the "view-catalog" when you don't have consistent naming. The >>>>>>>>>>>>>> query engines >>>>>>>>>>>>>> could set the default for this view property at creation time. >>>>>>>>>>>>>> Spark for >>>>>>>>>>>>>> example could set it to automatically use the "view catalog". 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jan >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 4/26/25 05:33, Walaa Eldin Moustafa wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> To help folks catch up on the latest discussions and >>>>>>>>>>>>>> interpretation of the spec, I have summarized everything we >>>>>>>>>>>>>> discussed so >>>>>>>>>>>>>> far at the top of the proposal document (here >>>>>>>>>>>>>> <https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0>). >>>>>>>>>>>>>> I have slightly updated the proposal to be in sync with the new >>>>>>>>>>>>>> interpretation to avoid confusion. In summary: >>>>>>>>>>>>>> >>>>>>>>>>>>>> * Remove default-catalog and default-namespace fields from >>>>>>>>>>>>>> the view spec completely. >>>>>>>>>>>>>> >>>>>>>>>>>>>> * Hence, we do not attempt to define separate view-level >>>>>>>>>>>>>> default catalogs or namespaces. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Instead: >>>>>>>>>>>>>> >>>>>>>>>>>>>> * If a table identifier inside a view lacks a catalog >>>>>>>>>>>>>> qualifier, engines should resolve it using the current engine >>>>>>>>>>>>>> catalog at >>>>>>>>>>>>>> query time. >>>>>>>>>>>>>> >>>>>>>>>>>>>> * Reference table identifiers in the metadata exactly as they >>>>>>>>>>>>>> appear in the view SQL text. >>>>>>>>>>>>>> >>>>>>>>>>>>>> * If an identifier lacks the catalog part at creation, it >>>>>>>>>>>>>> should still lack a catalog in the stored metadata. >>>>>>>>>>>>>> >>>>>>>>>>>>>> * Store UUIDs alongside table identifiers whenever possible. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 5:18 PM Walaa Eldin Moustafa < >>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks for the contribution Benny! +1 to the confusion the >>>>>>>>>>>>>>> fallback creates. 
Also just to be clear, at this point and >>>>>>>>>>>>>>> after clarifying >>>>>>>>>>>>>>> the current spec intentions, I am convinced that we should >>>>>>>>>>>>>>> remove the >>>>>>>>>>>>>>> default catalog and default namespace fields altogether. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> Walaa. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 5:13 PM Benny Chow <btc...@gmail.com> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'd like to contribute my opinions on this: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - I don't particularly like the current behavior of >>>>>>>>>>>>>>>> "default to the view's catalog when default-catalog is not >>>>>>>>>>>>>>>> set". >>>>>>>>>>>>>>>> Fundamentally, I believe the intent of default-catalog and >>>>>>>>>>>>>>>> default-namespace is there to help users write more concise >>>>>>>>>>>>>>>> SQL. >>>>>>>>>>>>>>>> - spark session catalog is engine specific and I don't >>>>>>>>>>>>>>>> think we should design something that says first use this >>>>>>>>>>>>>>>> catalog, then >>>>>>>>>>>>>>>> that catalog.. or that catalog. For example, resolving >>>>>>>>>>>>>>>> identifiers using >>>>>>>>>>>>>>>> default-catalog -> view's catalog -> session catalog is not >>>>>>>>>>>>>>>> good. >>>>>>>>>>>>>>>> - We gotta support non-Iceberg tables otherwise I see no >>>>>>>>>>>>>>>> value in putting views in the catalog to share with other >>>>>>>>>>>>>>>> engines >>>>>>>>>>>>>>>> - Interoperability between different engine types is very >>>>>>>>>>>>>>>> hard due to dialect issues... so I think we should focus on >>>>>>>>>>>>>>>> supporting >>>>>>>>>>>>>>>> different clusters of the same engine type on a shared >>>>>>>>>>>>>>>> catalog. For >>>>>>>>>>>>>>>> example, AI and BI clusters on Spark sharing the same views in >>>>>>>>>>>>>>>> a REST >>>>>>>>>>>>>>>> catalog. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Coincidentally, I think the ultimate solution is along the >>>>>>>>>>>>>>>> lines of something Russell proposed last year: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We've been looking at this interoperable identifier problem >>>>>>>>>>>>>>>> through the lens of catalog resolution but maybe the right >>>>>>>>>>>>>>>> approach is >>>>>>>>>>>>>>>> really about templating. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I would extend Russell's idea to allow identifiers in a >>>>>>>>>>>>>>>> view to span catalogs to support non-Iceberg tables. Also, >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> default-catalog property could be templated as well. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thoughts? >>>>>>>>>>>>>>>> Benny >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa < >>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks Steven! How do you recommend making Spark >>>>>>>>>>>>>>>>> implementation conform to the spec? Do we need Spark SQL >>>>>>>>>>>>>>>>> extensions and/or >>>>>>>>>>>>>>>>> Spark catalog APIs for that? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> How do you recommend reconciling the inconsistencies I >>>>>>>>>>>>>>>>> shared regarding many resolution methods not consistently >>>>>>>>>>>>>>>>> being followed in >>>>>>>>>>>>>>>>> different scenarios (view vs child table resolution, query vs >>>>>>>>>>>>>>>>> view >>>>>>>>>>>>>>>>> resolution)? Note these occur when the default catalog is set >>>>>>>>>>>>>>>>> to a non-null >>>>>>>>>>>>>>>>> value. If it helps, I can share concrete examples. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>> Walaa. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 3:52 PM Steven Wu < >>>>>>>>>>>>>>>>> stevenz...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The core issue is on the fall back behavior when >>>>>>>>>>>>>>>>>> `default-catalog` is >>>>>>>>>>>>>>>>>> not defined. Current view spec says the fallback should >>>>>>>>>>>>>>>>>> be the catalog >>>>>>>>>>>>>>>>>> where the view is defined. It doesn't really matter what >>>>>>>>>>>>>>>>>> the catalog >>>>>>>>>>>>>>>>>> is named (catalogX) by the read engine. >>>>>>>>>>>>>>>>>> - If a view refers to the tables in the same catalog, >>>>>>>>>>>>>>>>>> this is a >>>>>>>>>>>>>>>>>> non-ambiguous and reasonable fallback behavior. >>>>>>>>>>>>>>>>>> - If a view refers to tables from another catalog, >>>>>>>>>>>>>>>>>> catalog names >>>>>>>>>>>>>>>>>> should be included in the reference name already. So no >>>>>>>>>>>>>>>>>> ambiguity >>>>>>>>>>>>>>>>>> there either. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Potential inconsistent naming of catalog is a separate >>>>>>>>>>>>>>>>>> problem, which >>>>>>>>>>>>>>>>>> Iceberg view spec probably cannot solve. We can only >>>>>>>>>>>>>>>>>> recommend that >>>>>>>>>>>>>>>>>> catalog should be named consistently across usage for >>>>>>>>>>>>>>>>>> better >>>>>>>>>>>>>>>>>> interoperability on name references. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> This proposal is to change the fallback behavior to >>>>>>>>>>>>>>>>>> engine's session >>>>>>>>>>>>>>>>>> default catalog. I am not sure it is better than the >>>>>>>>>>>>>>>>>> current fallback >>>>>>>>>>>>>>>>>> behavior. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> > Today’s Spark behavior explicitly differs from this >>>>>>>>>>>>>>>>>> idea. Spark resolves table identifiers during view creation >>>>>>>>>>>>>>>>>> using the >>>>>>>>>>>>>>>>>> session’s default catalog, not a supplied `default-catalog`. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I would argue that is a Spark implementation issue for >>>>>>>>>>>>>>>>>> not conforming >>>>>>>>>>>>>>>>>> to the spec. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa >>>>>>>>>>>>>>>>>> <wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > Hi Jan, >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > Thanks again for continuing the discussion. I want to >>>>>>>>>>>>>>>>>> highlight a few fundamental issues around the interpretation >>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>> default-catalog: >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > Here is the real catch: >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > * default-catalog cannot logically be defined at view >>>>>>>>>>>>>>>>>> creation time. It would be circular: the view needs to exist >>>>>>>>>>>>>>>>>> before its >>>>>>>>>>>>>>>>>> metadata (and hence default-catalog) can exist. This is >>>>>>>>>>>>>>>>>> visible in Spark’s >>>>>>>>>>>>>>>>>> implementation, where `default-catalog` is not used. >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > * Introducing a creation-time default-catalog setting >>>>>>>>>>>>>>>>>> would require extending SQL syntax and engine APIs to >>>>>>>>>>>>>>>>>> promote it to a >>>>>>>>>>>>>>>>>> first-class view concept. This would be intrusive, >>>>>>>>>>>>>>>>>> non-intuitive, and >>>>>>>>>>>>>>>>>> realistically very difficult to standardize across engines. >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > * Today’s Spark behavior explicitly differs from this >>>>>>>>>>>>>>>>>> idea. Spark resolves table identifiers during view creation >>>>>>>>>>>>>>>>>> using the >>>>>>>>>>>>>>>>>> session’s default catalog, not a supplied `default-catalog`. 
>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > * Hypothetically even if we patched in a creation-time >>>>>>>>>>>>>>>>>> default-catalog, it would create an inconsistent binding >>>>>>>>>>>>>>>>>> model between >>>>>>>>>>>>>>>>>> tables vs views (early vs late), and between tables in views >>>>>>>>>>>>>>>>>> and in queries >>>>>>>>>>>>>>>>>> (again early vs late). For example, views and tables in >>>>>>>>>>>>>>>>>> queries can >>>>>>>>>>>>>>>>>> withstand default catalog renames, but tables cannot when >>>>>>>>>>>>>>>>>> they are used >>>>>>>>>>>>>>>>>> inside views -- it even applies to views inside views, which >>>>>>>>>>>>>>>>>> makes this >>>>>>>>>>>>>>>>>> very hard to reason about considering nesting. >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > Thanks, >>>>>>>>>>>>>>>>>> > Walaa >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul >>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> >>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote: >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> @Walaa: >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> I would argue that when you run a CREATE VIEW >>>>>>>>>>>>>>>>>> statement the query engine knowns which catalog the view is >>>>>>>>>>>>>>>>>> being created >>>>>>>>>>>>>>>>>> in. So even though we typically use late binding to resolve >>>>>>>>>>>>>>>>>> the view >>>>>>>>>>>>>>>>>> catalog at query time, it can also be used at creation time. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> The query engine would need to keep track of the "view >>>>>>>>>>>>>>>>>> catalog" where the view is going to be created in. It can >>>>>>>>>>>>>>>>>> use that catalog >>>>>>>>>>>>>>>>>> to resolve partial table identifiers if "default-catalog" is >>>>>>>>>>>>>>>>>> not set. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> It can lead to some unintuitive behavior, where >>>>>>>>>>>>>>>>>> partial identifiers in the view query resolve to a different >>>>>>>>>>>>>>>>>> catalog >>>>>>>>>>>>>>>>>> compared to using them outside of a view. 
>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * >>>>>>>>>>>>>>>>>> from sales.orders; >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> If the session default catalog is not "catalogA", the >>>>>>>>>>>>>>>>>> "sales.orders" in the view query would not be the same as >>>>>>>>>>>>>>>>>> just referencing >>>>>>>>>>>>>>>>>> "sales.orders" in a normal SQL statement. This is because >>>>>>>>>>>>>>>>>> without a >>>>>>>>>>>>>>>>>> "default-catalog", the catalog name of "sales.orders" would >>>>>>>>>>>>>>>>>> default to >>>>>>>>>>>>>>>>>> "catalogA", which is the view's catalog. >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> Thanks, >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> Jan >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> On 4/25/25 04:05, Manu Zhang wrote: >>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>> >>> For example, if we want to validate that the tables >>>>>>>>>>>>>>>>>> referenced in the view exist, how can we do that when >>>>>>>>>>>>>>>>>> default-catalog isn't >>>>>>>>>>>>>>>>>> defined, since the view hasn't been created or loaded yet? >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> I don't think this is related to view spec. How do we >>>>>>>>>>>>>>>>>> validate that a table exists without a default catalog, or >>>>>>>>>>>>>>>>>> do we always use >>>>>>>>>>>>>>>>>> the current session catalog? >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> Thanks, >>>>>>>>>>>>>>>>>> >> Manu >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin Moustafa < >>>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>> >>> Hi Jan, >>>>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>>>> >>> I think we still share the same understanding. Just >>>>>>>>>>>>>>>>>> to clarify: when I referred to late binding as “similar” to >>>>>>>>>>>>>>>>>> the proposal, I >>>>>>>>>>>>>>>>>> was acknowledging the distinction between view-level and >>>>>>>>>>>>>>>>>> table-level >>>>>>>>>>>>>>>>>> resolution. 
>>>>>>>>>>>>>>>>>> >>> But as you noted, both follow a late binding model.
>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>> >>> That said, this still raises an interesting question and a potential gap: if default-catalog is only defined at query time, how should resolution work during view creation? For example, if we want to validate that the tables referenced in the view exist, how can we do that when default-catalog isn't defined, since the view hasn't been created or loaded yet?
>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>> >>> Thanks,
>>>>>>>>>>>>>>>>>> >>> Walaa.
>>>>>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>>>>>> >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> Yes, I have the same understanding. The view catalog is resolved at query time.
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> As you mentioned before, it's good to distinguish between the physical catalog and its reference used in SQL statements. The important part is that the physical catalog of the view and the tables referenced in its definition stay consistent. You could create a view in a given physical catalog by referring to it as "catalogA", as in your first point. If you then, given a different setup, refer to the same physical catalog as "catalogB" in another session/environment, the behavior should still work.
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> I would however rephrase your last point.
>>>>>>>>>>>>>>>>>> >>>> Late binding applies to the view catalog name and by extension to all partial table references when no "default-catalog" is present. Resolving the view catalog name at query time is not opposed to storing the view metadata in a catalog.
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> Or maybe I don't entirely understand what you mean.
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> Thanks
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> Jan
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> Hi Jan,
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> > The view is executed when it's being referenced in a SQL statement. That statement contains the information for the query engine to resolve the catalog of the view.
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> If I’m understanding correctly, that means:
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> * If the view is queried as SELECT * FROM catalogA.namespace.view, then catalogA is considered the view’s catalog.
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> * If the same view is later queried as SELECT * FROM catalogB.namespace.view (after renaming catalogA to catalogB, and keeping everything else the same), then catalogB becomes the view’s catalog.
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> Is that interpretation correct? If so, it sounds to me like the catalog is resolved at query time, based on how the view is referenced, not from any stored metadata.
>>>>>>>>>>>>>>>>>> >>>> That would imply some sort of a late binding behavior (similar to the proposal), as opposed to using some catalog that "stores" the view definition.
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> Thanks,
>>>>>>>>>>>>>>>>>> >>>> Walaa
>>>>>>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>>>>>>> >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> Hi Walaa,
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> Thanks for clarifying the aspects of non-determinism. Let me try to address your questions.
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> 1. This is my interpretation of the current spec: The view is executed when it's being referenced in a SQL statement. That statement contains the information for the query engine to resolve the catalog of the view. The query engine then uses that information to fetch the view metadata from the catalog. It also needs to temporarily keep track of which catalog it used to fetch the view metadata. It can then use that information to resolve the table references in the view's SQL definition in case no default catalog is specified.
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> 2. The important part is that the catalog can be referenced at execution time. As long as that's the case I would assume the view can be created in any catalog.
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> I think your point is really valuable because the current specification can lead to some unintuitive behavior. For example, take the following statement:
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> If the session default catalog is not "catalogA", the "sales.orders" in the view query would not be the same as just referencing "sales.orders" in a normal SQL statement. This is because without a "default-catalog", the catalog name of "sales.orders" would default to "catalogA".
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> However, I like the current design of the view spec, because it has the "closure" property. Because of the fact that the "view catalog" has to be known when executing a view, all the information required to resolve the table identifiers is contained in the view metadata (and the "view catalog"). I think that if you make the identifier resolution dependent on external parameters, it hinders portability.
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> Thanks,
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> Jan
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> Hi Jan,
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> Thanks for the thoughtful feedback.
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> I think it’s important we clarify a key point before going deeper:
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> Non-determinism is not caused by session fallback behavior; it’s a fundamental limitation of using table identifiers alone, regardless of whether we use the current rule, the proposed fallback to the session’s default catalog, or even early vs. late binding.
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> The same fully qualified identifier (e.g., catalogA.namespace.table) can resolve to different objects depending solely on engine-specific routing logic or catalog aliases. So determinism isn’t guaranteed just because an identifier is "fully qualified." The only reliable anchor for identity is the UUID. That’s why the proposed use of UUIDs is not just a hardening strategy. It’s the actual fix for correctness.
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> To move the conversation forward, could you help clarify two things in the context of the current spec:
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> * Where in the metadata is the “view catalog” stored, so that an engine knows to fall back to it if default-catalog is null?
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> * Are we even allowed to create views in the session's default catalog (i.e., without specifying a catalog) in the current Iceberg spec?
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> These questions are important because if we can’t unambiguously recover the "view catalog" from metadata, then defaulting to it is problematic. And if views can't be created in the default catalog, then the fallback rule doesn’t generalize.
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> Thanks,
>>>>>>>>>>>>>>>>>> >>>>> Walaa.
>>>>>>>>>>>>>>>>>> >>>>>
>>>>>>>>>>>>>>>>>> >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Hi Walaa,
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Thank you for your proposal. If I understood correctly, your proposal is composed of three parts:
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> - session default catalog as fallback for "default-catalog"
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> - session default namespace as fallback for "default-namespace"
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> - Late binding + UUID validation
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> I have some comments regarding these points.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> 1. Session default catalog as fallback for "default-catalog"
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Introducing a behavior that depends on the current session setup is in my opinion the definition of "non-determinism".
>>>>>>>>>>>>>>>>>> >>>>>> You could be running the same query-engine and catalog-setup on different days, with different default session catalogs (which is rather common), and would be getting different results.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Whereas with the current behavior, the view always produces the same results. The current behavior has some rough edges in very niche use cases but I think is solid for most use cases.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> 2. Session default namespace as fallback for "default-namespace"
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Similar to the above.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> 3. Late binding + UUID validation
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> If I understand it correctly, the current implementation already uses late binding.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Generally, having UUID validation makes the setup more robust, which is great. However, having UUID validation still requires us to have a portable table identifier specification. Even if we have the UUIDs of the referenced tables from the view, there simply isn't an interface that lets us use those UUIDs. The catalog interface is defined in terms of table identifiers.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> So we always require a working catalog setup and suitable table identifiers to obtain the table metadata. We can use the UUIDs to verify if we loaded the correct table.
>>>>>>>>>>>>>>>>>> >>>>>> But this can only be done after we used some identifier, which means there is no way of using UUIDs without a functioning catalog/identifier setup.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> In conclusion, I prefer the current behavior for "default-catalog" because it is more deterministic in my opinion. And I think the current spec does a good job for multi-engine table identifier resolution. I see the UUID validation as more of an additional hardening strategy.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Thanks
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Jan
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Thanks Renjie!
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> The existing spec has some guidance on resolving catalogs on the fly already (to address the case of view text with table identifiers missing the catalog part). The guidance is to use the catalog where the view is stored. But I find this rule hard to interpret or use. The catalog itself is a logical construct, such as a federated catalog that delegates to multiple physical backends (e.g., HMS and REST). In such cases, the catalog (e.g., `my_catalog` in `my_catalog.namespace1.table1`) doesn’t physically store the tables; it only routes requests to underlying stores.
>>>>>>>>>>>>>>>>>> >>>>>> Therefore, defaulting identifier resolution based on the catalog where the view is "stored" doesn’t align with how catalogs actually behave in practice.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> Thanks,
>>>>>>>>>>>>>>>>>> >>>>>> Walaa.
>>>>>>>>>>>>>>>>>> >>>>>>
>>>>>>>>>>>>>>>>>> >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>> Hi, Walaa:
>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>> Thanks for the proposal.
>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>> I've reviewed the doc, but in general I have some concerns with resolving catalog names on the fly with query engine defined catalog names. This introduces some flexibility at first glance, but also makes misconfiguration difficult to explain.
>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>> But I agree with one part: we should store the resolved table UUID in view metadata, as table/view renaming may introduce errors that are difficult for users to understand.
>>>>>>>>>>>>>>>>>> >>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>> Looking forward to keeping up the momentum and closing out the MV spec as well. I’m hoping we can proceed to a vote next week.
>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>> Here is a summary in case that helps.
>>>>>>>>>>>>>>>>>> >>>>>>>> The proposal outlines a strategy for handling table identifiers in Iceberg view metadata, with the goal of ensuring correctness, portability, and engine compatibility. It recommends resolving table identifiers at read time (late binding) rather than creation time, and introduces UUID-based validation to maintain identity guarantees across engines or sessions. It also revises how default-catalog and default-namespace are handled (defaulting both to the session context if not explicitly set) to better align with engine behavior and improve cross-engine interoperability.
>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>> Please let me know your thoughts.
>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> >>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>> >>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>> Thanks Eduard and Sung! I have addressed the comments.
>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>> One key point to keep in mind is that catalog names in the spec refer to logical catalogs, i.e., the first part of a three-part identifier. These correspond to Spark's DataSourceV2 catalogs, Trino connectors, and similar constructs. This is a level of abstraction above physical catalogs, which are not referenced or used in the view spec.
>>>>>>>>>>>>>>>>>> >>>>>>>>> The reason is that table identifiers in the view definition/text itself refer to logical catalogs, not physical ones (since they interface directly with the engine and not a specific metastore).
>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> >>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun <sungwy...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>> Thank you Walaa for the proposal. I think view portability is a very important topic for us to continue discussing, as it relies on many assumptions within the data ecosystem for it to function, as you've highlighted well in the document.
>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>> I've added a few comments around how this may impact the permission questions the engines will be asking, and whether that is the desired behavior.
>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>> Sung
>>>>>>>>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Thanks Walaa for tackling this problem. I've added a few comments to get a better understanding of what this will look like in the actual implementation.
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>>> Eduard
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Starting this thread to resume our discussion on how to reference table identifiers from Iceberg metadata, a key aspect of the view specification, particularly in relation to the MV (materialized view) extensions.
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> I had the chance to speak offline with a few community members to better understand how the current spec is being interpreted. Those conversations served as inputs to a new proposal on how table identifier references could be represented in metadata.
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> You can find the proposal here [1]. I look forward to your feedback and working together to move this forward so we can finalize the MV spec as well.
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> [1] https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>> Walaa.
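
For readers following the thread, the resolution rule debated above (use "default-catalog" if set, otherwise the catalog through which the view was referenced, then check the loaded table against a stored UUID) can be sketched roughly as below. This is only an illustration under stated assumptions: `ViewContext`, `resolve`, `load_and_validate` and the `load_uuid` callback are hypothetical names and not part of the Iceberg spec or any library, and namespaces are assumed to be single-level so that one, two and three-part identifiers can be told apart.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# (catalog, namespace, table); hypothetical shape used only in this sketch.
Resolved = Tuple[str, Tuple[str, ...], str]


@dataclass(frozen=True)
class ViewContext:
    """Hypothetical container; field names mirror the thread, not a real API."""
    view_catalog: str                    # catalog name used to reference the view
    default_catalog: Optional[str]       # "default-catalog" from view metadata
    default_namespace: Tuple[str, ...]   # "default-namespace" from view metadata


def resolve(identifier: str, ctx: ViewContext) -> Resolved:
    """Expand a table reference found in the view SQL to a full identifier.

    Assumes single-level namespaces: 1 part = table, 2 parts = namespace.table,
    3 parts = catalog.namespace.table.
    """
    parts = identifier.split(".")
    # Fallback discussed in the thread: default-catalog, else the view's catalog.
    fallback_catalog = ctx.default_catalog or ctx.view_catalog
    if len(parts) == 1:
        return (fallback_catalog, ctx.default_namespace, parts[0])
    if len(parts) == 2:
        return (fallback_catalog, (parts[0],), parts[1])
    if len(parts) == 3:
        return (parts[0], (parts[1],), parts[2])
    raise ValueError(f"unsupported identifier in this sketch: {identifier}")


def load_and_validate(resolved: Resolved, expected_uuid: str,
                      load_uuid: Callable[[Resolved], str]) -> Resolved:
    """Load by identifier first, then compare UUIDs (validation, not lookup)."""
    actual = load_uuid(resolved)  # stand-in for a catalog table load
    if actual != expected_uuid:
        raise ValueError(
            f"UUID mismatch for {resolved}: expected {expected_uuid}, got {actual}")
    return resolved


# Jan's example: the view lives in catalogA and has no default-catalog.
ctx = ViewContext(view_catalog="catalogA", default_catalog=None,
                  default_namespace=("sales",))
fake_catalog = {("catalogA", ("sales",), "orders"): "uuid-1"}  # made-up data
table = resolve("sales.orders", ctx)
load_and_validate(table, "uuid-1", lambda ident: fake_catalog[ident])
print(table)
```

As Jan points out, the UUID check can only run after an identifier has been resolved and the table loaded, so in this sketch it acts as a safety net on top of the resolution rule and does not replace it.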