Re: [DISCUSS] Table Identifiers in Iceberg View Spec

Jan Kaul Mon, 28 Apr 2025 00:40:20 -0700

I think the intention with the "default-catalog" was that every query engine uses it to store its session default catalog at the time of creating the view. This way the view could be reused in another session. The idea was not to introduce an additional SQL syntax to set the default-catalog.

Generally we have different environments we want to support with the view spec:


1. Consistent catalog naming

When the environment supports it, using consistent catalog names can have a great benefit for multi-catalog, multi-engine setups. With consistent catalog names, using the "default-catalog" field works without any issues.


2. Inconsistent catalog naming

This can be the case when different query engines refer to the same physical catalog by different names. This often happens because different query engines use different strategies to set up the catalogs. If catalogs have inconsistent naming, using the "default-catalog" field does not work because it is not guaranteed that the catalog name can be resolved with another engine. Using the "view catalog" as a fallback is a better solution for this use case, as it avoids catalog names altogether. It is, however, limited to table references in the same catalog.

What do you think of introducing a view property that specifies if the "default-catalog" or the "view catalog" should be used? This way, you could use the "default-catalog" in environments where you can guarantee consistent naming, but you would be able to directly fall back to the "view catalog" when you don't have consistent naming. The query engines could set the default for this view property at creation time. Spark, for example, could set it to automatically use the "view catalog".
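
To make the idea concrete, here is a rough sketch of how an engine might apply such a property. This is only an illustration: the property name "catalog-resolution", its two values, the TableRef type and the qualify function are all hypothetical and not part of the view spec.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple

# Hypothetical view property and values; nothing like this exists in the spec today.
RESOLUTION_PROPERTY = "catalog-resolution"
USE_DEFAULT_CATALOG = "default-catalog"
USE_VIEW_CATALOG = "view-catalog"


@dataclass(frozen=True)
class TableRef:
    """A table reference parsed from the view SQL; the catalog part may be absent."""
    catalog: Optional[str]
    namespace: Tuple[str, ...]
    name: str


def qualify(
    ref: TableRef,
    view_catalog: str,
    default_catalog: Optional[str],
    properties: Dict[str, str],
) -> TableRef:
    """Fill in the catalog of a partial table reference."""
    if ref.catalog is not None:
        # Already carries a catalog name; nothing to decide.
        return ref
    mode = properties.get(RESOLUTION_PROPERTY, USE_DEFAULT_CATALOG)
    if mode == USE_DEFAULT_CATALOG and default_catalog is not None:
        # Environment with consistent catalog naming.
        return replace(ref, catalog=default_catalog)
    # Either the property asks for the view catalog, or no default-catalog is set.
    return replace(ref, catalog=view_catalog)
```

With the property absent, this behaves like the current fallback (use "default-catalog" if it is set, otherwise the view catalog). An engine that cannot guarantee consistent naming would write the "view-catalog" value at creation time.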


Thanks

Jan


On 4/26/25 05:33, Walaa Eldin Moustafa wrote:

To help folks catch up on the latest discussions and interpretation of the spec, I have summarized everything we discussed so far at the top of the proposal document (here <https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0>). I have slightly updated the proposal to be in sync with the new interpretation to avoid confusion. In summary:

* Remove default-catalog and default-namespace fields from the view spec completely.

* Hence, we do not attempt to define separate view-level default catalogs or namespaces.


Instead:

* If a table identifier inside a view lacks a catalog qualifier, engines should resolve it using the current engine catalog at query time.

* Reference table identifiers in the metadata exactly as they appear in the view SQL text.

* If an identifier lacks the catalog part at creation, it should still lack a catalog in the stored metadata.


* Store UUIDs alongside table identifiers whenever possible.
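
As a rough illustration of how these points could fit together at read time, here is a minimal sketch. It assumes a hypothetical load_table(catalog, table) callback that returns table metadata as a dict with a "table-uuid" entry. The function and exception names are made up for the example and are not an existing API.

```python
from typing import Callable, Dict, Optional


class TableIdentityMismatch(Exception):
    """Raised when the loaded table is not the one recorded in the view metadata."""


def load_referenced_table(
    catalog: Optional[str],
    table: str,
    stored_uuid: Optional[str],
    session_catalog: str,
    load_table: Callable[[str, str], Dict[str, str]],
) -> Dict[str, str]:
    # Late binding: an identifier stored without a catalog is qualified with
    # the engine's current catalog at query time.
    effective_catalog = catalog if catalog is not None else session_catalog

    # The table can only be fetched through an identifier.
    metadata = load_table(effective_catalog, table)

    # The UUID is checked after loading, and only if one was stored.
    if stored_uuid is not None and metadata.get("table-uuid") != stored_uuid:
        raise TableIdentityMismatch(
            f"{effective_catalog}.{table} does not match stored UUID {stored_uuid}"
        )
    return metadata
```

Note that the UUID only validates the result; the lookup itself still goes through the identifier, so a working catalog setup is required either way.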

Thanks,
Walaa.

On Fri, Apr 25, 2025 at 5:18 PM Walaa Eldin Moustafa<wa.moust...@gmail.com> wrote:


    Thanks for the contribution Benny! +1 to the confusion the
    fallback creates. Also just to be clear, at this point and after
    clarifying the current spec intentions, I am convinced that we
    should remove the default catalog and default namespace fields
    altogether.

    Thanks,
    Walaa.

    On Fri, Apr 25, 2025 at 5:13 PM Benny Chow <btc...@gmail.com> wrote:

        I'd like to contribute my opinions on this:

        - I don't particularly like the current behavior of "default
        to the view's catalog when default-catalog is not set". 
        Fundamentally, I believe the intent of default-catalog and
        default-namespace is there to help users write more concise SQL.
        - The Spark session catalog is engine-specific and I don't think
        we should design something that says first use this catalog,
        then that catalog.. or that catalog.  For example, resolving
        identifiers using default-catalog -> view's catalog -> session
        catalog is not good.
        - We gotta support non-Iceberg tables otherwise I see no value
        in putting views in the catalog to share with other engines
        - Interoperability between different engine types is very hard
        due to dialect issues... so I think we should focus on
        supporting different clusters of the same engine type on a
        shared catalog.  For example, AI and BI clusters on Spark
        sharing the same views in a REST catalog.

        Coincidentally, I think the ultimate solution is along the
        lines of something Russell proposed last year:

        https://lists.apache.org/thread/hoskfx8y3kvrcww52l4w9dxghp3pnlm7

        We've been looking at this interoperable identifier problem
        through the lens of catalog resolution but maybe the right
        approach is really about templating.

        I would extend Russell's idea to allow identifiers in a view
        to span catalogs to support non-Iceberg tables.   Also, the
        default-catalog property could be templated as well.

        Thoughts?
        Benny



        On Fri, Apr 25, 2025 at 4:02 PM Walaa Eldin Moustafa
        <wa.moust...@gmail.com> wrote:

            Thanks Steven! How do you recommend making Spark
            implementation conform to the spec? Do we need Spark SQL
            extensions and/or Spark catalog APIs for that?

            How do you recommend reconciling the inconsistencies I
            shared regarding many resolution methods not consistently
            being followed in different scenarios (view vs child table
            resolution, query vs view resolution)? Note these occur
            when the default catalog is set to a non-null value. If it
            helps, I can share concrete examples.

            Thanks,
            Walaa.

            On Fri, Apr 25, 2025 at 3:52 PM Steven Wu
            <stevenz...@gmail.com> wrote:

                The core issue is on the fall back behavior when
                `default-catalog` is
                not defined. Current view spec says the fallback
                should be the catalog
                where the view is defined. It doesn't really matter
                what the catalog
                is named (catalogX) by the read engine.
                - If a view refers to the tables in the same catalog,
                this is a
                non-ambiguous and reasonable fallback behavior.
                - If a view refers to tables from another catalog,
                catalog names
                should be included in the reference name already. So
                no ambiguity
                there either.

                Potential inconsistent naming of catalog is a separate
                problem, which
                Iceberg view spec probably cannot solve. We can only
                recommend that
                catalog should be named consistently across usage for
                better
                interoperability on name references.

                This proposal is to change the fallback behavior to
                engine's session
                default catalog. I am not sure it is better than the
                current fallback
                behavior.

                > Today’s Spark behavior explicitly differs from this
                idea. Spark resolves table identifiers during view
                creation using the session’s default catalog, not a
                supplied `default-catalog`.

                I would argue that is a Spark implementation issue for
                not conforming
                to the spec.


                On Fri, Apr 25, 2025 at 1:17 PM Walaa Eldin Moustafa
                <wa.moust...@gmail.com> wrote:
                >
                > Hi Jan,
                >
                > Thanks again for continuing the discussion. I want
                to highlight a few fundamental issues around the
                interpretation of default-catalog:
                >
                > Here is the real catch:
                >
                > * default-catalog cannot logically be defined at
                view creation time. It would be circular: the view
                needs to exist before its metadata (and hence
                default-catalog) can exist. This is visible in Spark’s
                implementation, where `default-catalog` is not used.
                >
                > * Introducing a creation-time default-catalog
                setting would require extending SQL syntax and engine
                APIs to promote it to a first-class view concept. This
                would be intrusive, non-intuitive, and realistically
                very difficult to standardize across engines.
                >
                > * Today’s Spark behavior explicitly differs from
                this idea. Spark resolves table identifiers during
                view creation using the session’s default catalog, not
                a supplied `default-catalog`.
                >
                > * Hypothetically even if we patched in a
                creation-time default-catalog, it would create an
                inconsistent binding model between tables vs views
                (early vs late), and between tables in views and in
                queries (again early vs late). For example, views and
                tables in queries can withstand default catalog
                renames, but tables cannot when they are used inside
                views -- it even applies to views inside views, which
                makes this very hard to reason about considering nesting.
                >
                > Thanks,
                > Walaa
                >
                > On Fri, Apr 25, 2025 at 7:00 AM Jan Kaul
                <jank...@mailbox.org.invalid> wrote:
                >>
                >> @Walaa:
                >>
                >> I would argue that when you run a CREATE VIEW
                statement the query engine knows which catalog the
                view is being created in. So even though we typically
                use late binding to resolve the view catalog at query
                time, it can also be used at creation time.
                >>
                >> The query engine would need to keep track of the
                "view catalog" where the view is going to be created
                in. It can use that catalog to resolve partial table
                identifiers if "default-catalog" is not set.
                >>
                >> It can lead to some unintuitive behavior, where
                partial identifiers in the view query resolve to a
                different catalog compared to using them outside of a
                view.
                >>
                >> CREATE VIEW catalogA.sales.monthly_orders AS SELECT
                * from sales.orders;
                >>
                >> If the session default catalog is not "catalogA",
                the "sales.orders" in the view query would not be the
                same as just referencing "sales.orders" in a normal
                SQL statement. This is because without a
                "default-catalog", the catalog name of "sales.orders"
                would default to "catalogA", which is the view's catalog.
                >>
                >> Thanks,
                >>
                >> Jan
                >>
                >> On 4/25/25 04:05, Manu Zhang wrote:
                >>>
                >>> For example, if we want to validate that the
                tables referenced in the view exist, how can we do
                that when default-catalog isn't defined, since the
                view hasn't been created or loaded yet?
                >>
                >> I don't think this is related to view spec. How do
                we validate that a table exists without a default
                catalog, or do we always use the current session catalog?
                >>
                >> Thanks,
                >> Manu
                >>
                >> On Fri, Apr 25, 2025 at 5:59 AM Walaa Eldin
                Moustafa <wa.moust...@gmail.com> wrote:
                >>>
                >>> Hi Jan,
                >>>
                >>> I think we still share the same understanding.
                Just to clarify: when I referred to late binding as
                “similar” to the proposal, I was acknowledging the
                distinction between view-level and table-level
                resolution. But as you noted, both follow a late
                binding model.
                >>>
                >>> That said, this still raises an interesting
                question and a potential gap: if default-catalog is
                only defined at query time, how should resolution work
                during view creation? For example, if we want to
                validate that the tables referenced in the view exist,
                how can we do that when default-catalog isn't defined,
                since the view hasn't been created or loaded yet?
                >>>
                >>> Thanks,
                >>> Walaa.
                >>>
                >>> On Thu, Apr 24, 2025 at 7:02 AM Jan Kaul
                <jank...@mailbox.org.invalid> wrote:
                >>>>
                >>>> Yes, I have the same understanding. The view
                catalog is resolved at query time.
                >>>>
                >>>> As you mentioned before, it's good to distinguish
                between the physical catalog and its reference used
                in SQL statements. The important part is that the
                physical catalog of the view and the tables referenced
                in its definition stay consistent. You could create a
                view in a given physical catalog by referring to it as
                "catalogA", as in your first point. If you then, given
                a different setup, refer to the same physical catalog
                as "catalogB" in another session/environment, the
                behavior should still work.
                >>>>
                >>>> I would however rephrase your last point. Late
                binding applies to the view catalog name and by
                extension to all partial table references when no
                "default-catalog" is present. Resolving the view
                catalog name at query time is not opposed to storing
                the view metadata in a catalog.
                >>>>
                >>>> Or maybe I don't entirely understand what you mean.
                >>>>
                >>>> Thanks
                >>>>
                >>>> Jan
                >>>>
                >>>> On 4/24/25 00:32, Walaa Eldin Moustafa wrote:
                >>>>
                >>>> Hi Jan,
                >>>>
                >>>> > The view is executed when it's being referenced
                in a SQL statement. That statement contains the
                information for the query engine to resolve the
                catalog of the view.
                >>>>
                >>>> If I’m understanding correctly, that means:
                >>>>
                >>>> * If the view is queried as SELECT * FROM
                catalogA.namespace.view, then catalogA is considered
                the view’s catalog.
                >>>>
                >>>> * If the same view is later queried as SELECT *
                FROM catalogB.namespace.view (after renaming catalogA
                to catalogB, and keeping everything else the same),
                then catalogB becomes the view’s catalog.
                >>>>
                >>>> Is that interpretation correct? If so, it sounds
                to me like the catalog is resolved at query time,
                based on how the view is referenced, not from any
                stored metadata. That would imply some sort of a late
                binding behavior (similar to the proposal), as opposed
                to using some catalog that "stores" the view definition.
                >>>>
                >>>> Thanks,
                >>>> Walaa
                >>>>
                >>>> On Tue, Apr 22, 2025 at 11:01 AM Jan Kaul
                <jank...@mailbox.org.invalid> wrote:
                >>>>>
                >>>>> Hi Walaa,
                >>>>>
                >>>>> Thanks for clarifying the aspects of
                non-determinism. Let me try to address your questions.
                >>>>>
                >>>>> 1. This is my interpretation of the current
                spec: The view is executed when it's being referenced
                in a SQL statement. That statement contains the
                information for the query engine to resolve the
                catalog of the view. The query engine then uses that
                information to fetch the view metadata from the
                catalog. It also needs to temporarily keep track of
                which catalog it used to fetch the view metadata. It
                can then use that information to resolve the table
                references in the view's SQL definition in case no
                default catalog is specified.
                >>>>>
                >>>>> 2. The important part is that the catalog can be
                referenced at execution time. As long as that's the
                case I would assume the view can be created in any
                catalog.
                >>>>>
                >>>>>
                >>>>> I think your point is really valuable because
                the current specification can lead to some unintuitive
                behavior. For example for the following statement:
                >>>>>
                >>>>> CREATE VIEW catalogA.sales.monthly_orders AS
                SELECT * from sales.orders;
                >>>>>
                >>>>> If the session default catalog is not
                "catalogA", the "sales.orders" in the view query would
                not be the same as just referencing "sales.orders" in
                a normal SQL statement. This is because without a
                "default-catalog", the catalog name of "sales.orders"
                would default to "catalogA".
                >>>>>
                >>>>>
                >>>>> However, I like the current design of the view
                spec, because it has the "closure" property. Because
                of the fact that the "view catalog" has to be known
                when executing a view, all the information required to
                resolve the table identifiers is contained in the view
                metadata (and the "view catalog"). I think that if you
                make the identifier resolution dependent on external
                parameters, it hinders portability.
                >>>>>
                >>>>> Thanks,
                >>>>>
                >>>>> Jan
                >>>>>
                >>>>> On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
                >>>>>
                >>>>> Hi Jan,
                >>>>>
                >>>>> Thanks for the thoughtful feedback.
                >>>>>
                >>>>> I think it’s important we clarify a key point
                before going deeper:
                >>>>>
                >>>>> Non-determinism is not caused by session
                fallback behavior—it’s a fundamental limitation of
                using table identifiers alone, regardless of whether
                we use the current rule, the proposed fallback to the
                session’s default catalog, or even early vs. late binding.
                >>>>>
                >>>>> The same fully qualified identifier (e.g.,
                catalogA.namespace.table) can resolve to different
                objects depending solely on engine-specific routing
                logic or catalog aliases. So determinism isn’t
                guaranteed just because an identifier is "fully
                qualified." The only reliable anchor for identity is
                the UUID. That’s why the proposed use of UUIDs is not
                just a hardening strategy. It’s the actual fix for
                correctness.
                >>>>>
                >>>>> To move the conversation forward, could you help
                clarify two things in the context of the current spec:
                >>>>>
                >>>>> * Where in the metadata is the “view catalog”
                stored, so that an engine knows to fall back to it if
                default-catalog is null?
                >>>>>
                >>>>> * Are we even allowed to create views in the
                session's default catalog (i.e., without specifying a
                catalog) in the current Iceberg spec?
                >>>>>
                >>>>> These questions are important because if we
                can’t unambiguously recover the "view catalog" from
                metadata, then defaulting to it is problematic. And if
                views can't be created in the default catalog, then
                the fallback rule doesn’t generalize.
                >>>>>
                >>>>> Thanks,
                >>>>> Walaa.
                >>>>>
                >>>>>
                >>>>> On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul
                <jank...@mailbox.org.invalid> wrote:
                >>>>>>
                >>>>>> Hi Walaa,
                >>>>>>
                >>>>>> thank you for your proposal. If I understood
                correctly, your proposal is composed of three parts:
                >>>>>>
                >>>>>> - session default catalog as fallback for
                "default-catalog"
                >>>>>>
                >>>>>> - session default namespace as fallback for
                "default-namespace"
                >>>>>>
                >>>>>> - Late binding + UUID validation
                >>>>>>
                >>>>>> I have some comments regarding these points.
                >>>>>>
                >>>>>>
                >>>>>> 1. Session default catalog as fallback for
                "default-catalog"
                >>>>>>
                >>>>>> Introducing a behavior that depends on the
                current session setup is in my opinion the definition
                of "non-determinism". You could be running the same
                query-engine and catalog-setup on different days, with
                different default session catalogs (which is rather
                common), and would be getting different results.
                >>>>>>
                >>>>>> Whereas with the current behavior, the view
                always produces the same results. The current behavior
                has some rough edges in very niche use cases but I
                think it is solid for most use cases.
                >>>>>>
                >>>>>> 2. Session default namespace as fallback for
                "default-namespace"
                >>>>>>
                >>>>>> Similar to the above.
                >>>>>>
                >>>>>> 3. Late binding + UUID validation
                >>>>>>
                >>>>>> If I understand it correctly, the current
                implementation already uses late binding.
                >>>>>>
                >>>>>> Generally, having UUID validation makes the
                setup more robust. Which is great. However, having
                UUID validation still requires us to have a portable
                table identifier specification. Even if we have the
                UUIDs of the referenced tables from the view, there
                simply isn't an interface that lets us use those
                UUIDs. The catalog interface is defined in terms of
                table identifiers.
                >>>>>>
                >>>>>> So we always require a working catalog setup
                and suitable table identifiers to obtain the table
                metadata. We can use the UUIDs to verify if we loaded
                the correct table. But this can only be done after we
                used some identifier. Which means there is no way of
                using UUIDs without a functioning catalog/identifier
                setup.
                >>>>>>
                >>>>>>
                >>>>>> In conclusion, I prefer the current behavior
                for "default-catalog" because it is more deterministic
                in my opinion. And I think the current spec does a
                good job for multi-engine table identifier resolution.
                I see the UUID validation more of an additional
                hardening strategy.
                >>>>>>
                >>>>>> Thanks
                >>>>>>
                >>>>>> Jan
                >>>>>>
                >>>>>> On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
                >>>>>>
                >>>>>> Thanks Renjie!
                >>>>>>
                >>>>>> The existing spec has some guidance on
                resolving catalogs on the fly already (to address the
                case of view text with table identifiers missing the
                catalog part). The guidance is to use the catalog
                where the view is stored. But I find this rule hard to
                interpret or use. The catalog itself is a logical
                construct—such as a federated catalog that delegates
                to multiple physical backends (e.g., HMS and REST). In
                such cases, the catalog (e.g., `my_catalog` in
                `my_catalog.namespace1.table1`) doesn’t physically
                store the tables; it only routes requests to
                underlying stores. Therefore, defaulting identifier
                resolution based on the catalog where the view is
                "stored" doesn’t align with how catalogs actually
                behave in practice.
                >>>>>>
                >>>>>> Thanks,
                >>>>>> Walaa.
                >>>>>>
                >>>>>> On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu
                <liurenjie2...@gmail.com> wrote:
                >>>>>>>
                >>>>>>> Hi, Walaa:
                >>>>>>>
                >>>>>>> Thanks for the proposal.
                >>>>>>>
                >>>>>>> I've reviewed the doc, but in general I have
                some concerns with resolving catalog names on the fly
                with query engine defined catalog names. This
                introduces some flexibility at first glance, but also
                makes misconfiguration difficult to explain.
                >>>>>>>
                >>>>>>> But I agree with one part that we should store
                resolved table uuid in view metadata, as table/view
                renaming may introduce errors that are difficult for
                users to understand.
                >>>>>>>
                >>>>>>> On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin
                Moustafa <wa.moust...@gmail.com> wrote:
                >>>>>>>>
                >>>>>>>> Hi Everyone,
                >>>>>>>>
                >>>>>>>> Looking forward to keeping up the momentum
                and closing out the MV spec as well. I’m hoping we can
                proceed to a vote next week.
                >>>>>>>>
                >>>>>>>> Here is a summary in case that helps. The
                proposal outlines a strategy for handling table
                identifiers in Iceberg view metadata, with the goal of
                ensuring correctness, portability, and engine
                compatibility. It recommends resolving table
                identifiers at read time (late binding) rather than
                creation time, and introduces UUID-based validation to
                maintain identity guarantees across engines, or
                sessions. It also revises how default-catalog and
                default-namespace are handled (defaulting both to the
                session context if not explicitly set) to better align
                with engine behavior and improve cross-engine
                interoperability.
                >>>>>>>>
                >>>>>>>> Please let me know your thoughts.
                >>>>>>>>
                >>>>>>>> Thanks,
                >>>>>>>> Walaa.
                >>>>>>>>
                >>>>>>>>
                >>>>>>>>
                >>>>>>>> On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin
                Moustafa <wa.moust...@gmail.com> wrote:
                >>>>>>>>>
                >>>>>>>>> Thanks Eduard and Sung! I have addressed the
                comments.
                >>>>>>>>>
                >>>>>>>>> One key point to keep in mind is that
                catalog names in the spec refer to logical
                catalogs—i.e., the first part of a three-part
                identifier. These correspond to Spark's DataSourceV2
                catalogs, Trino connectors, and similar constructs.
                This is a level of abstraction above physical
                catalogs, which are not referenced or used in the view
                spec. The reason is that table identifiers in the view
                definition/text itself refer to logical catalogs, not
                physical ones (since they interface directly with the
                engine and not a specific metastore).
                >>>>>>>>>
                >>>>>>>>> Thanks,
                >>>>>>>>> Walaa.
                >>>>>>>>>
                >>>>>>>>>
                >>>>>>>>> On Wed, Apr 16, 2025 at 6:15 AM Sung Yun
                <sungwy...@gmail.com> wrote:
                >>>>>>>>>>
                >>>>>>>>>> Thank you Walaa for the proposal. I think
                view portability is a very important topic for us to
                continue discussing as it relies on many assumptions
                within the data ecosystem for it to function like
                you've highlighted well in the document.
                >>>>>>>>>>
                >>>>>>>>>> I've added a few comments around how this
                may impact the permission questions the engines will
                be asking, and whether that is the desired behavior.
                >>>>>>>>>>
                >>>>>>>>>> Sung
                >>>>>>>>>>
                >>>>>>>>>> On Wed, Apr 16, 2025 at 7:32 AM Eduard
                Tudenhöfner <etudenhoef...@apache.org> wrote:
                >>>>>>>>>>>
                >>>>>>>>>>> Thanks Walaa for tackling this problem.
                I've added a few comments to get a better
                understanding of what this will look like in the actual
                implementation.
                >>>>>>>>>>>
                >>>>>>>>>>> Eduard
                >>>>>>>>>>>
                >>>>>>>>>>> On Tue, Apr 15, 2025 at 7:09 PM Walaa
                Eldin Moustafa <wa.moust...@gmail.com> wrote:
                >>>>>>>>>>>>
                >>>>>>>>>>>> Hi Everyone,
                >>>>>>>>>>>>
                >>>>>>>>>>>> Starting this thread to resume our
                discussion on how to reference table identifiers from
                Iceberg metadata, a key aspect of the view
                specification, particularly in relation to the MV
                (materialized view) extensions.
                >>>>>>>>>>>>
                >>>>>>>>>>>> I had the chance to speak offline with a
                few community members to better understand how the
                current spec is being interpreted. Those conversations
                served as inputs to a new proposal on how table
                identifier references could be represented in metadata.
                >>>>>>>>>>>>
                >>>>>>>>>>>> You can find the proposal here [1]. I
                look forward to your feedback and working together to
                move this forward so we can finalize the MV spec as well.
                >>>>>>>>>>>>
                >>>>>>>>>>>> [1]
                >>>>>>>>>>>> https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
                >>>>>>>>>>>>
                >>>>>>>>>>>> Thanks,
                >>>>>>>>>>>> Walaa.
