Hi Walaa,

Thanks for clarifying the aspects of non-determinism. Let me try to address your questions.

1. This is my interpretation of the current spec: The view is executed when it's being referenced in a SQL statement. That statement contains the information the query engine needs to resolve the catalog of the view. The query engine then uses that information to fetch the view metadata from the catalog. It also needs to temporarily keep track of which catalog it used to fetch the view metadata. It can then use that catalog to resolve the table references in the view's SQL definition in case no "default-catalog" is specified.

2. The important part is that the catalog can be referenced at execution time. As long as that's the case I would assume the view can be created in any catalog.
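
To make the flow in point 1 concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the CATALOGS registry, ViewMetadata, load_view and qualify are names I made up for illustration, not the Iceberg API), and it only reflects my reading of the spec:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ViewMetadata:
    sql: str
    # "default-catalog" from the view metadata; None means it is not set.
    default_catalog: Optional[str]


# Hypothetical engine-side registry: catalog name -> {(namespace, view): metadata}
CATALOGS = {
    "catalogA": {
        ("sales", "monthly_orders"): ViewMetadata(
            sql="SELECT * FROM sales.orders", default_catalog=None
        )
    }
}


def load_view(catalog_name, namespace, name):
    """Fetch the view metadata and remember which catalog served it."""
    metadata = CATALOGS[catalog_name][(namespace, name)]
    return metadata, catalog_name


def qualify(table_ref, metadata, view_catalog):
    """Add a catalog to a two-part table reference found in the view SQL."""
    parts = tuple(table_ref.split("."))
    if len(parts) == 3:
        return parts  # already carries a catalog
    # Fall back to the catalog the view was loaded from.
    catalog = metadata.default_catalog or view_catalog
    return (catalog, *parts)


metadata, view_catalog = load_view("catalogA", "sales", "monthly_orders")
resolved = qualify("sales.orders", metadata, view_catalog)
print(resolved)
```

The point of the sketch is that the "view catalog" never has to be stored in the metadata: the engine already knows it from the statement that referenced the view.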


I think your point is really valuable because the current specification can lead to some unintuitive behavior. For example, consider the following statement:

CREATE VIEW catalogA.sales.monthly_orders AS SELECT * from sales.orders;

If the session default catalog is not "catalogA", the "sales.orders" in the view query would not resolve to the same table as "sales.orders" referenced in a normal SQL statement. This is because, without a "default-catalog", the catalog name of "sales.orders" inside the view defaults to "catalogA", whereas a normal statement uses the session default catalog.
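
A small Python sketch of that difference, assuming a session whose default catalog is a hypothetical "catalogB" (both helper functions are illustrative, not part of any engine):

```python
def resolve_in_statement(table_ref, session_catalog):
    """Two-part reference in a normal statement: use the session default."""
    parts = table_ref.split(".")
    return parts if len(parts) == 3 else [session_catalog, *parts]


def resolve_in_view(table_ref, default_catalog, view_catalog):
    """Two-part reference inside a view: use "default-catalog" if set,
    otherwise the catalog the view was loaded from."""
    parts = table_ref.split(".")
    if len(parts) == 3:
        return parts
    return [default_catalog or view_catalog, *parts]


session_catalog = "catalogB"  # assumed session default, different from catalogA

in_statement = resolve_in_statement("sales.orders", session_catalog)
in_view = resolve_in_view("sales.orders", None, "catalogA")

print(".".join(in_statement))
print(".".join(in_view))
```

The same text "sales.orders" ends up pointing at two different catalogs, which is the unintuitive part.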


However, I like the current design of the view spec because it has the "closure" property. Because the "view catalog" has to be known when executing a view, all the information required to resolve the table identifiers is contained in the view metadata (and the "view catalog"). I think that making the identifier resolution dependent on external parameters, such as the session configuration, would hinder portability.

Thanks,

Jan

On 4/22/25 18:36, Walaa Eldin Moustafa wrote:
Hi Jan,

Thanks for the thoughtful feedback.

I think it’s important we clarify a key point before going deeper:

Non-determinism is not caused by session fallback behavior—it’s a *fundamental limitation of using table identifiers* alone, regardless of whether we use the current rule, the proposed fallback to the session’s default catalog, or even early vs. late binding.

The same fully qualified identifier (e.g., catalogA.namespace.table) can resolve to different objects depending solely on engine-specific routing logic or catalog aliases. So determinism isn’t guaranteed just because an identifier is "fully qualified." The only reliable anchor for identity is the UUID. That’s why the proposed use of UUIDs is not just a hardening strategy. It’s the actual fix for correctness.

To move the conversation forward, could you help clarify two things in the context of the current spec:

* Where in the metadata is the “view catalog” stored, so that an engine knows to fall back to it if default-catalog is null?

* Are we even allowed to create views in the session's default catalog (i.e., without specifying a catalog) in the current Iceberg spec?

These questions are important because if we can’t unambiguously recover the "view catalog" from metadata, then defaulting to it is problematic. And if views can't be created in the default catalog, then the fallback rule doesn’t generalize.

Thanks,
Walaa.


On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul <jank...@mailbox.org.invalid> wrote:

    Hi Walaa,

    thank you for your proposal. If I understood correctly, your
    proposal is composed of three parts:

    - session default catalog as fallback for "default-catalog"

    - session default namespace as fallback for "default-namespace"

    - Late binding + UUID validation

    I have some comments regarding these points.


            1. Session default catalog as fallback for "default-catalog"

    Introducing a behavior that depends on the current session setup
    is in my opinion the definition of "non-determinism". You could be
    running the same query-engine and catalog-setup on different days,
    with different default session catalogs (which is rather common),
    and would be getting different results.

    Whereas with the current behavior, the view always produces the
    same results. The current behavior has some rough edges in very
    niche use cases, but I think it is solid for most use cases.


            2. Session default namespace as fallback for
            "default-namespace"

    Similar to the above.


            3. Late binding + UUID validation

    If I understand it correctly, the current implementation already
    uses late binding.

    Generally, having UUID validation makes the setup more robust.
    Which is great. However, having UUID validation still requires us
    to have a portable table identifier specification. Even if we have
    the UUIDs of the referenced tables from the view, there simply
    isn't an interface that lets us use those UUIDs. The catalog
    interface is defined in terms of table identifiers.

    So we always require a working catalog setup and suitable table
    identifiers to obtain the table metadata. We can use the UUIDs to
    verify if we loaded the correct table. But this can only be done
    after we used some identifier. Which means there is no way of
    using UUIDs without a functioning catalog/identifier setup.


    In conclusion, I prefer the current behavior for "default-catalog"
    because it is more deterministic in my opinion. And I think the
    current spec does a good job for multi-engine table identifier
    resolution. I see the UUID validation more as an additional
    hardening strategy.

    Thanks

    Jan

    On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
    Thanks Renjie!

    The existing spec has some guidance on resolving catalogs on the
    fly already (to address the case of view text with table
    identifiers missing the catalog part). The guidance is to use the
    catalog where the view is stored. But I find this rule hard to
    interpret or use. The catalog itself is a logical construct—such
    as a federated catalog that delegates to multiple physical
    backends (e.g., HMS and REST). In such cases, the catalog (e.g.,
    `my_catalog` in `my_catalog.namespace1.table1`) doesn’t
    physically store the tables; it only routes requests to
    underlying stores. Therefore, defaulting identifier resolution
    based on the catalog where the view is "stored" doesn’t align
    with how catalogs actually behave in practice.

    Thanks,
    Walaa.

    On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu
    <liurenjie2...@gmail.com> wrote:

        Hi, Walaa:

        Thanks for the proposal.

        I've reviewed the doc, but in general I have some concerns
        with resolving catalog names on the fly with query engine
        defined catalog names. This introduces some flexibility at
        first glance, but also makes misconfiguration difficult to
        explain.

        But I agree with one part: we should store the resolved table
        UUID in view metadata, as table/view renaming may introduce
        errors that are difficult for users to understand.

        On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa
        <wa.moust...@gmail.com> wrote:

            Hi Everyone,

            Looking forward to keeping up the momentum and closing
            out the MV spec as well. I’m hoping we can proceed to a
            vote next week.

            Here is a summary in case that helps. The proposal
            outlines a strategy for handling table identifiers in
            Iceberg view metadata, with the goal of ensuring
            correctness, portability, and engine compatibility. It
            recommends resolving table identifiers at read time (late
            binding) rather than creation time, and introduces
            UUID-based validation to maintain identity guarantees
            across engines or sessions. It also revises how
            default-catalog and default-namespace are handled
            (defaulting both to the session context if not explicitly
            set) to better align with engine behavior and improve
            cross-engine interoperability.

            Please let me know your thoughts.

            Thanks,
            Walaa.



            On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa
            <wa.moust...@gmail.com> wrote:

                Thanks Eduard and Sung! I have addressed the comments.

                One key point to keep in mind is that catalog names
                in the spec refer to logical catalogs—i.e., the first
                part of a three-part identifier. These correspond to
                Spark's DataSourceV2 catalogs, Trino connectors, and
                similar constructs. This is a level of abstraction
                above physical catalogs, which are not referenced or
                used in the view spec. The reason is that table
                identifiers in the view definition/text itself refer
                to logical catalogs, not physical ones (since they
                interface directly with the engine and not a specific
                metastore).

                Thanks,
                Walaa.


                On Wed, Apr 16, 2025 at 6:15 AM Sung Yun
                <sungwy...@gmail.com> wrote:

                    Thank you Walaa for the proposal. I think view
                    portability is a very important topic for us to
                    continue discussing as it relies on many
                    assumptions within the data ecosystem for it to
                    function like you've highlighted well in the
                    document.

                    I've added a few comments around how this may
                    impact the permission questions the engines will
                    be asking, and whether that is the desired behavior.

                    Sung

                    On Wed, Apr 16, 2025 at 7:32 AM Eduard
                    Tudenhöfner <etudenhoef...@apache.org> wrote:

                        Thanks Walaa for tackling this problem. I've
                        added a few comments to get a better
                        understanding of what this will look like in
                        the actual implementation.

                        Eduard

                        On Tue, Apr 15, 2025 at 7:09 PM Walaa Eldin
                        Moustafa <wa.moust...@gmail.com> wrote:

                            Hi Everyone,

                            Starting this thread to resume our
                            discussion on how to reference table
                            identifiers from Iceberg metadata, a key
                            aspect of the view specification,
                            particularly in relation to the MV
                            (materialized view) extensions.

                            I had the chance to speak offline with a
                            few community members to better
                            understand how the current spec is being
                            interpreted. Those conversations served
                            as inputs to a new proposal on how table
                            identifier references could be
                            represented in metadata.

                            You can find the proposal here [1]. I
                            look forward to your feedback and working
                            together to move this forward so we can
                            finalize the MV spec as well.

                            [1]
                            https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0

                            Thanks,
                            Walaa.
