Hi Jan,
Thanks for the thoughtful feedback.
I think it’s important we clarify a key point before going deeper:
Non-determinism is not caused by session fallback behavior—it’s a
*fundamental limitation of using table identifiers* alone,
regardless of whether we use the current rule, the proposed
fallback to the session’s default catalog, or even early vs. late
binding.
The same fully qualified identifier (e.g.,
catalogA.namespace.table) can resolve to different objects
depending solely on engine-specific routing logic or catalog
aliases. So determinism isn’t guaranteed just because an
identifier is "fully qualified." The only reliable anchor for
identity is the UUID. That’s why the proposed use of UUIDs is not
just a hardening strategy. It’s the actual fix for correctness.
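To make this concrete, here is a minimal Python sketch. It is purely illustrative: the backends, the alias maps, and the `resolve` helper are hypothetical names, not Iceberg or engine APIs. Two engines map the same alias `catalogA` to different backends, so the same fully qualified identifier yields different tables, and only the UUID comparison exposes the difference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    uuid: str
    location: str


# Hypothetical physical backends, each holding a table at namespace.table.
BACKEND_HMS = {("namespace", "table"): Table("uuid-1111", "hdfs://warehouse/namespace/table")}
BACKEND_REST = {("namespace", "table"): Table("uuid-2222", "s3://bucket/namespace/table")}

# Engine-specific routing (assumed): the same alias points at different backends.
ENGINE_1_CATALOGS = {"catalogA": BACKEND_HMS}
ENGINE_2_CATALOGS = {"catalogA": BACKEND_REST}


def resolve(engine_catalogs, identifier):
    """Resolve a three-part identifier using one engine's catalog routing."""
    catalog, namespace, name = identifier.split(".")
    return engine_catalogs[catalog][(namespace, name)]


identifier = "catalogA.namespace.table"
t1 = resolve(ENGINE_1_CATALOGS, identifier)
t2 = resolve(ENGINE_2_CATALOGS, identifier)

# The identifier is identical in both engines; only the UUIDs reveal the mismatch.
same_object = t1.uuid == t2.uuid
print(same_object)  # False
```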
To move the conversation forward, could you help clarify two
things in the context of the current spec:
* Where in the metadata is the “view catalog” stored, so that an
engine knows to fall back to it if default-catalog is null?
* Are we even allowed to create views in the session's default
catalog (i.e., without specifying a catalog) in the current
Iceberg spec?
These questions are important because if we can’t unambiguously
recover the "view catalog" from metadata, then defaulting to it
is problematic. And if views can't be created in the default
catalog, then the fallback rule doesn’t generalize.
Thanks,
Walaa.
On Tue, Apr 22, 2025 at 3:14 AM Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Hi Walaa,
thank you for your proposal. If I understood correctly, your
proposal is composed of three parts:
- session default catalog as fallback for "default-catalog"
- session default namespace as fallback for "default-namespace"
- Late binding + UUID validation
I have some comments regarding these points.
1. Session default catalog as fallback for
"default-catalog"
Introducing a behavior that depends on the current session
setup is in my opinion the definition of "non-determinism".
You could be running the same query engine and catalog setup
on different days, with different default session catalogs
(which is rather common), and get different results.
With the current behavior, by contrast, the view always produces
the same results. The current behavior has some rough edges
in very niche use cases, but I think it is solid for most use cases.
2. Session default namespace as fallback for
"default-namespace"
Similar to the above.
3. Late binding + UUID validation
If I understand it correctly, the current implementation
already uses late binding.
Generally, having UUID validation makes the setup more
robust, which is great. However, UUID validation still
requires us to have a portable table identifier
specification. Even if we have the UUIDs of the referenced
tables from the view, there simply isn't an interface that
lets us use those UUIDs. The catalog interface is defined in
terms of table identifiers.
So we always require a working catalog setup and suitable
table identifiers to obtain the table metadata. We can use
the UUIDs to verify that we loaded the correct table, but this
can only be done after we have used some identifier. This means
there is no way of using UUIDs without a functioning
catalog/identifier setup.
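A minimal sketch of this ordering, assuming a hypothetical catalog object whose only lookup is `load_table(identifier)` and an expected UUID taken from the view metadata (none of these names come from the spec): the identifier lookup has to succeed first, and the UUID can only confirm or reject the result afterwards.

```python
class TableNotFound(Exception):
    """Raised when the identifier does not resolve in the catalog."""


class UUIDMismatch(Exception):
    """Raised when the loaded table is not the one the view expects."""


class InMemoryCatalog:
    """Hypothetical stand-in for a catalog keyed by table identifier."""

    def __init__(self, tables):
        self._tables = tables  # maps identifier -> table UUID

    def load_table(self, identifier):
        try:
            return {"identifier": identifier, "uuid": self._tables[identifier]}
        except KeyError:
            raise TableNotFound(identifier) from None


def load_and_validate(catalog, identifier, expected_uuid):
    # Step 1: the lookup itself requires a working catalog and identifier.
    table = catalog.load_table(identifier)
    # Step 2: only now can the UUID be used, and only as a check.
    if table["uuid"] != expected_uuid:
        raise UUIDMismatch(
            f"{identifier}: expected {expected_uuid}, loaded {table['uuid']}"
        )
    return table
```

In this sketch the UUID never drives the lookup; a wrong or unresolvable identifier fails in step 1 before any UUID comparison happens.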
In conclusion, I prefer the current behavior for
"default-catalog" because it is more deterministic in my
opinion. And I think the current spec does a good job for
multi-engine table identifier resolution. I see the UUID
validation more as an additional hardening strategy.
Thanks
Jan
On 4/21/25 17:38, Walaa Eldin Moustafa wrote:
Thanks Renjie!
The existing spec has some guidance on resolving catalogs on
the fly already (to address the case of view text with table
identifiers missing the catalog part). The guidance is to
use the catalog where the view is stored. But I find this
rule hard to interpret or use. The catalog itself is a
logical construct—such as a federated catalog that delegates
to multiple physical backends (e.g., HMS and REST). In such
cases, the catalog (e.g., `my_catalog` in
`my_catalog.namespace1.table1`) doesn’t physically store the
tables; it only routes requests to underlying stores.
Therefore, defaulting identifier resolution based on the
catalog where the view is "stored" doesn’t align with how
catalogs actually behave in practice.
Thanks,
Walaa.
On Sun, Apr 20, 2025 at 11:17 PM Renjie Liu
<liurenjie2...@gmail.com> wrote:
Hi, Walaa:
Thanks for the proposal.
I've reviewed the doc, but in general I have some
concerns with resolving catalog names on the fly with
query engine defined catalog names. This introduces some
flexibility at first glance, but also makes
misconfiguration difficult to explain.
But I agree with one part: we should store the resolved
table UUID in view metadata, as table/view renaming may
introduce errors that are difficult for users to understand.
On Sat, Apr 19, 2025 at 3:02 AM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Hi Everyone,
Looking forward to keeping up the momentum and
closing out the MV spec as well. I’m hoping we can
proceed to a vote next week.
Here is a summary in case that helps. The proposal
outlines a strategy for handling table identifiers
in Iceberg view metadata, with the goal of ensuring
correctness, portability, and engine compatibility.
It recommends resolving table identifiers at read
time (late binding) rather than creation time, and
introduces UUID-based validation to maintain
identity guarantees across engines or sessions. It
also revises how default-catalog and
default-namespace are handled (defaulting both to
the session context if not explicitly set) to better
align with engine behavior and improve cross-engine
interoperability.
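As a rough illustration of the fallback described in this summary (a sketch only, not spec text), the helper below fills in missing identifier parts from the view's `default-catalog` and `default-namespace`, falling back to the session context when they are unset. The function name `qualify`, its parameters, and the single-level namespace are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def qualify(
    parts: Sequence[str],
    default_catalog: Optional[str],
    default_namespace: Optional[str],
    session_catalog: str,
    session_namespace: str,
) -> Tuple[str, str, str]:
    """Return (catalog, namespace, table) for a 1-, 2-, or 3-part identifier.

    Assumes a single-level namespace for simplicity.
    """
    # Proposed rule: view-level defaults win; otherwise use the session context.
    catalog = default_catalog if default_catalog is not None else session_catalog
    namespace = default_namespace if default_namespace is not None else session_namespace

    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return (catalog, parts[0], parts[1])
    if len(parts) == 1:
        return (catalog, namespace, parts[0])
    raise ValueError(f"unsupported identifier: {list(parts)!r}")


# With no defaults stored in the view, the session context decides.
print(qualify(["t"], None, None, "spark_catalog", "db"))
# With defaults stored in the view, the session context is ignored.
print(qualify(["t"], "prod", "sales", "spark_catalog", "db"))
```

When the view metadata leaves both defaults unset, the result of `qualify` changes with the session values passed in; when they are set, it does not.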
Please let me know your thoughts.
Thanks,
Walaa.
On Wed, Apr 16, 2025 at 2:03 PM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Thanks Eduard and Sung! I have addressed the
comments.
One key point to keep in mind is that catalog
names in the spec refer to logical
catalogs—i.e., the first part of a three-part
identifier. These correspond to Spark's
DataSourceV2 catalogs, Trino connectors, and
similar constructs. This is a level of
abstraction above physical catalogs, which are
not referenced or used in the view spec. The
reason is that table identifiers in the view
definition/text itself refer to logical
catalogs, not physical ones (since they
interface directly with the engine and not a
specific metastore).
Thanks,
Walaa.
On Wed, Apr 16, 2025 at 6:15 AM Sung Yun
<sungwy...@gmail.com> wrote:
Thank you Walaa for the proposal. I think
view portability is a very important topic
for us to continue discussing, as it relies
on many assumptions within the data
ecosystem in order to function, as you've
highlighted well in the document.
I've added a few comments around how this
may impact the permission questions the
engines will be asking, and whether that is
the desired behavior.
Sung
On Wed, Apr 16, 2025 at 7:32 AM Eduard
Tudenhöfner <etudenhoef...@apache.org> wrote:
Thanks Walaa for tackling this problem.
I've added a few comments to get a
better understanding of what this will
look like in the actual implementation.
Eduard
On Tue, Apr 15, 2025 at 7:09 PM Walaa
Eldin Moustafa <wa.moust...@gmail.com>
wrote:
Hi Everyone,
Starting this thread to resume our
discussion on how to reference table
identifiers from Iceberg metadata, a
key aspect of the view
specification, particularly in
relation to the MV (materialized
view) extensions.
I had the chance to speak offline
with a few community members to
better understand how the current
spec is being interpreted. Those
conversations served as inputs to a
new proposal on how table identifier
references could be represented in
metadata.
You can find the proposal here [1].
I look forward to your feedback and
working together to move this
forward so we can finalize the MV
spec as well.
[1]
https://docs.google.com/document/d/1-I2v_OqBgJi_8HVaeH1u2jowghmXoB8XaJLzPBa_Hg8/edit?tab=t.0
Thanks,
Walaa.