Hi all (cross-posting to Iceberg and Spark dev lists),

I'd like to gather input from both communities on an architectural decision
for the Spark implementation of Iceberg Materialized Views.

*Background*

The Iceberg community has been working on a Materialized View specification
(https://github.com/apache/iceberg/pull/11041), and the design is now
directionally aligned among contributors active on the spec PR. We are now
devising the Spark implementation as the first end-to-end realization of
the spec. A core engine concern is read-time routing: when a query
references an MV, the engine must decide whether to read the precomputed
storage table (when the MV is fresh) or evaluate the view query (when
stale). Both architectures have been implemented; the PRs and trade-offs
are laid out below.


*1. DSv2 catalog-based routing: *https://github.com/apache/iceberg/pull/9830

The catalog itself decides routing. SparkCatalog.loadTable returns a
wrapper that reads from the storage table when fresh; loadView indicates
"use loadTable instead" when fresh.

Pros:
- Routing decision lives at the catalog boundary, where the View ↔ storage
Table relationship is already modeled.
- Aligns with Spark 4.2's new RelationCatalog.loadRelation API [1], which
lets a catalog return either a Table or View for the same identifier. MV
routing fits this model naturally, and the current pre-4.2 indirection
(loadView throws / loadTable returns a wrapper) can collapse to a single
loadRelation call.

Cons:
- Freshness evaluation runs inside the view catalog. When MV dependencies
span multiple catalogs, loading dependency metadata from another catalog
doesn't compose cleanly inside one catalog's load path.
- Pre-4.2, routing requires the indirection above.


*2. Catalyst analyzer-based routing:*
https://github.com/wmoustafa/iceberg-1/pull/2

A new Catalyst rule (ResolveMaterializedViews) rewrites UnresolvedRelation
for a fresh MV to the storage table identifier; Spark's normal table
resolution loads it. Stale MVs fall through to the existing view-expansion
rule.

Pros:
- Sits above the catalog layer, so freshness checks that need to load
dependency metadata from other catalogs are a natural fit. The analyzer
rule has access to the full catalog manager.
- Works across Spark 3.x and 4.x without depending on a specific DSv2 API
shape.

Cons:
- Doesn't leverage loadRelation where DSv2 is evolving.

*The crux of the trade-off:* the DSv2 approach aligns with the direction of
Spark's DSv2 surface (especially 4.2's loadRelation), while the analyzer
approach accommodates cross-catalog freshness checks more naturally.

Feedback welcome from either community, particularly anyone who has thought
about MVs, or planned loadRelation-based integration patterns.

Thanks,
Walaa.

[1]
https://github.com/apache/spark/blob/e754420f1c43423cb865adcc840e1d3111f3ef3b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/RelationCatalog.java#L123-L135

Reply via email to