Hi all (cross-posting to Iceberg and Spark dev lists), I'd like to gather input from both communities on an architectural decision for the Spark implementation of Iceberg Materialized Views.
*Background* The Iceberg community has been working on a Materialized View specification (https://github.com/apache/iceberg/pull/11041), and the design is now directionally aligned among contributors active on the spec PR. We are now devising the Spark implementation as the first end-to-end realization of the spec. A core engine concern is read-time routing: when a query references an MV, the engine must decide whether to read the precomputed storage table (when the MV is fresh) or evaluate the view query (when stale). Both architectures have been implemented; the PRs and trade-offs are laid out below. *1. DSv2 catalog-based routing: *https://github.com/apache/iceberg/pull/9830 The catalog itself decides routing. SparkCatalog.loadTable returns a wrapper that reads from the storage table when fresh; loadView indicates "use loadTable instead" when fresh. Pros: - Routing decision lives at the catalog boundary, where the View ↔ storage Table relationship is already modeled. - Aligns with Spark 4.2's new RelationCatalog.loadRelation API [1], which lets a catalog return either a Table or View for the same identifier. MV routing fits this model naturally, and the current pre-4.2 indirection (loadView throws / loadTable returns a wrapper) can collapse to a single loadRelation call. Cons: - Freshness evaluation runs inside the view catalog. When MV dependencies span multiple catalogs, loading dependency metadata from another catalog doesn't compose cleanly inside one catalog's load path. - Pre-4.2, routing requires the indirection above. *2. Catalyst analyzer-based routing:* https://github.com/wmoustafa/iceberg-1/pull/2 A new Catalyst rule (ResolveMaterializedViews) rewrites UnresolvedRelation for a fresh MV to the storage table identifier; Spark's normal table resolution loads it. Stale MVs fall through to the existing view-expansion rule. Pros: - Sits above the catalog layer, so freshness checks that need to load dependency metadata from other catalogs are a natural fit. The analyzer rule has access to the full catalog manager. - Works across Spark 3.x and 4.x without depending on a specific DSv2 API shape. Cons: - Doesn't leverage loadRelation where DSv2 is evolving. *The crux of the trade-off:* the DSv2 approach aligns with the direction of Spark's DSv2 surface (especially 4.2's loadRelation), while the analyzer approach accommodates cross-catalog freshness checks more naturally. Feedback welcome from either community, particularly anyone who has thought about MVs, or planned loadRelation-based integration patterns. Thanks, Walaa. [1] https://github.com/apache/spark/blob/e754420f1c43423cb865adcc840e1d3111f3ef3b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/RelationCatalog.java#L123-L135
