wmoustafa opened a new pull request, #9830:
URL: https://github.com/apache/iceberg/pull/9830
## Spec
This patch adds support for materialized views in Iceberg and integrates the
implementation with Spark SQL. It reuses the current spec of Iceberg views and
tables by leveraging table properties to capture materialized view metadata.
Those properties can be added to the Iceberg spec to formalize materialized
view support.
Below is a summary of all metadata properties introduced or utilized by this
patch, classified based on whether they are associated with a table or a view,
along with their purposes:
### Properties on a View:
1. **`iceberg.materialized.view`**:
- **Type**: View property
- **Purpose**: This property is used to mark whether a view is a
materialized view. If set to `true`, the view is treated as a materialized
view. This helps in differentiating between virtual and materialized views
within the catalog and dictates specific handling and validation logic for
materialized views.
2. **`iceberg.materialized.view.storage.table`**:
- **Type**: View property
- **Purpose**: Specifies the identifier of the storage table associated
with the materialized view. This property is used for linking a materialized
view with its corresponding storage table, enabling data management and query
execution based on the stored data freshness.
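
As a minimal sketch, the two view properties above could be assembled at view-creation time like this (the class and method names here are illustrative, not part of the patch):

```java
import java.util.HashMap;
import java.util.Map;

public class MaterializedViewProperties {
  // Property keys introduced by this patch.
  static final String MATERIALIZED_VIEW = "iceberg.materialized.view";
  static final String STORAGE_TABLE = "iceberg.materialized.view.storage.table";

  // Builds the view properties that mark a view as materialized and link it
  // to its storage table. The storage table identifier is caller-provided.
  static Map<String, String> materializedViewProps(String storageTableIdentifier) {
    Map<String, String> props = new HashMap<>();
    props.put(MATERIALIZED_VIEW, "true");
    props.put(STORAGE_TABLE, storageTableIdentifier);
    return props;
  }
}
```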
### Properties on a Table:
1. **`iceberg.base.snapshot.[UUID]`**:
- **Type**: Table property
   - **Purpose**: These properties store the snapshot IDs of the base
tables at the time the materialized view's data was last updated. Each key
consists of the prefix `iceberg.base.snapshot.` followed by the UUID of the
base table. They are used to track whether the materialized view's data is up
to date by comparing the recorded snapshot IDs with the base tables' current
snapshot IDs. If every base table's current snapshot ID matches the one stored
in these properties, the materialized view's data is considered fresh.
2. **`iceberg.materialized.view.version`**:
- **Type**: Table property
- **Purpose**: This property tracks the parent view version ID when the
storage table is created (or refreshed). The table is usable only when the view
version ID property matches the current parent view version ID.
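
The freshness check described above can be sketched as a comparison between the recorded and current snapshot IDs. This is a minimal sketch under stated assumptions: the helper class, method names, and plain-`Map` inputs are illustrative stand-ins, not Iceberg's actual API:

```java
import java.util.Map;

public class FreshnessCheck {
  // Prefix of the per-base-table snapshot properties on the storage table.
  static final String BASE_SNAPSHOT_PREFIX = "iceberg.base.snapshot.";

  // A materialized view is fresh when, for every base table UUID recorded in
  // the storage table's properties, the recorded snapshot ID equals that base
  // table's current snapshot ID.
  static boolean isFresh(Map<String, String> storageTableProps,
                         Map<String, Long> currentSnapshotsByUuid) {
    for (Map.Entry<String, String> e : storageTableProps.entrySet()) {
      if (!e.getKey().startsWith(BASE_SNAPSHOT_PREFIX)) {
        continue;  // skip unrelated table properties
      }
      String baseTableUuid = e.getKey().substring(BASE_SNAPSHOT_PREFIX.length());
      Long current = currentSnapshotsByUuid.get(baseTableUuid);
      if (current == null || current != Long.parseLong(e.getValue())) {
        return false;  // base table advanced (or disappeared) since last refresh
      }
    }
    return true;
  }
}
```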
## Spark SQL
This patch introduces support for materialized views in the Spark module by
adding support for Spark SQL `CREATE MATERIALIZED VIEW` and adding materialized
view handling for the `DROP VIEW` DDL command. When a `CREATE MATERIALIZED
VIEW` command is executed, the patch interprets the command to create a new
materialized view, which involves not only registering the view's metadata
(including marking it as a materialized view with the appropriate properties)
but also setting up a corresponding storage table to hold the materialized data
and recording the base tables' current snapshot IDs at creation time. The
storage table identifier is passed via a new `STORED AS '...'` clause; if no
`STORED AS` clause is specified, a default storage table identifier is
assigned. When a
`DROP VIEW` command is issued for a materialized view, the patch ensures that
both the metadata for the materialized view and its associated storage table
are properly removed from the catalog. Support for `REFRESH MATERIALIZED VIEW`
is left as a future enhancement.
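
Assuming the syntax described above, a session might look like the following sketch; the view name, query, and storage table identifier are all hypothetical:

```sql
-- Create a materialized view whose data is kept in an explicit storage table.
CREATE MATERIALIZED VIEW db.daily_totals
STORED AS 'db.daily_totals_storage'
AS SELECT order_date, SUM(amount) AS total
   FROM db.orders
   GROUP BY order_date;

-- Dropping the materialized view also removes its storage table.
DROP VIEW db.daily_totals;
```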
## Spark Catalog
This patch enhances the `SparkCatalog` to intelligently decide whether to
return the view text metadata for a materialized view or the data from its
associated storage table based on the freshness of the materialized view.
Within the `loadTable` method, the patch first checks if the requested table
corresponds to a materialized view by loading the view from the Iceberg
catalog. If the identified view is marked as a materialized view (using the
`iceberg.materialized.view` property), the patch then assesses its freshness.
If it is fresh, the `loadTable` method proceeds to load and return the storage
table associated with the materialized view, allowing users to query the
pre-computed data directly. However, if the materialized view is stale, the
method simply returns to allow `SparkCatalog`'s `loadView` to run. In turn,
`loadView` returns the metadata for the virtual view itself, triggering the
usual Spark view logic that computes the result set based on the current state
of the base tables.
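
The routing described above can be sketched as follows. The nested `View` interface and `resolve` helper are stand-ins for illustration, not `SparkCatalog`'s actual API; signaling "no such table" stands in for the fall-through to `loadView`:

```java
import java.util.Map;
import java.util.NoSuchElementException;

public class LoadTableSketch {
  // Stand-in for a catalog view exposing its properties.
  interface View {
    Map<String, String> properties();
  }

  static boolean isMaterializedView(View view) {
    return "true".equals(view.properties().get("iceberg.materialized.view"));
  }

  // Returns the storage table identifier when the view is a fresh
  // materialized view; otherwise signals "no such table" so the caller
  // falls through to loadView and ordinary view expansion.
  static String resolve(View view, boolean fresh) {
    if (isMaterializedView(view) && fresh) {
      return view.properties().get("iceberg.materialized.view.storage.table");
    }
    throw new NoSuchElementException("not a fresh materialized view; use loadView");
  }
}
```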
## Notes
* This patch intentionally avoids introducing new Iceberg or engine object
APIs. The intention is to start a discussion on whether such APIs are required,
and which objects would best model them, since each choice involves different
trade-offs.
* The `InMemoryCatalog` has been extended to use a test `LocalFileIO` because
a pure `InMemoryCatalog` (with `InMemoryFileIO`) has an existing gap when
working with data files, which the storage table requires. Extending the
`InMemoryCatalog` to use `LocalFileIO` ended up promoting a couple of methods
to `public`, but the intention is again to start a discussion about the best
way to address the current gap.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]