wmoustafa opened a new pull request, #9830:
URL: https://github.com/apache/iceberg/pull/9830
## Spec
This patch adds support for materialized views in Iceberg and integrates the
implementation with Spark SQL. It reuses the current spec of Iceberg views and
tables by leveraging table properties to capture materialized view metadata.
Those properties can be added to the Iceberg spec to formalize materialized
view support.
Below is a summary of all metadata properties introduced or utilized by this
patch, classified based on whether they are associated with a table or a view,
along with their purposes:
### Properties on a View:
1. **`iceberg.materialized.view`**:
- **Type**: View property
- **Purpose**: This property is used to mark whether a view is a
materialized view. If set to `true`, the view is treated as a materialized
view. This helps in differentiating between virtual and materialized views
within the catalog and dictates specific handling and validation logic for
materialized views.
2. **`iceberg.materialized.view.storage.table`**:
- **Type**: View property
- **Purpose**: Specifies the identifier of the storage table associated
with the materialized view. This property is used for linking a materialized
view with its corresponding storage table, enabling data management and query
execution based on the stored data freshness.
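
As a minimal sketch, the two view properties above could be assembled at view-creation time like this (the class and method names here are illustrative, not part of the patch):

```java
import java.util.HashMap;
import java.util.Map;

public class MaterializedViewProperties {
  // Property keys introduced by this patch.
  static final String MATERIALIZED_VIEW = "iceberg.materialized.view";
  static final String STORAGE_TABLE = "iceberg.materialized.view.storage.table";

  // Builds the view properties that mark a view as materialized and link it
  // to its storage table. The storage table identifier is caller-provided.
  static Map<String, String> materializedViewProps(String storageTableIdentifier) {
    Map<String, String> props = new HashMap<>();
    props.put(MATERIALIZED_VIEW, "true");
    props.put(STORAGE_TABLE, storageTableIdentifier);
    return props;
  }
}
```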
### Properties on a Table:
1. **`iceberg.base.snapshot.[UUID]`**:
- **Type**: Table property
   - **Purpose**: These properties store the snapshot IDs of the base
tables at the time the materialized view's data was last updated. Each key
consists of the prefix `iceberg.base.snapshot.` followed by the UUID of the
base table. They are used to track whether the materialized view's data is up
to date by comparing the recorded snapshot IDs with the base tables' current
snapshot IDs. If every base table's current snapshot ID matches the one stored
in these properties, the materialized view's data is considered fresh.
2. **`iceberg.materialized.view.version`**:
- **Type**: Table property
- **Purpose**: This property tracks the parent view version ID when the
storage table is created (or refreshed). The table is usable only when the view
version ID property matches the current parent view version ID.
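
The freshness check described above can be sketched as a comparison between the recorded and current snapshot IDs. This is a minimal sketch under stated assumptions: the helper class, method names, and plain-`Map` inputs are illustrative stand-ins, not Iceberg's actual API:

```java
import java.util.Map;

public class FreshnessCheck {
  // Prefix of the per-base-table snapshot properties on the storage table.
  static final String BASE_SNAPSHOT_PREFIX = "iceberg.base.snapshot.";

  // A materialized view is fresh when, for every base table UUID recorded in
  // the storage table's properties, the recorded snapshot ID equals that base
  // table's current snapshot ID.
  static boolean isFresh(Map<String, String> storageTableProps,
                         Map<String, Long> currentSnapshotsByUuid) {
    for (Map.Entry<String, String> e : storageTableProps.entrySet()) {
      if (!e.getKey().startsWith(BASE_SNAPSHOT_PREFIX)) {
        continue;  // skip unrelated table properties
      }
      String baseTableUuid = e.getKey().substring(BASE_SNAPSHOT_PREFIX.length());
      Long current = currentSnapshotsByUuid.get(baseTableUuid);
      if (current == null || current != Long.parseLong(e.getValue())) {
        return false;  // base table advanced (or disappeared) since last refresh
      }
    }
    return true;
  }
}
```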
## Spark SQL
This patch introduces support for materialized views in the Spark module by
adding support for Spark SQL `CREATE MATERIALIZED VIEW` and adding materialized
view handling for the `DROP VIEW` DDL command. When a `CREATE MATERIALIZED
VIEW` command is executed, the patch interprets the command to create a new
materialized view, which involves not only registering the view's metadata
(including marking it as a materialized view with the appropriate properties)
but also setting up a corresponding storage table to hold the materialized data
and recording the base tables' current snapshot IDs at creation time. The
storage table identifier is passed via a new `STORED AS '...'` clause; if no
`STORED AS` clause is specified, a default storage table identifier is
assigned. When a
`DROP VIEW` command is issued for a materialized view, the patch ensures that
both the metadata for the materialized view and its associated storage table
are properly removed from the catalog. Support for `REFRESH MATERIALIZED VIEW`
is left as a future enhancement.
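
Assuming the syntax described above, a session might look like the following sketch; the view name, query, and storage table identifier are all hypothetical:

```sql
-- Create a materialized view whose data is kept in an explicit storage table.
CREATE MATERIALIZED VIEW db.daily_totals
STORED AS 'db.daily_totals_storage'
AS SELECT order_date, SUM(amount) AS total
   FROM db.orders
   GROUP BY order_date;

-- Dropping the materialized view also removes its storage table.
DROP VIEW db.daily_totals;
```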
## Spark Catalog
This patch enhances the `SparkCatalog` to intelligently decide whether to
return the view text metadata for a materialized view or the data from its
associated storage table based on the freshness of the materialized view.
Within the `loadTable` method, the patch first checks if the requested table
corresponds to a materialized view by loading the view from the Iceberg
catalog. If the identified view is marked as a materialized view (using the
`iceberg.materialized.view` property), the patch then assesses its freshness.
If it is fresh, the `loadTable` method proceeds to load and return the storage
table associated with the materialized view, allowing users to query the
pre-computed data directly. However, if the materialized view is stale, the
method simply returns to allow `SparkCatalog`'s `loadView` to run. In turn,
`loadView` returns the metadata for the virtual view itself, triggering the
usual Spark view logic that computes the result set based on the current state
of the base tables.
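
The routing described above can be sketched as follows. The nested `View` interface and `resolve` helper are stand-ins for illustration, not `SparkCatalog`'s actual API; signaling "no such table" stands in for the fall-through to `loadView`:

```java
import java.util.Map;
import java.util.NoSuchElementException;

public class LoadTableSketch {
  // Stand-in for a catalog view exposing its properties.
  interface View {
    Map<String, String> properties();
  }

  static boolean isMaterializedView(View view) {
    return "true".equals(view.properties().get("iceberg.materialized.view"));
  }

  // Returns the storage table identifier when the view is a fresh
  // materialized view; otherwise signals "no such table" so the caller
  // falls through to loadView and ordinary view expansion.
  static String resolve(View view, boolean fresh) {
    if (isMaterializedView(view) && fresh) {
      return view.properties().get("iceberg.materialized.view.storage.table");
    }
    throw new NoSuchElementException("not a fresh materialized view; use loadView");
  }
}
```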
## Notes
* This patch intentionally avoids introducing new Iceberg or engine object
APIs. The intention is to start a discussion on whether such APIs are required,
and which objects would best model them, since each choice involves different
trade-offs.
* The `InMemoryCatalog` has been extended to use a test `LocalFileIO` because
a pure `InMemoryCatalog` (with `InMemoryFileIO`) has an existing gap when
working with data files, which the storage table requires. Extending the
`InMemoryCatalog` to use `LocalFileIO` ended up promoting a couple of methods
to `public`, but the intention is again to start a discussion about the best
way to address the current gap.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]