bennychow commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r2627854951


##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |

+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _required_ | `name` | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name | Description |
+|-------------|-----------------|-------------|
+| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. When this occurs, the materialized view (storage table) is considered stale.
+
+Different systems interpret freshness differently, typically based on the objects referenced in the fully expanded query tree of the materialized view. Some systems consider only direct children, others only leaf nodes, and some the entire query tree. The specific interpretation is determined by the producer of the storage table.
+
+#### Refresh state
+
+The refresh state record captures the state of source tables, views, and materialized views at refresh time. It contains a list of directly or indirectly referenced source states that allow a consumer to determine the freshness of the materialized view.

Review Comment:
   Suggestion: The refresh state record captures the **unique dependencies in the materialized view's dependency graph**. These dependencies include source Iceberg tables, views, and **nested** materialized views that allow a consumer to determine the freshness of the materialized view.


##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |

+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _required_ | `name` | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name | Description |
+|-------------|-----------------|-------------|
+| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. When this occurs, the materialized view (storage table) is considered stale.
+
+Different systems interpret freshness differently, typically based on the objects referenced in the fully expanded query tree of the materialized view. Some systems consider only direct children, others only leaf nodes, and some the entire query tree. The specific interpretation is determined by the producer of the storage table.

Review Comment:
   Suggestion: Different systems define freshness differently, **based on how much of the dependency graph must be current**. Some require the entire query tree to be fully up to date, while others only require direct children or allow bounded staleness at leaf nodes. As a result, “fresh” can mean strict end-to-end consistency, acceptable lag, or policy/version compliance.


##########
format/view-spec.md:
##########
@@ -42,12 +42,24 @@ An atomic swap of one view metadata file for another provides the basis for maki
 Writers create view metadata files optimistically, assuming that the current metadata location will not be changed before the writer's commit. Once a writer has created an update, it commits by swapping the view's metadata file pointer from the base location to the new location.

+### Materialized Views
+
+Materialized views are a type of view with precomputed results from the view query stored as a table.
+When queried, engines may return the precomputed data for the materialized views, shifting the cost of query execution to the precomputation step.
+
+Iceberg materialized views are implemented as a combination of an Iceberg view and an underlying Iceberg table, the "storage-table", which stores the precomputed data.
+Materialized View metadata is a superset of View metadata with an additional pointer to the storage table. The storage table is an Iceberg table with additional materialized view refresh state metadata.
+Refresh metadata contains information about the "source tables" and/or "source views", which are the tables/views referenced in the query definition of the materialized view.
+
 ## Specification

 ### Terms

 * **Schema** -- Names and types of fields in a view.
 * **Version** -- The state of a view at some point in time.
+* **Storage table** -- Iceberg table that stores the precomputed data of a materialized view.
+* **Source table** -- A table reference that occurs in the query definition of a materialized view. The materialized view depends on the data from the source tables.
+* **Source view** -- A view reference that occurs in the query definition of a materialized view. The materialized view depends on the definitions from the source views.

Review Comment:
   Suggestion: Add this additional term:
   **Nested materialized view** -- A dependent materialized view that is used in refreshing the current materialized view.


##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |

+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _required_ | `name` | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name | Description |
+|-------------|-----------------|-------------|
+| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. When this occurs, the materialized view (storage table) is considered stale.

Review Comment:
   Suggestion: A materialized view is considered fresh when its precomputed data is usable by consumers. As tables **and views** referenced by a materialized view change over time, the precomputed data may no longer accurately **reflect the materialized view's dependency graph**. When this occurs, the materialized view (storage table) is considered stale.


##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |

+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _required_ | `name` | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name | Description |
+|-------------|-----------------|-------------|
+| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. When this occurs, the materialized view (storage table) is considered stale.
+
+Different systems interpret freshness differently, typically based on the objects referenced in the fully expanded query tree of the materialized view. Some systems consider only direct children, others only leaf nodes, and some the entire query tree. The specific interpretation is determined by the producer of the storage table.
+
+#### Refresh state
+
+The refresh state record captures the state of source tables, views, and materialized views at refresh time. It contains a list of directly or indirectly referenced source states that allow a consumer to determine the freshness of the materialized view.
+
+**Producer responsibilities:**
+- The producer of the storage table must provide a sufficient list of source states so that consumers can determine freshness according to the producer's interpretation.
+- The source states list may be empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables).
+
+**Consumer evaluation:**
+- The consumer must at least perform a coarse-grained evaluation based on `refresh-start-timestamp-ms` and `max-staleness-ms`.
+- The consumer may additionally compare the `source-states` list against the states loaded from the catalog.
+- The consumer trusts that the producer has provided all states necessary to determine freshness.
+
+The refresh state has the following fields:
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `view-version-id` | The `version-id` of the materialized view when the refresh operation was performed |
+| _required_ | `source-states` | A list of [source states](#source-state) records |
+| _required_ | `refresh-start-timestamp-ms` | A timestamp of when the refresh operation was started |
+
+#### Source state
+
+Materialized views can reference source objects of different types, such as Iceberg tables and views. Source state records have a common field `type` that determines the form, which can be one of the following:
+
+* `table`: An Iceberg table
+* `view`: An Iceberg view

Review Comment:
   Discussion: Could we make it easier for the consumer to know whether a producer used a nested MV or not? The producer could separate out the nested MV with an additional type:
   - materialized-view: An Iceberg materialized view used to refresh the current storage table


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
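
To make the last discussion point concrete, here is a minimal Python sketch of how a consumer might read the JSON-encoded `refresh-state` property and tell whether a nested MV was used, if the proposed `materialized-view` type were adopted. Several things here are assumptions rather than anything the spec text defines: the `materialized-view` type exists only as the proposal in this comment, the `name` field on the source state records is purely illustrative, `max_staleness_ms` is passed in as a parameter because the quoted hunk does not say where `max-staleness-ms` is stored, and comparing the refresh age against it is only one possible reading of the "coarse-grained evaluation".

```python
import json
import time
from typing import Optional


def parse_refresh_state(snapshot_summary: dict) -> dict:
    """Decode the JSON-encoded `refresh-state` snapshot summary property."""
    return json.loads(snapshot_summary["refresh-state"])


def uses_nested_mv(refresh_state: dict) -> bool:
    """True if any source state carries the proposed `materialized-view` type."""
    return any(s.get("type") == "materialized-view"
               for s in refresh_state["source-states"])


def within_max_staleness(refresh_state: dict, max_staleness_ms: int,
                         now_ms: Optional[int] = None) -> bool:
    """Coarse-grained check: the refresh started at most max_staleness_ms ago."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    age_ms = now_ms - refresh_state["refresh-start-timestamp-ms"]
    return age_ms <= max_staleness_ms


# Hypothetical snapshot summary; the "name" field is illustrative only.
summary = {
    "refresh-state": json.dumps({
        "view-version-id": 3,
        "refresh-start-timestamp-ms": 1_700_000_000_000,
        "source-states": [
            {"type": "table", "name": "orders"},
            {"type": "materialized-view", "name": "daily_orders"},
        ],
    })
}

state = parse_refresh_state(summary)
print(uses_nested_mv(state))
print(within_max_staleness(state, 60_000, now_ms=1_700_000_030_000))
```

With a dedicated type, the nested-MV check stays a simple filter on `type`; without it, a consumer would presumably have to resolve each `view` entry against the catalog to find out whether it is backed by a storage table.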
