stevenzwu commented on code in PR #11041: URL: https://github.com/apache/iceberg/pull/11041#discussion_r2628240131
########## format/view-spec.md: ########## @@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. +The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table. + +| Requirement | Field name | Description | +|-------------|-----------------|-------------| +| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | + +#### Freshness + +A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. When this occurs, the materialized view (storage table) is considered stale. + +Different systems interpret freshness differently, typically based on the objects referenced in the fully expanded query tree of the materialized view. Some systems consider only direct children, others only leaf nodes, and some the entire query tree. The specific interpretation is determined by the producer of the storage table. 
+ +#### Refresh state + +The refresh state record captures the state of source tables, views, and materialized views at refresh time. It contains a list of directly or indirectly referenced source states that allow a consumer to determine the freshness of the materialized view. + +**Producer responsibilities:** +- The producer of the storage table must provide a sufficient list of source states so that consumers can determine freshness according to the producer's interpretation. +- The source states list may be empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables). + +**Consumer evaluation:** +- The consumer must at least perform a coarse-grained evaluation based on `refresh-start-timestamp-ms` and `max-staleness-ms`. +- The consumer may additionally compare the `source-states` list against the states loaded from the catalog. +- The consumer trusts that the producer has provided all states necessary to determine freshness. Review Comment: I am not sure this bullet point about `trust` is needed. During the meeting, we also discussed giving consumers more flexibility, something along these lines: ``` - The consumer may parse the view definition to implement more sophisticated policy. ``` ########## format/view-spec.md: ########## @@ -42,12 +42,24 @@ An atomic swap of one view metadata file for another provides the basis for maki Writers create view metadata files optimistically, assuming that the current metadata location will not be changed before the writer's commit. Once a writer has created an update, it commits by swapping the view's metadata file pointer from the base location to the new location. +### Materialized Views + +Materialized views are a type of view with precomputed results from the view query stored as a table. +When queried, engines may return the precomputed data for the materialized views, shifting the cost of query execution to the precomputation step. 
+ +Iceberg materialized views are implemented as a combination of an Iceberg view and an underlying Iceberg table, the "storage-table", which stores the precomputed data. +Materialized View metadata is a superset of View metadata with an additional pointer to the storage table. The storage table is an Iceberg table with additional materialized view refresh state metadata. +Refresh metadata contains information about the "source tables" and/or "source views", which are the tables/views referenced in the query definition of the materialized view. + ## Specification ### Terms * **Schema** -- Names and types of fields in a view. * **Version** -- The state of a view at some point in time. +* **Storage table** -- Iceberg table that stores the precomputed data of a materialized view. +* **Source table** -- A table reference that occurs in the query definition of a materialized view. The materialized view depends on the data from the source tables. +* **Source view** -- A view reference that occurs in the query definition of a materialized view. The materialized view depends on the definitions from the source views. Review Comment: For consistency, maybe add: ``` Source materialized view -- A materialized view reference that occurs in the query definition of a materialized view. ``` I actually think we can remove the following part from the bullet points, as I don't see it clarifying anything further: ``` The materialized view depends on the data from the source tables. ``` ########## format/view-spec.md: ########## @@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. 
+ +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. +The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table. + +| Requirement | Field name | Description | +|-------------|-----------------|-------------| +| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | + +#### Freshness + +A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. When this occurs, the materialized view (storage table) is considered stale. + +Different systems interpret freshness differently, typically based on the objects referenced in the fully expanded query tree of the materialized view. Some systems consider only direct children, others only leaf nodes, and some the entire query tree. The specific interpretation is determined by the producer of the storage table. + +#### Refresh state + +The refresh state record captures the state of source tables, views, and materialized views at refresh time. It contains a list of directly or indirectly referenced source states that allow a consumer to determine the freshness of the materialized view. 
+ +**Producer responsibilities:** +- The producer of the storage table must provide a sufficient list of source states so that consumers can determine freshness according to the producer's interpretation. +- The source states list may be empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables). + +**Consumer evaluation:** +- The consumer must at least perform a coarse-grained evaluation based on `refresh-start-timestamp-ms` and `max-staleness-ms`. Review Comment: We probably need to clearly define the evaluation, for example: ``` A materialized view is fresh if `refresh-start-timestamp-ms` is within the window `[now - max-staleness-ms, now]` ``` ########## format/view-spec.md: ########## @@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. +The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table. 
+ +| Requirement | Field name | Description | +|-------------|-----------------|-------------| +| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | + +#### Freshness + +A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. When this occurs, the materialized view (storage table) is considered stale. Review Comment: > A materialized view is considered fresh when its precomputed data is usable by consumers. This definition doesn't capture what should be considered fresh. The later part on consumer responsibilities is effectively what defines what fresh means. ########## format/view-spec.md: ########## @@ -63,11 +75,13 @@ The view version metadata file has the following fields: | _required_ | `versions` | A list of known [versions](#versions) of the view [1] | | _required_ | `version-log` | A list of [version log](#version-log) entries with the timestamp and `version-id` for every change to `current-version-id` | | _optional_ | `properties` | A string to string map of view properties [2] | +| _optional_ | `max-staleness-ms` | The maximum time interval in milliseconds during which changed source table snapshots are considered fresh enough to skip refreshing [3] | Review Comment: The wording here isn't clear about how this config should be used. It also doesn't capture the delayed view semantics that @igorbelianski-cyber mentioned. 
We should remove the `to skip refreshing` part, as consumers can have other fallback behaviors, such as failing or treating the MV as a logical view. ########## format/view-spec.md: ########## @@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. +The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table. + +| Requirement | Field name | Description | +|-------------|-----------------|-------------| +| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | + +#### Freshness Review Comment: We should add a paragraph describing possible consumer behavior if an MV is considered stale. 
``` - Can fail, or refresh inline, or treat MV as a logical view - Mustn’t consume from the storage table ``` ########## format/view-spec.md: ########## @@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. +The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table. + +| Requirement | Field name | Description | +|-------------|-----------------|-------------| +| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | + +#### Freshness + +A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. When this occurs, the materialized view (storage table) is considered stale. + +Different systems interpret freshness differently, typically based on the objects referenced in the fully expanded query tree of the materialized view. Some systems consider only direct children, others only leaf nodes, and some the entire query tree. 
The specific interpretation is determined by the producer of the storage table. + +#### Refresh state + +The refresh state record captures the state of source tables, views, and materialized views at refresh time. It contains a list of directly or indirectly referenced source states that allow a consumer to determine the freshness of the materialized view. + +**Producer responsibilities:** +- The producer of the storage table must provide a sufficient list of source states so that consumers can determine freshness according to the producer's interpretation. +- The source states list may be empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables). + +**Consumer evaluation:** +- The consumer must at least perform a coarse-grained evaluation based on `refresh-start-timestamp-ms` and `max-staleness-ms`. +- The consumer may additionally compare the `source-states` list against the states loaded from the catalog. +- The consumer trusts that the producer has provided all states necessary to determine freshness. + +The refresh state has the following fields: + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `view-version-id` | The `version-id` of the materialized view when the refresh operation was performed | +| _required_ | `source-states` | A list of [source states](#source-state) records | +| _required_ | `refresh-start-timestamp-ms` | A timestamp of when the refresh operation was started | + +#### Source state + +Materialized views can reference source objects of different types, such as Iceberg tables and views. Source state records have a common field `type` that determines the form, which can be one of the following: + +* `table`: An Iceberg table +* `view`: An Iceberg view Review Comment: I am wondering if source MV should be one entry with combined view and storage table status in this single list. It is related to my question on how to load MV in REST catalog. 
Can we hide the storage table from consumers for access control? Could we define access control only at the view level, with the hidden storage table automatically inheriting the view's access policy? That could be a problem for non-REST catalogs, though. If it is a single entry, the source MV can have the following states:
- view-uuid
- view-version-id
- storage-table-uuid
- storage-table-snapshot-id

The `ref` field for source table state is not applicable to the storage table.
########## format/view-spec.md: ########## @@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. +The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table. + +| Requirement | Field name | Description | +|-------------|-----------------|-------------| +| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | + +#### Freshness + +A materialized view is considered fresh when its precomputed data is usable by consumers. As tables referenced by a materialized view change over time, the precomputed data may no longer accurately reflect the logical materialized view definition. 
When this occurs, the materialized view (storage table) is considered stale. + +Different systems interpret freshness differently, typically based on the objects referenced in the fully expanded query tree of the materialized view. Some systems consider only direct children, others only leaf nodes, and some the entire query tree. The specific interpretation is determined by the producer of the storage table. + +#### Refresh state + +The refresh state record captures the state of source tables, views, and materialized views at refresh time. It contains a list of directly or indirectly referenced source states that allow a consumer to determine the freshness of the materialized view. + +**Producer responsibilities:** +- The producer of the storage table must provide a sufficient list of source states so that consumers can determine freshness according to the producer's interpretation. +- The source states list may be empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables). + +**Consumer evaluation:** +- The consumer must at least perform a coarse-grained evaluation based on `refresh-start-timestamp-ms` and `max-staleness-ms`. +- The consumer may additionally compare the `source-states` list against the states loaded from the catalog. Review Comment: We probably lack some detail here. If the coarse-grained evaluation considers the MV stale and this evaluation considers it fresh, the MV should be considered fresh, but that priority is not expressed here.
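To make the priority question in the last review comment concrete, here is a minimal Python sketch of how a consumer might combine the two checks. It is only an illustration under stated assumptions, not spec text or Iceberg API: the `RefreshState` class, the function names, and the simplification of `source-states` to a mapping from a source identifier to a snapshot ID are all hypothetical, and treating an empty `source-states` list as "cannot confirm freshness" is an assumption as well.

```python
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class RefreshState:
    """Hypothetical, simplified stand-in for the `refresh-state` record."""
    view_version_id: int
    refresh_start_timestamp_ms: int
    # Assumption: each source state is reduced to "identifier -> snapshot id".
    source_states: Mapping[str, int] = field(default_factory=dict)


def is_fresh_by_time(state: RefreshState, max_staleness_ms: int, now_ms: int) -> bool:
    """Coarse-grained check: refresh start lies in [now - max-staleness-ms, now]."""
    return now_ms - max_staleness_ms <= state.refresh_start_timestamp_ms <= now_ms


def is_fresh_by_sources(state: RefreshState, current: Mapping[str, int]) -> bool:
    """Fine-grained check: every recorded source still has the recorded snapshot."""
    if not state.source_states:
        # Assumption: an empty list means freshness cannot be confirmed this way.
        return False
    return all(current.get(name) == snap for name, snap in state.source_states.items())


def is_fresh(state: RefreshState, max_staleness_ms: int, now_ms: int,
             current: Mapping[str, int]) -> bool:
    """Either check reporting fresh is enough, so unchanged sources override
    a stale result from the time window."""
    return is_fresh_by_time(state, max_staleness_ms, now_ms) or is_fresh_by_sources(state, current)
```

With this ordering, a materialized view whose refresh is older than the staleness window is still treated as fresh when none of its recorded sources have changed, which is the priority the comment asks the spec to state explicitly.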
