stevenzwu commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r2586956236

##########
format/view-spec.md:
##########
@@ -63,11 +79,13 @@ The view version metadata file has the following fields:
 | _required_ | `versions` | A list of known [versions](#versions) of the view [1] |
 | _required_ | `version-log` | A list of [version log](#version-log) entries with the timestamp and `version-id` for every change to `current-version-id` |
 | _optional_ | `properties` | A string to string map of view properties [2] |
+| _optional_ | `max-staleness-ms` | The maximum time interval in milliseconds after a refresh operation during which the materialized view's data is considered fresh [3] |

 Notes:

 1. The number of versions to retain is controlled by the view property: `version.history.num-entries`.
 2. Properties are used for metadata such as `comment` and for settings that affect view maintenance. This is not intended to be used for arbitrary metadata.
+3. The `max-staleness-ms` field only applies to materialized views and must be set to `null` for common views. If `max-staleness-ms` is not `null` and the time elapsed since the last refresh operation is less than `max-staleness-ms`, the query engine may return data directly from the `storage-table` without evaluating freshness based on the source tables and views. If `max-staleness-ms` is `null` for a materialized view, the data in the `storage-table` is always considered fresh.

Review Comment:
   > the time elapsed since the last refresh operation

   I think this is not aligned with what we discussed in the meeting. The consensus was the delayed view semantics that Igor brought up.
   ```
   A Materialized View (MV) is considered fresh if and only if the precomputed result from the storage table is equivalent to what would have been obtained by running the MV's defining query at some point in time within the interval of [CurrentTime - MaxStaleness, CurrentTime]
   ```

   Think about the example from the discussion:
   ```
   How should the staleness time be calculated?
   1. Start from the time when the last refresh operation was started.
   2. Start from the earliest snapshot timestamp from any source table with snapshots added after the last MV refresh.

   E.g., let’s assume the config is 60 mins. 10:00 is when the last refresh operation was started. 10:45 is the time when a source table has a new snapshot added. At 11:15, is the MV still fresh? With option 1, it is stale. With option 2, it is fresh.
   ```

   The current wording is option 1. The delayed view semantics are essentially option 2.

##########
format/view-spec.md:
##########
@@ -42,12 +42,28 @@ An atomic swap of one view metadata file for another provides the basis for maki
 Writers create view metadata files optimistically, assuming that the current metadata location will not be changed before the writer's commit. Once a writer has created an update, it commits by swapping the view's metadata file pointer from the base location to the new location.

+### Materialized Views
+
+Materialized views are a type of view with precomputed results from the view query stored as a table.
+When queried, engines may return the precomputed data for the materialized views, shifting the cost of query execution to the precomputation step.
+
+Iceberg materialized views are implemented as a combination of an Iceberg view and an underlying Iceberg table, the "storage-table", which stores the precomputed data.
+Materialized View metadata is a superset of View metadata with an additional pointer to the storage table. The storage table is an Iceberg table with additional materialized view refresh state metadata.
+Refresh metadata contains information about the "source tables" and/or "source views", which are the tables/views referenced in the query definition of the materialized view.
+During read time, a materialized view (storage table) can be interpreted as "fresh", "stale" or "invalid", depending on the following situations:
+* **fresh** -- The `snapshot_id`s of the last refresh operation match the current `snapshot_id`s of all the source tables.
+* **stale** -- The `snapshot_id`s do not match for at-least one source table, indicating that a refresh operation needs to be performed to capture the latest source table changes.

Review Comment:
   This sentence is no longer accurate with the `max-staleness-ms` config. Should we create a dedicated section (like `status interpretation`) at the end to cover the status interpretation (fresh, stale, invalid) after all the concepts have been introduced, including refresh-state?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
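
The two ways of measuring staleness contrasted in the first review comment can be sketched as follows. This is only an illustration of the 60-minute example from the discussion, not spec text or Iceberg API: the function names, parameters, and the calendar date are hypothetical, and the strict "less than" comparison follows the wording of the proposed `max-staleness-ms` note.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_fresh_option_1(now: datetime,
                      last_refresh_start: datetime,
                      max_staleness: timedelta) -> bool:
    """Option 1: measure staleness from the start of the last refresh."""
    return now - last_refresh_start < max_staleness


def is_fresh_option_2(now: datetime,
                      earliest_new_source_snapshot: Optional[datetime],
                      max_staleness: timedelta) -> bool:
    """Option 2: measure staleness from the earliest source snapshot
    added after the last refresh (hypothetical helper)."""
    if earliest_new_source_snapshot is None:
        # No source table changed since the refresh, so the stored
        # result still matches the defining query.
        return True
    return now - earliest_new_source_snapshot < max_staleness


# Numbers from the discussion; the date itself is arbitrary.
max_staleness = timedelta(minutes=60)
last_refresh_start = datetime(2025, 1, 1, 10, 0)
source_snapshot_added = datetime(2025, 1, 1, 10, 45)
now = datetime(2025, 1, 1, 11, 15)

print(is_fresh_option_1(now, last_refresh_start, max_staleness))     # False (stale)
print(is_fresh_option_2(now, source_snapshot_added, max_staleness))  # True (fresh)
```

Under option 2 the stored result equals what the defining query would have returned at any moment before 10:45, and such a moment falls inside [10:15, 11:15], which is why it lines up with the delayed view semantics quoted above.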
