JanKaul commented on code in PR #11041: URL: https://github.com/apache/iceberg/pull/11041#discussion_r2635661174
##########
format/view-spec.md:
##########
@@ -160,6 +178,108 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |
 
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _required_  | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view's precomputed data becomes stale as the tables and views referenced in its query definition change over time. Freshness determines whether the precomputed data accurately represents the logical query definition at the current state of its dependencies.
+
+Different systems define freshness differently, based on how much of the dependency graph must be current. Some require the entire query tree to be fully up to date, while others only require direct children or allow bounded staleness at leaf nodes. As a result, "fresh" can mean strict end-to-end consistency, acceptable lag, or policy/version compliance.
+
+A materialized view is considered fresh when its precomputed data meets the freshness criteria defined by the consumer's evaluation policy. When these criteria are not met, the materialized view is considered stale.
+
+#### Refresh state
+
+The refresh state record captures the unique dependencies in the materialized view's dependency graph. These dependencies include source Iceberg tables, views, and nested materialized views that allow a consumer to determine the freshness of the materialized view.
+
+**Producer responsibilities:**
+- The producer of the storage table must provide a sufficient list of source states so that consumers can determine freshness according to the producer's interpretation.
+- The source states list may be empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables).
+
+**Consumer evaluation:**
+- The consumer must at least perform a coarse-grained evaluation based on `refresh-start-timestamp-ms` and `max-staleness-ms`. A materialized view is fresh if `refresh-start-timestamp-ms` is within the window `[now - max-staleness-ms, now]`.
+- The consumer may additionally compare the `source-states` list against the states loaded from the catalog. If this evaluation determines the materialized view is fresh, it overrides the coarse-grained evaluation result.
+- The consumer may parse the view definition to implement a more sophisticated policy.
+- When a materialized view is considered stale, the consumer can fail, refresh inline, or treat the materialized view as a logical view. The consumer must not consume from the storage table when the materialized view is stale.
+
+The refresh state has the following fields:
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `view-version-id`            | The `version-id` of the materialized view when the refresh operation was performed |
+| _required_  | `source-states`              | A list of [source states](#source-state) records |
+| _required_  | `refresh-start-timestamp-ms` | A timestamp of when the refresh operation was started |
+
+#### Source state
+
+Materialized views can reference source objects of different types, such as Iceberg tables, view, and materialized views. Source state records have a common field `type` that determines the form, which can be one of the following:
+
+* `table`: An Iceberg table
+* `view`: An Iceberg view
+* `materialized-view`: An Iceberg materialized view
+
+The metadata fields for each type are defined below:
+
+#### Source table state
+
+A source table record captures the state of a source table (including source MV's storage table) at the time of the last refresh operation.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `type`         | A string that must be set to `table` |

Review Comment:

Regarding your first point: the purpose of `refresh-state` is to determine the freshness of an MV. If an MV depends on two separate snapshots through other nested MVs, it is always the older snapshot that is critical for `max-staleness-ms` and that renders the MV stale. The producer of a storage table has to parse the SQL anyway, so it can determine the lineage itself if needed.

Your second point is a major issue. Maybe we should use the `sequence-number` of the snapshot instead of the `snapshot-id`.
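To make the first point concrete, here is a minimal Python sketch of the coarse-grained check quoted in the diff (fresh if `refresh-start-timestamp-ms` lies within `[now - max-staleness-ms, now]`), extended to several dependency paths. The function names and the idea of passing one timestamp per path are assumptions made for illustration; they are not part of the spec text or of any Iceberg API.

```python
from typing import Iterable


def is_fresh(refresh_start_timestamp_ms: int, max_staleness_ms: int, now_ms: int) -> bool:
    """Coarse-grained check: fresh if the refresh started within
    the window [now - max_staleness_ms, now]."""
    return now_ms - max_staleness_ms <= refresh_start_timestamp_ms <= now_ms


def is_fresh_across_paths(
    timestamps_ms: Iterable[int], max_staleness_ms: int, now_ms: int
) -> bool:
    """Hypothetical helper: when the same source is reached through several
    nested MVs, only the oldest timestamp can push the MV out of the window,
    so it alone decides the result. Raises ValueError on an empty input."""
    oldest = min(timestamps_ms)
    return is_fresh(oldest, max_staleness_ms, now_ms)
```

With `now_ms=1400` and `max_staleness_ms=300`, the timestamps `[1000, 1300]` are judged stale because of the older entry, regardless of how recent the newer one is.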
