danielcweeks commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r3011325711
##########
format/view-spec.md:
##########
@@ -160,7 +178,122 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _required_ | `name` | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name | Description |
+|-------------|-----------------|-------------|
+| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is "fresh" when the storage table adequately represents the logical query definition of the view.
+Since different systems define freshness differently, it is left to the consumer to evaluate freshness based on its own policy.
+
+**Consumer behavior:**
+
+When evaluating freshness, consumers:
+
+- May apply time-based freshness policies, such as allowing a staleness window based on `refresh-start-timestamp-ms`.
+- May compare the `source-states` list against the states loaded from the catalog to verify the producer's freshness interpretation.
+- May parse the view definition to implement more sophisticated policies.
+- When a materialized view is considered stale, can fail, refresh inline, or treat the materialized view as a logical view.
+- Should not consume the storage table as it is when the materialized view doesn't meet the freshness criteria.
+
+**Producer behavior:**
+
+Producers should provide the necessary information in the [refresh state](#refresh-state) such that consumers can verify the logical equivalence of the precomputed data with the query definition.
+Different producers may have different freshness interpretations, based on how much of the refresh state's dependency graph should be evaluated.
+Some producers expect the entire dependency graph to be evaluated and therefore include source MV dependencies. Other producers may only expect dependencies in the MV's SQL to be evaluated and therefore do not include dependencies of source MVs.
+
+When writing the refresh state, producers:
+
+- Should provide a sufficient list of source states such that consumers can determine freshness according to the producer's interpretation. If the producers interpretation is such that it doesn't rely on the source-states to determine freshness, it may provide an empty list.
+- If the source state cannot be determined for all objects (for example, for non-Iceberg tables) may leave the source states list empty.
+- If a stored object is reachable through multiple paths in the dependency graph (diamond dependency pattern), the entry with the oldest snapshot-id or version-id must be stored.
+
+#### Refresh state
+
+The refresh state record captures the dependencies in the materialized view's dependency graph.
+These dependencies include source Iceberg tables, views, and materialized views.
+
+The refresh state has the following fields:
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `view-version-id` | The `version-id` of the materialized view when the refresh operation was performed |
+| _required_ | `source-states` | A list of [source states](#source-state) records |
+| _required_ | `refresh-start-timestamp-ms` | A timestamp of when the refresh operation was started |
+
+#### Source state
+
+Source state records capture the state of objects referenced by a materialized view including objects referenced by source materialized views.
+Each record has a `type` field that determines its form:
+
+| Type | Description |
+|---------|-------------|
+| `table` | An Iceberg table, including storage tables of source materialized views |
+| `view` | An Iceberg view, including source materialized views |
+
+Source materialized views are represented by two source state entries: one for the view itself and one for its storage table.
+
+#### Source table state
+
+A source table record captures the state of a source table (including source MV's storage table) at the time of the last refresh operation.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `type` | A string that must be set to `table` |
+| _required_ | `name` | A string specifying the name of the source table |
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _optional_ | `catalog` | An optional name of the catalog. If set to `null` the catalog is the same as the materialized views' |
+| _required_ | `uuid` | The uuid of the source table |
+| _required_ | `snapshot-id` | The snapshot-id of the source table that was read during the refresh operation |
+| _optional_ | `ref` | Branch name of the source table being referenced in the view query |
+
+When `ref` is `null` or not set, it defaults to "main".
+
+#### Source view state
+
+A source view record captures the state of a source view at the time of the last refresh operation.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `type` | A string that must be set to `view` |
+| _required_ | `name` | A string specifying the name of the source view |
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _optional_ | `catalog` | An optional name of the catalog. If set to `null` the catalog is the same as the materialized views' |
+| _required_ | `uuid` | The uuid of the source view |
+| _required_ | `version-id` | The version-id of the source view that was read during the refresh operation |
+
+#### Storage table creation and configuration
+
+When processing a `CREATE MATERIALIZED VIEW` statement, query engines must:
+
+1. Create the storage table as a regular Iceberg table with any specified configurations (partitioning, sort order, compression, etc.).

Review Comment:
I don't feel like we need to include this in the spec. The configurations set on the table are an implementation detail and don't need to be enforced by the spec. Some engines may allow setting these configurations, while others may not (and rely on the backing catalog implementation). This is also not a query engine requirement, as the catalog implementation should handle this (not the engine directly).
