stevenzwu commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r3242602282
##########
format/view-spec.md:
##########
@@ -160,7 +178,120 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The `refresh-state` property is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of a storage table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is **fresh** when the storage table represents the result of the current view query (at the materialized view's current `view-version-id`) over the current state of its dependencies. Dependencies are determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own dependencies are transitively dependencies of the materialized view), and intermediate materialized views (treated as their storage tables, with their own freshness established recursively from their `refresh-state`).
+
+A change to the materialized view's definition produces a new `view-version-id`; any storage-table snapshot recorded at a prior `view-version-id` is not fresh under the current definition.
+
+The `refresh-state` summary on each storage-table snapshot records dependency state observed at refresh time. Producers populate it; consumers use it to assess freshness without re-executing the query. The spec does not mandate what producers record or how consumers assess. See [Appendix B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
+
+##### Producer flexibility
+
+Producers may selectively choose a subset of their dependencies to record — for example, skipping non-Iceberg sources or recording an empty list.
+
+When writing the refresh state, producers:
+
+- **Must** record `view-version-id` and `refresh-start-timestamp-ms`.
+- **Must** include all distinct source states for the inputs they chose to track.
+- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or freshness is determined by a mechanism outside this spec).
+
+A snapshot whose refresh state violates a `Must` rule is invalid; consumers may treat it as if it had no `refresh-state`.
+
+##### Consumer options
+
+Consumers may use any combination of the following to assess the storage table:
+
+- **Recency policy.** Accept the storage table when `refresh-start-timestamp-ms` falls within a staleness window. A recency policy bounds data age but does not establish freshness.
+- **Trust the recorded `source-states`.** Compare each entry against the current catalog state — `snapshot-id` for tables, `version-id` for views, optionally recursive verification for intermediate materialized views recorded by their storage tables. Also confirm that the recorded `view-version-id` equals the materialized view's current `view-version-id`.
+- **Verify by parsing the view query.** Derive the dependency set from the SQL and confirm every dependency is covered by `source-states` and matches the current state. Treat any uncovered dependency as undetermined.
+
+If a consumer's assessment passes, it reads from the storage table; otherwise it evaluates the view query in place of the storage table.
+
+#### Refresh state
+
+The refresh state record captures the dependencies in the materialized view's dependency graph. Each dependency is recorded in `source-states` as either a `table` entry (a base table or an intermediate materialized view's storage table) or a `view` entry.
+
+The refresh state has the following fields:
+
+| Requirement | Field name                   | Description |
+|-------------|------------------------------|-------------|
+| _required_  | `view-version-id`            | The `version-id` of the materialized view when the refresh operation was performed |
+| _required_  | `source-states`              | A list of [source state](#source-state) records |
+| _required_  | `refresh-start-timestamp-ms` | A timestamp of when the refresh operation was started |
+
+#### Source state
+
+Source state records capture the state of objects referenced by a materialized view. Each record has a `type` field that determines its form:
+
+| Type    | Description |
+|---------|-------------|
+| `table` | An Iceberg table — either a base table in the dependency graph, or the storage table of an intermediate materialized view |
+| `view`  | An Iceberg view in the dependency graph |
+
+An intermediate materialized view must be recorded as a single `table` entry referencing its storage table; recording it as a `view` entry is not permitted. The intermediate materialized view's own dependencies are reached recursively through its `refresh-state`.
+
+#### Source table state
+
+A source table record captures the state of a source table (including a source materialized view's storage table) at the time of the last refresh operation.
+
+| Requirement | Field name    | Description |
+|-------------|---------------|-------------|
+| _required_  | `type`        | A string that must be set to `table` |
+| _required_  | `name`        | A string specifying the name of the source table |
+| _required_  | `namespace`   | A list of strings for namespace levels |
+| _optional_  | `catalog`     | An optional name of the catalog. If not set, the catalog is the same as the materialized view's |
+| _required_  | `uuid`        | The uuid of the source table |
+| _required_  | `snapshot-id` | The snapshot-id of the source table that was read during the refresh operation |
+| _optional_  | `ref`         | Branch name of the source table being referenced in the view query |
+
+When `ref` is `null` or not set, it defaults to `main`.
+
+#### Source view state
+
+A source view record captures the state of a source view at the time of the last refresh operation.
+
+| Requirement | Field name   | Description |
+|-------------|--------------|-------------|
+| _required_  | `type`       | A string that must be set to `view` |
+| _required_  | `name`       | A string specifying the name of the source view |
+| _required_  | `namespace`  | A list of strings for namespace levels |
+| _optional_  | `catalog`    | An optional name of the catalog. If not set, the catalog is the same as the materialized view's |
+| _required_  | `uuid`       | The uuid of the source view |
+| _required_  | `version-id` | The version-id of the source view that was read during the refresh operation |
+
+#### Storage table creation and configuration
+
+When processing a `CREATE MATERIALIZED VIEW` statement, query engines must:
+
+1. Create the storage table as a regular Iceberg table with any specified configurations (partitioning, sort order, compression, etc.).
+2. Create the materialized view metadata with a `storage-table` reference pointing to the created storage table.
+
+The storage table must exist and be accessible before the materialized view metadata is committed.
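
For readers following the hunk above, here is a minimal sketch of the "trust the recorded `source-states`" consumer option. It is an illustration only, not code from the PR or from any Iceberg library. The sample payload, the helper names `parse_refresh_state` and `matches_recorded_state`, and the `current_ids` mapping (a stand-in for real catalog lookups) are all assumptions. It ignores `ref`, `catalog`, and recursive verification of intermediate materialized views.

```python
import json

# Hypothetical refresh-state payload shaped after the field tables in the hunk.
# Every identifier and value here is made up for illustration.
REFRESH_STATE_JSON = json.dumps({
    "view-version-id": 3,
    "refresh-start-timestamp-ms": 1700000000000,
    "source-states": [
        {"type": "table", "namespace": ["db"], "name": "orders",
         "uuid": "t-1", "snapshot-id": 101},
        {"type": "view", "namespace": ["db"], "name": "recent_orders",
         "uuid": "v-1", "version-id": 7},
    ],
})


def parse_refresh_state(raw):
    """Decode the JSON-encoded string; return None if a required field is missing."""
    state = json.loads(raw)
    required = ("view-version-id", "source-states", "refresh-start-timestamp-ms")
    if any(key not in state for key in required):
        return None  # handled as if the snapshot had no refresh-state
    return state


def matches_recorded_state(state, current_view_version_id, current_ids):
    """Compare recorded source states against the current catalog state.

    `current_ids` is a hypothetical stand-in for catalog lookups: it maps a
    source uuid to its current snapshot-id (tables) or version-id (views).
    """
    if state is None or state["view-version-id"] != current_view_version_id:
        return False
    for source in state["source-states"]:
        id_field = "snapshot-id" if source["type"] == "table" else "version-id"
        if current_ids.get(source["uuid"]) != source[id_field]:
            return False
    # An empty source-states list passes trivially in this sketch.
    return True


state = parse_refresh_state(REFRESH_STATE_JSON)
print(matches_recorded_state(state, 3, {"t-1": 101, "v-1": 7}))
print(matches_recorded_state(state, 3, {"t-1": 102, "v-1": 7}))
```

Under these assumptions the first call passes because both recorded ids match the current ones, and the second fails because the `orders` table has moved to a newer snapshot. Since an empty `source-states` list passes this check, a consumer would probably combine it with a recency policy on `refresh-start-timestamp-ms`.
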
Review Comment:
nit: `before` may be a bit strict. The storage table and the materialized view metadata can be committed in the same transaction.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
