Re: [PR] Materialized View Spec [iceberg]

via GitHub Thu, 14 May 2026 11:53:07 -0700


stevenzwu commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r3243563129



##########
format/view-spec.md:
##########
@@ -160,7 +178,120 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
+
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
+
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
+
+##### Producer flexibility
+
+Producers may selectively choose a subset of their dependencies to record — 
for example, skipping non-Iceberg sources or recording an empty list.
+
+When writing the refresh state, producers:
+
+- **Must** record `view-version-id` and `refresh-start-timestamp-ms`.
+- **Must** include all distinct source states for the inputs they chose to 
track.
+- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or 
freshness is determined by a mechanism outside this spec).
+
+A snapshot whose refresh state violates a `Must` rule is invalid; consumers 
may treat it as if it had no `refresh-state`.
+
+##### Consumer options
+
+Consumers may use any combination of the following to assess the storage table:
+
+- **Recency policy.** Accept the storage table when 
`refresh-start-timestamp-ms` falls within a staleness window. A recency policy 
bounds data age but does not establish freshness.
+- **Trust the recorded `source-states`.** Compare each entry against the 
current catalog state — `snapshot-id` for tables, `version-id` for views, 
optionally recursive verification for intermediate materialized views recorded 
by their storage tables. Also confirm that the recorded `view-version-id` 
equals the materialized view's current `view-version-id`.
+- **Verify by parsing the view query.** Derive the dependency set from the SQL 
and confirm every dependency is covered by `source-states` and matches the 
current state. Treat any uncovered dependency as undetermined.
+
+If a consumer's assessment passes, it reads from the storage table; otherwise 
it evaluates the view query in place of the storage table.

Review Comment:
   This wording restricts consumer behavior to `evaluate the view query`. In the past we discussed several valid behaviors for a failed assessment:
   
   * evaluate the view query (like a logical view)
   * calculate the incremental result and merge it with the MV storage table (BigQuery does something similar)
   * fail
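
To make the three options concrete, here is a minimal sketch of a consumer that separates the freshness assessment from the fallback policy. It is only an illustration under stated assumptions: `OnStale`, `RefreshState`, `is_fresh`, and `plan_read` are hypothetical names, the `source_states` mapping is a simplified stand-in for the spec's `source-states`, and nothing here is an existing Iceberg API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class OnStale(Enum):
    """Hypothetical consumer policies for a failed freshness assessment."""
    EVALUATE_VIEW_QUERY = "evaluate-view-query"  # behave like a logical view
    INCREMENTAL_MERGE = "incremental-merge"      # compute the delta and merge it with the storage table
    FAIL = "fail"                                # refuse to answer from stale data


@dataclass
class RefreshState:
    """Simplified stand-in for the `refresh-state` record (shape is an assumption)."""
    view_version_id: int
    refresh_start_timestamp_ms: int
    # source identifier -> recorded snapshot-id (tables) or version-id (views)
    source_states: Dict[str, int] = field(default_factory=dict)


def is_fresh(state: Optional[RefreshState],
             current_view_version_id: int,
             current_source_states: Dict[str, int]) -> bool:
    """Sketch of the 'trust the recorded source-states' consumer option."""
    if state is None or state.view_version_id != current_view_version_id:
        return False
    # Note: an empty source_states passes vacuously here; a stricter consumer
    # could treat that case as undetermined instead.
    return all(current_source_states.get(name) == recorded
               for name, recorded in state.source_states.items())


def plan_read(state: Optional[RefreshState],
              current_view_version_id: int,
              current_source_states: Dict[str, int],
              on_stale: OnStale) -> str:
    """Return the action a consumer would take; raise if the policy is FAIL."""
    if is_fresh(state, current_view_version_id, current_source_states):
        return "read-storage-table"
    if on_stale is OnStale.FAIL:
        raise RuntimeError("materialized view is stale and policy is 'fail'")
    return on_stale.value


state = RefreshState(view_version_id=3,
                     refresh_start_timestamp_ms=1_000,
                     source_states={"db.orders": 42})
print(plan_read(state, 3, {"db.orders": 42}, OnStale.FAIL))
print(plan_read(state, 3, {"db.orders": 43}, OnStale.EVALUATE_VIEW_QUERY))
print(plan_read(state, 4, {"db.orders": 42}, OnStale.INCREMENTAL_MERGE))
```

The point of the sketch is that the assessment result and the reaction to it are independent choices, so the spec text could describe the fallback as engine-defined rather than fixing it to one behavior.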



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

