Re: [PR] Materialized View Spec [iceberg]

via GitHub Thu, 14 May 2026 13:48:18 -0700


bennychow commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r3244179464



##########
format/view-spec.md:
##########
@@ -322,3 +453,142 @@ 
s3://bucket/warehouse/default.db/event_agg/metadata/00002-(uuid).metadata.json
   } ]
 }
 ```
+
+### Materialized View Example
+
+Imagine the following operation, which creates a materialized view that 
precomputes daily event counts:
+
+```sql
+USE prod.default
+```
+```sql
+CREATE MATERIALIZED VIEW event_agg_mv (
+    event_count COMMENT 'Count of events',
+    event_date)
+COMMENT 'Precomputed daily event counts'
+AS
+SELECT
+    COUNT(1), CAST(event_ts AS DATE)
+FROM events
+GROUP BY 2
+```
+
+The materialized view metadata JSON file looks as follows:
+
+```
+s3://bucket/warehouse/default.db/event_agg_mv/metadata/00001-(uuid).metadata.json
+```
+```json
+{
+  "view-uuid": "b2a12651-3038-4a72-8a31-5027ab84da35",
+  "format-version" : 1,
+  "location" : "s3://bucket/warehouse/default.db/event_agg_mv",
+  "current-version-id" : 1,
+  "properties" : {
+    "comment" : "Precomputed daily event counts"
+  },
+  "versions" : [ {
+    "version-id" : 1,
+    "timestamp-ms" : 1573518431292,
+    "schema-id" : 1,
+    "default-catalog" : "prod",
+    "default-namespace" : [ "default" ],
+    "summary" : {
+      "engine-name" : "Spark",
+      "engine-version" : "3.4.1"
+    },
+    "representations" : [ {
+      "type" : "sql",
+      "sql" : "SELECT\n    COUNT(1), CAST(event_ts AS DATE)\nFROM 
events\nGROUP BY 2",
+      "dialect" : "spark"
+    } ],
+    "storage-table" : {
+      "namespace" : [ "default" ],
+      "name" : "event_agg_mv__storage"
+    }
+  } ],
+  "schemas": [ {
+    "schema-id": 1,
+    "type" : "struct",
+    "fields" : [ {
+      "id" : 1,
+      "name" : "event_count",
+      "required" : false,
+      "type" : "int",
+      "doc" : "Count of events"
+    }, {
+      "id" : 2,
+      "name" : "event_date",
+      "required" : false,
+      "type" : "date"
+    } ]
+  } ],
+  "version-log" : [ {
+    "timestamp-ms" : 1573518431292,
+    "version-id" : 1
+  } ]
+}
+```
+
+After a refresh operation, the storage table's snapshot summary contains the 
`refresh-state` property.
+The following is an example of the `refresh-state` JSON value stored in the 
snapshot summary of the storage table:
+
+```json
+{
+  "view-version-id" : 1,
+  "refresh-start-timestamp-ms" : 1573518435000,
+  "source-states" : [ {
+    "type" : "table",
+    "namespace" : [ "default" ],
+    "name" : "events",
+    "uuid" : "d4a10b5c-1e8a-4b72-9d67-3f4a8c9e1b2d",
+    "snapshot-id" : 6148331192489823102
+  } ]
+}
+```
+
+## Appendix B: What counts as a dependency
+
+The dependencies of a materialized view are determined by parsing the view 
query:
+
+- **Base Iceberg tables** in the dependency graph are recorded by 
`snapshot-id`.
+- **Iceberg views** in the dependency graph are recorded by `version-id`. A 
view's own dependencies are transitively dependencies of the materialized view 
and appear as additional entries in `source-states`.
+- **Intermediate materialized views** in the dependency graph are treated as 
their storage tables and recorded by the storage table's `snapshot-id`. Their 
own freshness is established recursively from their `refresh-state`.
+
+### Example

Review Comment:
   @wmoustafa asked if I could propose a description for Appendix B that shows 
one possible way a producer and consumer could work together, using the refresh 
state, to record and evaluate freshness.
   
   This example assumes:
   
   - Querying the materialized view must be equivalent to querying the base 
tables directly
   - The materialized view contains only Iceberg tables
   - The materialized view can be built on top of other materialized views
   - The consumer does additional work to recursively evaluate materialized 
view freshness
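
   As a minimal sketch of the recursive evaluation under the assumptions above 
(Python, not spec text): the in-memory `catalog` dict, the `C_storage` and 
`D_storage` names, the `mv-version-id` key, and the snapshot IDs for `F`, `G`, 
and `H` are all hypothetical and exist only for illustration. Namespaces and 
UUIDs are omitted for brevity.

```python
import json

def is_fresh(refresh_state, mv_version_id, catalog):
    """Return True if every recorded source state matches the catalog."""
    if refresh_state["view-version-id"] != mv_version_id:
        return False
    for src in refresh_state["source-states"]:
        current = catalog[src["name"]]
        if src["type"] == "view":
            if src["version-id"] != current["version-id"]:
                return False
            continue
        if src["snapshot-id"] != current["snapshot-id"]:
            return False
        # A storage table carries its own refresh-state: recurse into it.
        upstream = current.get("refresh-state")
        if upstream is not None and not is_fresh(
            json.loads(upstream), current["mv-version-id"], catalog
        ):
            return False
    return True

def state(version, sources):
    # refresh-state is stored as a JSON-encoded string in the snapshot summary.
    return json.dumps({
        "view-version-id": version,
        "refresh-start-timestamp-ms": 1573518435000,
        "source-states": sources,
    })

def table(name, snapshot_id):
    return {"type": "table", "name": name, "snapshot-id": snapshot_id}

# Hypothetical catalog state mirroring the A/B/C/D example in Appendix B.
catalog = {
    "B": {"version-id": 5},
    "E": {"snapshot-id": 101},
    "F": {"snapshot-id": 201},
    "G": {"snapshot-id": 202},
    "H": {"snapshot-id": 203},
    "C_storage": {"snapshot-id": 12, "mv-version-id": 1,
                  "refresh-state": state(1, [table("F", 201), table("G", 202)])},
    "D_storage": {"snapshot-id": 14, "mv-version-id": 1,
                  "refresh-state": state(1, [table("H", 203)])},
}

a_state = json.loads(state(1, [
    {"type": "view", "name": "B", "version-id": 5},
    table("E", 101),
    table("C_storage", 12),
    table("D_storage", 14),
]))

print(is_fresh(a_state, 1, catalog))  # True
```

   If `H` later advances to a new snapshot, the same call returns `False` even 
though `A`'s own `source-states` still match, because the recursion into `D`'s 
refresh state finds the mismatch.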



##########
format/view-spec.md:
##########
@@ -322,3 +453,142 @@ 
s3://bucket/warehouse/default.db/event_agg/metadata/00002-(uuid).metadata.json
   } ]
 }
 ```
+
+### Materialized View Example
+
+Imagine the following operation, which creates a materialized view that 
precomputes daily event counts:
+
+```sql
+USE prod.default
+```
+```sql
+CREATE MATERIALIZED VIEW event_agg_mv (
+    event_count COMMENT 'Count of events',
+    event_date)
+COMMENT 'Precomputed daily event counts'
+AS
+SELECT
+    COUNT(1), CAST(event_ts AS DATE)
+FROM events
+GROUP BY 2
+```
+
+The materialized view metadata JSON file looks as follows:
+
+```
+s3://bucket/warehouse/default.db/event_agg_mv/metadata/00001-(uuid).metadata.json
+```
+```json
+{
+  "view-uuid": "b2a12651-3038-4a72-8a31-5027ab84da35",
+  "format-version" : 1,
+  "location" : "s3://bucket/warehouse/default.db/event_agg_mv",
+  "current-version-id" : 1,
+  "properties" : {
+    "comment" : "Precomputed daily event counts"
+  },
+  "versions" : [ {
+    "version-id" : 1,
+    "timestamp-ms" : 1573518431292,
+    "schema-id" : 1,
+    "default-catalog" : "prod",
+    "default-namespace" : [ "default" ],
+    "summary" : {
+      "engine-name" : "Spark",
+      "engine-version" : "3.4.1"
+    },
+    "representations" : [ {
+      "type" : "sql",
+      "sql" : "SELECT\n    COUNT(1), CAST(event_ts AS DATE)\nFROM 
events\nGROUP BY 2",
+      "dialect" : "spark"
+    } ],
+    "storage-table" : {
+      "namespace" : [ "default" ],
+      "name" : "event_agg_mv__storage"
+    }
+  } ],
+  "schemas": [ {
+    "schema-id": 1,
+    "type" : "struct",
+    "fields" : [ {
+      "id" : 1,
+      "name" : "event_count",
+      "required" : false,
+      "type" : "int",
+      "doc" : "Count of events"
+    }, {
+      "id" : 2,
+      "name" : "event_date",
+      "required" : false,
+      "type" : "date"
+    } ]
+  } ],
+  "version-log" : [ {
+    "timestamp-ms" : 1573518431292,
+    "version-id" : 1
+  } ]
+}
+```
+
+After a refresh operation, the storage table's snapshot summary contains the 
`refresh-state` property.
+The following is an example of the `refresh-state` JSON value stored in the 
snapshot summary of the storage table:
+
+```json
+{
+  "view-version-id" : 1,
+  "refresh-start-timestamp-ms" : 1573518435000,
+  "source-states" : [ {
+    "type" : "table",
+    "namespace" : [ "default" ],
+    "name" : "events",
+    "uuid" : "d4a10b5c-1e8a-4b72-9d67-3f4a8c9e1b2d",
+    "snapshot-id" : 6148331192489823102
+  } ]
+}
+```
+
+## Appendix B: What counts as a dependency
+
+The dependencies of a materialized view are determined by parsing the view 
query:
+
+- **Base Iceberg tables** in the dependency graph are recorded by 
`snapshot-id`.
+- **Iceberg views** in the dependency graph are recorded by `version-id`. A 
view's own dependencies are transitively dependencies of the materialized view 
and appear as additional entries in `source-states`.
+- **Intermediate materialized views** in the dependency graph are treated as 
their storage tables and recorded by the storage table's `snapshot-id`. Their 
own freshness is established recursively from their `refresh-state`.
+
+### Example
+
+The query under examination:
+
+- `A` (the materialized view being refreshed): `SELECT ... FROM B JOIN C ON 
...`
+- `B` (regular view): `SELECT ... FROM E JOIN D ON ...`
+- `C` (materialized view): `SELECT ... FROM F JOIN G ON ...`
+- `D` (materialized view): `SELECT ... FROM H WHERE ...`
+- `E`, `F`, `G`, `H`: base Iceberg tables
+
+`A`'s dependencies are `B`, `C`, and `D`. `B` is a regular view; its own 
dependencies (`E` and `D`) are transitively dependencies of `A`. `C` and `D` 
are materialized views; they appear in `A`'s `source-states` as their storage 
tables.
+
+```
+A [MV — being refreshed]
+├── B [VIEW]                            <-- recorded in A: version-id
+│   ├── E [TABLE]                       <-- recorded in A: snapshot-id
+│   └── D [MV]                          <-- recorded in A: storage-table 
snapshot-id
+│       ┄┄┄┄┄┄ recursive boundary ┄┄┄┄┄┄
+│       └── H [TABLE]                   (D's dependency; verified via D's 
refresh-state)
+└── C [MV]                              <-- recorded in A: storage-table 
snapshot-id
+    ┄┄┄┄┄┄ recursive boundary ┄┄┄┄┄┄
+    ├── F [TABLE]                       (C's dependency; verified via C's 
refresh-state)
+    └── G [TABLE]                       (C's dependency; verified via C's 
refresh-state)
+```
+
+`A`'s `source-states`:
+
+| type    | name          | recorded id        |
+|---------|---------------|--------------------|
+| `view`  | `B`           | `version-id: 5`    |
+| `table` | `E`           | `snapshot-id: 101` |
+| `table` | `C` (storage) | `snapshot-id: 12`  |
+| `table` | `D` (storage) | `snapshot-id: 14`  |
+
+`F`, `G`, and `H` do not appear in `A`'s `source-states` directly; they belong 
to `C` and `D`'s dependency sets and are reached recursively through `C` and 
`D`'s refresh states.
+
+A consumer establishes `A`'s freshness by checking each entry in 
`source-states` against the current catalog state. For `C` and `D`, the 
consumer compares the recorded storage-table snapshot to the current snapshot, 
then recurses into their `refresh-state` to verify each is itself fresh.

Review Comment:
   For C and D, it would be really nice if the consumer could know up front 
whether each table is a base table or a storage table.
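
   To illustrate the extra work this implies today, here is a rough Python 
sketch of a consumer that has to infer the distinction by loading each table's 
current snapshot summary and checking for `refresh-state`. The `load_summary` 
callback, the `summaries` dict, and the `C_storage`/`D_storage` names are 
hypothetical. Because `refresh-state` is optional, its absence does not prove a 
table is a base table, so the first bucket is only "base or unknown".

```python
def split_table_entries(source_states, load_summary):
    """Partition `table` entries by whether the summary has refresh-state."""
    base_or_unknown, storage = [], []
    for entry in source_states:
        if entry["type"] != "table":
            continue
        # One extra metadata lookup per entry, just to classify it.
        summary = load_summary(entry["name"])
        if "refresh-state" in summary:
            storage.append(entry["name"])
        else:
            base_or_unknown.append(entry["name"])
    return base_or_unknown, storage

# Hypothetical snapshot summaries for the tables recorded in A's source-states.
summaries = {
    "E": {"operation": "append"},
    "C_storage": {"operation": "overwrite", "refresh-state": "{}"},
    "D_storage": {"operation": "overwrite", "refresh-state": "{}"},
}

entries = [
    {"type": "view", "name": "B", "version-id": 5},
    {"type": "table", "name": "E", "snapshot-id": 101},
    {"type": "table", "name": "C_storage", "snapshot-id": 12},
    {"type": "table", "name": "D_storage", "snapshot-id": 14},
]

print(split_table_entries(entries, summaries.__getitem__))
# (['E'], ['C_storage', 'D_storage'])
```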



##########
format/view-spec.md:
##########
@@ -190,92 +190,93 @@ The table identifier for the storage table that stores 
the precomputed results.
 ### Storage table metadata
 
 This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
-The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
 
 | Requirement | Field name      | Description |
 |-------------|-----------------|-------------|
 | _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
 
 #### Freshness
 
-A materialized view is "fresh" when the storage table adequately represents 
the result of the view query at the current state of its dependencies.
-Since different systems define freshness differently, it is left to the 
consumer to evaluate freshness based on its own policy.
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
 
-**Consumer behavior:**
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
 
-When evaluating freshness, consumers:
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
 
-- May apply time-based freshness policies, such as allowing a staleness window 
based on `refresh-start-timestamp-ms`.
-- May compare the `source-states` list against the states loaded from the 
catalog to verify the producer's freshness interpretation.
-- May parse the view definition to implement more sophisticated policies.
-- When a materialized view is considered stale, can fail, refresh inline, or 
treat the materialized view as a logical view.
-- Should not consume the storage table as it is when the materialized view 
doesn't meet the freshness criteria.
+##### Producer flexibility
 
-**Producer behavior:**
-
-Producers should provide the necessary information in the [refresh 
state](#refresh-state) such that consumers can verify the logical equivalence 
of the precomputed data with the query definition.
-Different producers may have different freshness interpretations, based on how 
much of the refresh state's dependency graph should be evaluated.
-Some producers expect the entire dependency graph to be evaluated and 
therefore include source MV dependencies. Other producers may only expect 
dependencies in the MV's SQL to be evaluated and therefore do not include 
dependencies of source MVs.
+Producers may selectively choose a subset of their dependencies to record — 
for example, skipping non-Iceberg sources or recording an empty list.
 
 When writing the refresh state, producers:
 
-- Should provide a sufficient list of source states such that consumers can 
determine freshness according to the producer's intent. If the producers intent 
is such that it doesn't rely on the source-states to determine freshness, it 
may provide an empty list.
-- If the source state cannot be determined for all objects (for example, for 
non-Iceberg tables or non-deterministic functions) may leave the source states 
list empty.
-- If a stored object is reachable through multiple paths in the dependency 
graph (diamond dependency pattern), all distinct source states have to be 
included in the list.
+- **Must** record `view-version-id` and `refresh-start-timestamp-ms`.
+- **Must** include all distinct source states for the inputs they chose to 
track.
+- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or 
freshness is determined by a mechanism outside this spec).
+
+A snapshot whose refresh state violates a `Must` rule is invalid; consumers 
may treat it as if it had no `refresh-state`.
+
+##### Consumer options
+
+Consumers may use any combination of the following to assess the storage table:
+
+- **Recency policy.** Accept the storage table when 
`refresh-start-timestamp-ms` falls within a staleness window. A recency policy 
bounds data age but does not establish freshness.
+- **Trust the recorded `source-states`.** Compare each entry against the 
current catalog state — `snapshot-id` for tables, `version-id` for views, 
optionally recursive verification for intermediate materialized views recorded 
by their storage tables. Also confirm that the recorded `view-version-id` 
equals the materialized view's current `view-version-id`.
+- **Verify by parsing the view query.** Derive the dependency set from the SQL 
and confirm every dependency is covered by `source-states` and matches the 
current state. Treat any uncovered dependency as undetermined.
+
+If a consumer's assessment passes, it reads from the storage table; otherwise 
it evaluates the view query in place of the storage table.
 
 #### Refresh state
 
-The refresh state record captures the dependencies in the materialized view's 
dependency graph.
-These dependencies include source Iceberg tables, views, and materialized 
views.
+The refresh state record captures the dependencies in the materialized view's 
dependency graph. Each dependency is recorded in `source-states` as either a 
`table` entry (a base table or an intermediate materialized view's storage 
table) or a `view` entry.

Review Comment:
   I'd prefer "upstream" MV over a "base" MV.  I think "base" should be 
reserved for tables.



##########
format/view-spec.md:
##########
@@ -160,7 +178,120 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
+
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
+
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
+
+##### Producer flexibility

Review Comment:
   I can see how this heading has parallel structure with the next heading so 
how about:
   
   - Producer: Recording Refresh State
   - Consumer: Evaluating Refresh State



##########
format/view-spec.md:
##########
@@ -160,7 +178,120 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
+
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
+
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.

Review Comment:
   There are too many different use cases for freshness requirements. The 
producer determines the full set of possibilities through what it puts into the 
refresh state. The consumer has the flexibility to decide whether the included 
refresh state is sufficient for its freshness requirement.
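
   A small Python sketch of that split, under assumed names: the policy labels 
`"recency"` and `"source-states"` and both helper functions are hypothetical, 
not part of the spec. It shows that the same refresh state can be sufficient 
for one consumer and insufficient for another.

```python
def can_evaluate(refresh_state, policy):
    """Does the recorded refresh state carry enough for this consumer policy?"""
    if policy == "recency":
        return "refresh-start-timestamp-ms" in refresh_state
    if policy == "source-states":
        # An empty or missing list gives this consumer nothing to compare.
        return bool(refresh_state.get("source-states"))
    raise ValueError(f"unknown policy: {policy}")

def within_window(refresh_state, now_ms, max_staleness_ms):
    """Recency check: bounds data age, does not establish freshness."""
    age_ms = now_ms - refresh_state["refresh-start-timestamp-ms"]
    return age_ms <= max_staleness_ms

# A producer that chose to record no source states.
recorded = {
    "view-version-id": 1,
    "refresh-start-timestamp-ms": 1573518435000,
    "source-states": [],
}

print(can_evaluate(recorded, "recency"))        # True
print(can_evaluate(recorded, "source-states"))  # False
print(within_window(recorded, 1573518495000, 300_000))  # True
```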



##########
format/view-spec.md:
##########
@@ -160,7 +178,120 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
+
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
+
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
+
+##### Producer flexibility
+
+Producers may selectively choose a subset of their dependencies to record — 
for example, skipping non-Iceberg sources or recording an empty list.
+
+When writing the refresh state, producers:
+
+- **Must** record `view-version-id` and `refresh-start-timestamp-ms`.
+- **Must** include all distinct source states for the inputs they chose to 
track.
+- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or 
freshness is determined by a mechanism outside this spec).
+
+A snapshot whose refresh state violates a `Must` rule is invalid; consumers 
may treat it as if it had no `refresh-state`.
+
+##### Consumer options

Review Comment:
   Consumer: Evaluating Refresh State
   
   to match with **Producer: Recording Refresh State**



##########
format/view-spec.md:
##########
@@ -160,7 +178,120 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
+
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
+
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
+
+##### Producer flexibility
+
+Producers may selectively choose a subset of their dependencies to record — 
for example, skipping non-Iceberg sources or recording an empty list.
+
+When writing the refresh state, producers:
+
+- **Must** record `view-version-id` and `refresh-start-timestamp-ms`.
+- **Must** include all distinct source states for the inputs they chose to 
track.
+- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or 
freshness is determined by a mechanism outside this spec).
+
+A snapshot whose refresh state violates a `Must` rule is invalid; consumers 
may treat it as if it had no `refresh-state`.

Review Comment:
   Agree here too.



##########
format/view-spec.md:
##########
@@ -190,92 +190,93 @@ The table identifier for the storage table that stores 
the precomputed results.
 ### Storage table metadata
 
 This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
-The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
 
 | Requirement | Field name      | Description |
 |-------------|-----------------|-------------|
 | _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
 
 #### Freshness
 
-A materialized view is "fresh" when the storage table adequately represents 
the result of the view query at the current state of its dependencies.
-Since different systems define freshness differently, it is left to the 
consumer to evaluate freshness based on its own policy.
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).

Review Comment:
   Instead of "intermediate" MVs, how about "upstream" MVs? This way we can 
also talk about "downstream" MVs when discussing refresh propagation in future 
iterations of this spec.
   
   Also, we agreed to remove the term "dependencies" from this section and only 
talk about dependencies in the context of the "refresh state" definition later 
in the spec.



##########
format/view-spec.md:
##########
@@ -190,92 +190,93 @@ The table identifier for the storage table that stores 
the precomputed results.
 ### Storage table metadata
 
 This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
-The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
 
 | Requirement | Field name      | Description |
 |-------------|-----------------|-------------|
 | _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
 
 #### Freshness
 
-A materialized view is "fresh" when the storage table adequately represents 
the result of the view query at the current state of its dependencies.
-Since different systems define freshness differently, it is left to the 
consumer to evaluate freshness based on its own policy.
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
 
-**Consumer behavior:**
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
 
-When evaluating freshness, consumers:
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
 
-- May apply time-based freshness policies, such as allowing a staleness window 
based on `refresh-start-timestamp-ms`.
-- May compare the `source-states` list against the states loaded from the 
catalog to verify the producer's freshness interpretation.
-- May parse the view definition to implement more sophisticated policies.
-- When a materialized view is considered stale, can fail, refresh inline, or 
treat the materialized view as a logical view.
-- Should not consume the storage table as it is when the materialized view 
doesn't meet the freshness criteria.
+##### Producer flexibility
 
-**Producer behavior:**
-
-Producers should provide the necessary information in the [refresh 
state](#refresh-state) such that consumers can verify the logical equivalence 
of the precomputed data with the query definition.
-Different producers may have different freshness interpretations, based on how 
much of the refresh state's dependency graph should be evaluated.
-Some producers expect the entire dependency graph to be evaluated and 
therefore include source MV dependencies. Other producers may only expect 
dependencies in the MV's SQL to be evaluated and therefore do not include 
dependencies of source MVs.
+Producers may selectively choose a subset of their dependencies to record — 
for example, skipping non-Iceberg sources or recording an empty list.
 
 When writing the refresh state, producers:
 
-- Should provide a sufficient list of source states such that consumers can 
determine freshness according to the producer's intent. If the producers intent 
is such that it doesn't rely on the source-states to determine freshness, it 
may provide an empty list.
-- If the source state cannot be determined for all objects (for example, for 
non-Iceberg tables or non-deterministic functions) may leave the source states 
list empty.
-- If a stored object is reachable through multiple paths in the dependency 
graph (diamond dependency pattern), all distinct source states have to be 
included in the list.
+- **Must** record `view-version-id` and `refresh-start-timestamp-ms`.
+- **Must** include all distinct source states for the inputs they chose to 
track.
+- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or 
freshness is determined by a mechanism outside this spec).
+
+A snapshot whose refresh state violates a `Must` rule is invalid; consumers 
may treat it as if it had no `refresh-state`.
+
+##### Consumer options
+
+Consumers may use any combination of the following to assess the storage table:
+
+- **Recency policy.** Accept the storage table when 
`refresh-start-timestamp-ms` falls within a staleness window. A recency policy 
bounds data age but does not establish freshness.
+- **Trust the recorded `source-states`.** Compare each entry against the 
current catalog state — `snapshot-id` for tables, `version-id` for views, 
optionally recursive verification for intermediate materialized views recorded 
by their storage tables. Also confirm that the recorded `view-version-id` 
equals the materialized view's current `view-version-id`.
+- **Verify by parsing the view query.** Derive the dependency set from the SQL 
and confirm every dependency is covered by `source-states` and matches the 
current state. Treat any uncovered dependency as undetermined.
+
+If a consumer's assessment passes, it reads from the storage table; otherwise 
it evaluates the view query in place of the storage table.
 
 #### Refresh state
 
-The refresh state record captures the dependencies in the materialized view's 
dependency graph.
-These dependencies include source Iceberg tables, views, and materialized 
views.
+The refresh state record captures the dependencies in the materialized view's 
dependency graph. Each dependency is recorded in `source-states` as either a 
`table` entry (a base table or an intermediate materialized view's storage 
table) or a `view` entry.
 
 The refresh state has the following fields:
 
-| Requirement | Field name     | Description |
-|-------------|----------------|-------------|
-| _required_  | `view-version-id`         | The `version-id` of the 
materialized view when the refresh operation was performed  |
-| _required_  | `source-states`        | A list of [source 
states](#source-state) records |
+| Requirement | Field name                   | Description |
+|-------------|------------------------------|-------------|
+| _required_  | `view-version-id`            | The `version-id` of the 
materialized view when the refresh operation was performed |
+| _required_  | `source-states`              | A list of [source 
state](#source-state) records |
 | _required_  | `refresh-start-timestamp-ms` | A timestamp of when the refresh 
operation was started |
 
 #### Source state
 
-Source state records capture the state of objects referenced by a materialized 
view including objects referenced by source materialized views.
-Each record has a `type` field that determines its form:
+Source state records capture the state of objects referenced by a materialized 
view. Each record has a `type` field that determines its form:
 
 | Type    | Description |
 |---------|-------------|
-| `table` | An Iceberg table, including storage tables of source materialized 
views |
-| `view`  | An Iceberg view, including source materialized views |
+| `table` | An Iceberg table — either a base table in the dependency graph, or 
the storage table of an intermediate materialized view |
+| `view`  | An Iceberg view in the dependency graph |
 
-Source materialized views are represented by two source state entries: one for 
the view itself and one for its storage table.
+An intermediate materialized view must be recorded as a single `table` entry 
referencing its storage table; recording it as a `view` entry is not permitted. 
The intermediate materialized view's own dependencies are reached recursively 
through its `refresh-state`.

Review Comment:
   I agree with Igor and Steven here too.



##########
format/view-spec.md:
##########
@@ -190,92 +190,93 @@ The table identifier for the storage table that stores 
the precomputed results.
 ### Storage table metadata
 
 This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
-The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
 
 | Requirement | Field name      | Description |
 |-------------|-----------------|-------------|
 | _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
 
 #### Freshness
 
-A materialized view is "fresh" when the storage table adequately represents 
the result of the view query at the current state of its dependencies.
-Since different systems define freshness differently, it is left to the 
consumer to evaluate freshness based on its own policy.
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
 
-**Consumer behavior:**
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
 
-When evaluating freshness, consumers:
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
 
-- May apply time-based freshness policies, such as allowing a staleness window 
based on `refresh-start-timestamp-ms`.
-- May compare the `source-states` list against the states loaded from the 
catalog to verify the producer's freshness interpretation.
-- May parse the view definition to implement more sophisticated policies.
-- When a materialized view is considered stale, can fail, refresh inline, or 
treat the materialized view as a logical view.
-- Should not consume the storage table as it is when the materialized view 
doesn't meet the freshness criteria.
+##### Producer flexibility
 
-**Producer behavior:**
-
-Producers should provide the necessary information in the [refresh 
state](#refresh-state) such that consumers can verify the logical equivalence 
of the precomputed data with the query definition.
-Different producers may have different freshness interpretations, based on how 
much of the refresh state's dependency graph should be evaluated.
-Some producers expect the entire dependency graph to be evaluated and 
therefore include source MV dependencies. Other producers may only expect 
dependencies in the MV's SQL to be evaluated and therefore do not include 
dependencies of source MVs.
+Producers may selectively choose a subset of their dependencies to record — 
for example, skipping non-Iceberg sources or recording an empty list.
 
 When writing the refresh state, producers:
 
-- Should provide a sufficient list of source states such that consumers can 
determine freshness according to the producer's intent. If the producers intent 
is such that it doesn't rely on the source-states to determine freshness, it 
may provide an empty list.
-- If the source state cannot be determined for all objects (for example, for 
non-Iceberg tables or non-deterministic functions) may leave the source states 
list empty.
-- If a stored object is reachable through multiple paths in the dependency 
graph (diamond dependency pattern), all distinct source states have to be 
included in the list.
+- **Must** record `view-version-id` and `refresh-start-timestamp-ms`.
+- **Must** include all distinct source states for the inputs they chose to 
track.

Review Comment:
   I agree too. "Should" is better.



##########
format/view-spec.md:
##########
@@ -322,3 +453,142 @@ 
s3://bucket/warehouse/default.db/event_agg/metadata/00002-(uuid).metadata.json
   } ]
 }
 ```
+
+### Materialized View Example
+
+Imagine the following operation, which creates a materialized view that 
precomputes daily event counts:
+
+```sql
+USE prod.default
+```
+```sql
+CREATE MATERIALIZED VIEW event_agg_mv (
+    event_count COMMENT 'Count of events',
+    event_date)
+COMMENT 'Precomputed daily event counts'
+AS
+SELECT
+    COUNT(1), CAST(event_ts AS DATE)
+FROM events
+GROUP BY 2
+```
+
+The materialized view metadata JSON file looks as follows:
+
+```
+s3://bucket/warehouse/default.db/event_agg_mv/metadata/00001-(uuid).metadata.json
+```
+```json
+{
+  "view-uuid": "b2a12651-3038-4a72-8a31-5027ab84da35",
+  "format-version" : 1,
+  "location" : "s3://bucket/warehouse/default.db/event_agg_mv",
+  "current-version-id" : 1,
+  "properties" : {
+    "comment" : "Precomputed daily event counts"
+  },
+  "versions" : [ {
+    "version-id" : 1,
+    "timestamp-ms" : 1573518431292,
+    "schema-id" : 1,
+    "default-catalog" : "prod",
+    "default-namespace" : [ "default" ],
+    "summary" : {
+      "engine-name" : "Spark",
+      "engine-version" : "3.4.1"
+    },
+    "representations" : [ {
+      "type" : "sql",
+      "sql" : "SELECT\n    COUNT(1), CAST(event_ts AS DATE)\nFROM 
events\nGROUP BY 2",
+      "dialect" : "spark"
+    } ],
+    "storage-table" : {
+      "namespace" : [ "default" ],
+      "name" : "event_agg_mv__storage"
+    }
+  } ],
+  "schemas": [ {
+    "schema-id": 1,
+    "type" : "struct",
+    "fields" : [ {
+      "id" : 1,
+      "name" : "event_count",
+      "required" : false,
+      "type" : "int",
+      "doc" : "Count of events"
+    }, {
+      "id" : 2,
+      "name" : "event_date",
+      "required" : false,
+      "type" : "date"
+    } ]
+  } ],
+  "version-log" : [ {
+    "timestamp-ms" : 1573518431292,
+    "version-id" : 1
+  } ]
+}
+```
+
+After a refresh operation, the storage table's snapshot summary contains the 
`refresh-state` property.
+The following is an example of the `refresh-state` JSON value stored in the 
snapshot summary of the storage table:
+
+```json
+{
+  "view-version-id" : 1,
+  "refresh-start-timestamp-ms" : 1573518435000,
+  "source-states" : [ {
+    "type" : "table",
+    "namespace" : [ "default" ],
+    "name" : "events",
+    "uuid" : "d4a10b5c-1e8a-4b72-9d67-3f4a8c9e1b2d",
+    "snapshot-id" : 6148331192489823102
+  } ]
+}
+```
+
+## Appendix B: What counts as a dependency
+
+The dependencies of a materialized view are determined by parsing the view 
query:
+
+- **Base Iceberg tables** in the dependency graph are recorded by 
`snapshot-id`.
+- **Iceberg views** in the dependency graph are recorded by `version-id`. A 
view's own dependencies are transitively dependencies of the materialized view 
and appear as additional entries in `source-states`.
+- **Intermediate materialized views** in the dependency graph are treated as 
their storage tables and recorded by the storage table's `snapshot-id`. Their 
own freshness is established recursively from their `refresh-state`.
+
+### Example
+
+The query under examination:
+
+- `A` (the materialized view being refreshed): `SELECT ... FROM B JOIN C ON 
...`
+- `B` (regular view): `SELECT ... FROM E JOIN D ON ...`
+- `C` (materialized view): `SELECT ... FROM F JOIN G ON ...`
+- `D` (materialized view): `SELECT ... FROM H WHERE ...`
+- `E`, `F`, `G`, `H`: base Iceberg tables
+
+`A`'s dependencies are `B`, `C`, and `D`. `B` is a regular view; its own 
dependencies (`E` and `D`) are transitively dependencies of `A`. `C` and `D` 
are materialized views; they appear in `A`'s `source-states` as their storage 
tables.

Review Comment:
   I agree with Steven. D is not a direct dependency of A; A directly depends 
only on B and C.
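
A minimal sketch of the distinction, using the objects from the Appendix B example. The `CATALOG` dictionary and both function names are hypothetical illustrations, not part of the PR or of any Iceberg API. It assumes the rule stated in the appendix: regular views are expanded transitively, while intermediate materialized views are recorded as their storage tables and not expanded.

```python
# Hypothetical catalog for the Appendix B example: name -> (kind, direct references).
CATALOG = {
    "A": ("materialized_view", ["B", "C"]),
    "B": ("view", ["E", "D"]),
    "C": ("materialized_view", ["F", "G"]),
    "D": ("materialized_view", ["H"]),
    "E": ("table", []),
    "F": ("table", []),
    "G": ("table", []),
    "H": ("table", []),
}


def direct_dependencies(name):
    """Objects referenced directly in the SQL of `name`."""
    return list(CATALOG[name][1])


def tracked_dependencies(name):
    """Objects reachable from `name`, expanding regular views only.

    Materialized views and base tables are recorded but not expanded,
    since a materialized view stands in for its storage table.
    """
    ordered = []

    def visit(dep):
        if dep in ordered:
            return  # already recorded (e.g. reached through two paths)
        ordered.append(dep)
        kind, refs = CATALOG[dep]
        if kind == "view":
            for ref in refs:
                visit(ref)

    for dep in CATALOG[name][1]:
        visit(dep)
    return ordered


print(direct_dependencies("A"))
print(tracked_dependencies("A"))
```

Under these assumptions `D` shows up only in the second result, because it is reached through the regular view `B` rather than referenced by `A` itself.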



##########
format/view-spec.md:
##########
@@ -160,7 +178,120 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
+
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
+
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
+
+##### Producer flexibility
+
+Producers may selectively choose a subset of their dependencies to record — 
for example, skipping non-Iceberg sources or recording an empty list.
+
+When writing the refresh state, producers:
+
+- **Must** record `view-version-id` and `refresh-start-timestamp-ms`.
+- **Must** include all distinct source states for the inputs they chose to 
track.
+- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or 
freshness is determined by a mechanism outside this spec).
+
+A snapshot whose refresh state violates a `Must` rule is invalid; consumers 
may treat it as if it had no `refresh-state`.
+
+##### Consumer options
+
+Consumers may use any combination of the following to assess the storage table:
+
+- **Recency policy.** Accept the storage table when 
`refresh-start-timestamp-ms` falls within a staleness window. A recency policy 
bounds data age but does not establish freshness.
+- **Trust the recorded `source-states`.** Compare each entry against the 
current catalog state — `snapshot-id` for tables, `version-id` for views, 
optionally recursive verification for intermediate materialized views recorded 
by their storage tables. Also confirm that the recorded `view-version-id` 
equals the materialized view's current `view-version-id`.
+- **Verify by parsing the view query.** Derive the dependency set from the SQL 
and confirm every dependency is covered by `source-states` and matches the 
current state. Treat any uncovered dependency as undetermined.
+
+If a consumer's assessment passes, it reads from the storage table; otherwise 
it evaluates the view query in place of the storage table.

Review Comment:
   I agree... let's not prescribe the "otherwise" part.
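
One way to keep the "otherwise" part out of the spec is to have the check return only a verdict and leave the reaction to the engine. The sketch below illustrates the "trust the recorded `source-states`" option under that assumption. The function name and the `current_ids` mapping (object UUID to its current `snapshot-id` or `version-id`) are hypothetical. The field names come from the `refresh-state` example in the diff, except `version-id` on `view` entries, which is assumed from the consumer-options text.

```python
import json


def source_states_match(refresh_state_json, current_view_version_id, current_ids):
    """Return True when the recorded refresh state matches the current catalog state.

    `current_ids` is a hypothetical mapping from object UUID to the current
    `snapshot-id` (tables) or `version-id` (views) loaded from the catalog.
    """
    state = json.loads(refresh_state_json)
    if state["view-version-id"] != current_view_version_id:
        return False  # refreshed under an older view definition
    for entry in state["source-states"]:
        key = "snapshot-id" if entry["type"] == "table" else "version-id"
        if current_ids.get(entry["uuid"]) != entry[key]:
            return False  # source changed, or is unknown to the catalog
    # Note: an empty `source-states` list passes trivially here; a consumer
    # may prefer to treat that case as undetermined instead.
    return True


example = json.dumps({
    "view-version-id": 1,
    "refresh-start-timestamp-ms": 1573518435000,
    "source-states": [{
        "type": "table",
        "namespace": ["default"],
        "name": "events",
        "uuid": "d4a10b5c-1e8a-4b72-9d67-3f4a8c9e1b2d",
        "snapshot-id": 6148331192489823102,
    }],
})

current = {"d4a10b5c-1e8a-4b72-9d67-3f4a8c9e1b2d": 6148331192489823102}
verdict = source_states_match(example, 1, current)
```

What the caller does with a `False` verdict (fail, refresh inline, or evaluate the view query) stays an engine decision.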



##########
format/view-spec.md:
##########
@@ -160,7 +178,120 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
-## Appendix A: An Example
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The `refresh-state` property is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of a storage 
table snapshot to provide information about the state of the precomputed data.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _optional_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is **fresh** when the storage table represents the result 
of the current view query (at the materialized view's current 
`view-version-id`) over the current state of its dependencies. Dependencies are 
determined by parsing the SQL: base Iceberg tables, Iceberg views (whose own 
dependencies are transitively dependencies of the materialized view), and 
intermediate materialized views (treated as their storage tables, with their 
own freshness established recursively from their `refresh-state`).
+
+A change to the materialized view's definition produces a new 
`view-version-id`; any storage-table snapshot recorded at a prior 
`view-version-id` is not fresh under the current definition.
+
+The `refresh-state` summary on each storage-table snapshot records dependency 
state observed at refresh time. Producers populate it; consumers use it to 
assess freshness without re-executing the query. The spec does not mandate what 
producers record or how consumers assess. See [Appendix 
B](#appendix-b-what-counts-as-a-dependency) for what counts as a dependency.
+
+##### Producer flexibility
+
+Producers may selectively choose a subset of their dependencies to record — 
for example, skipping non-Iceberg sources or recording an empty list.
+
+When writing the refresh state, producers:
+
+- **Must** record `view-version-id` and `refresh-start-timestamp-ms`.
+- **Must** include all distinct source states for the inputs they chose to 
track.
+- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or 
freshness is determined by a mechanism outside this spec).
+
+A snapshot whose refresh state violates a `Must` rule is invalid; consumers 
may treat it as if it had no `refresh-state`.
+
+##### Consumer options
+
+Consumers may use any combination of the following to assess the storage table:
+
+- **Recency policy.** Accept the storage table when 
`refresh-start-timestamp-ms` falls within a staleness window. A recency policy 
bounds data age but does not establish freshness.

Review Comment:
   I like "Recency Policy".  "Refresh Staleness" feels less clear to me.
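
For reference, a recency policy reduces to a single comparison against `refresh-start-timestamp-ms`. The sketch below is illustrative only: the function name, the `max_staleness_ms` parameter, and the injectable `now_ms` clock are assumptions, not anything defined by the PR.

```python
import json
import time


def within_staleness_window(refresh_state_json, max_staleness_ms, now_ms=None):
    """Return True when the refresh started no more than `max_staleness_ms` ago.

    This bounds the age of the data; it says nothing about whether the
    sources changed after the refresh started.
    """
    state = json.loads(refresh_state_json)
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # current wall-clock time in ms since epoch
    return now_ms - state["refresh-start-timestamp-ms"] <= max_staleness_ms


state_json = json.dumps({
    "view-version-id": 1,
    "refresh-start-timestamp-ms": 1573518435000,
    "source-states": [],
})

# One minute after the refresh started, with a one-hour window.
accepted = within_staleness_window(state_json, 3_600_000, now_ms=1573518495000)
```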



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

