Re: [PR] Materialized View Spec [iceberg]

via GitHub Thu, 28 May 2026 14:05:01 -0700


bennychow commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r3320601224



##########
format/view-spec.md:
##########
@@ -322,3 +449,215 @@ s3://bucket/warehouse/default.db/event_agg/metadata/00002-(uuid).metadata.json
   } ]
 }
 ```
+
+### Materialized View Example
+
+Imagine the following operation, which creates a materialized view that precomputes daily event counts:
+
+```sql
+USE prod.default
+```
+```sql
+CREATE MATERIALIZED VIEW event_agg_mv (
+    event_count COMMENT 'Count of events',
+    event_date)
+COMMENT 'Precomputed daily event counts'
+AS
+SELECT
+    COUNT(1), CAST(event_ts AS DATE)
+FROM events
+GROUP BY 2
+```
+
+The materialized view metadata JSON file looks as follows:
+
+```
+s3://bucket/warehouse/default.db/event_agg_mv/metadata/00001-(uuid).metadata.json
+```
+```json
+{
+  "view-uuid": "b2a12651-3038-4a72-8a31-5027ab84da35",
+  "format-version" : 1,
+  "location" : "s3://bucket/warehouse/default.db/event_agg_mv",
+  "current-version-id" : 1,
+  "properties" : {
+    "comment" : "Precomputed daily event counts"
+  },
+  "versions" : [ {
+    "version-id" : 1,
+    "timestamp-ms" : 1573518431292,
+    "schema-id" : 1,
+    "default-catalog" : "prod",
+    "default-namespace" : [ "default" ],
+    "summary" : {
+      "engine-name" : "Spark",
+      "engine-version" : "3.4.1"
+    },
+    "representations" : [ {
+      "type" : "sql",
+      "sql" : "SELECT\n    COUNT(1), CAST(event_ts AS DATE)\nFROM events\nGROUP BY 2",
+      "dialect" : "spark"
+    } ],
+    "storage-table" : {
+      "namespace" : [ "default" ],
+      "name" : "event_agg_mv__storage"
+    }
+  } ],
+  "schemas": [ {
+    "schema-id": 1,
+    "type" : "struct",
+    "fields" : [ {
+      "id" : 1,
+      "name" : "event_count",
+      "required" : false,
+      "type" : "int",
+      "doc" : "Count of events"
+    }, {
+      "id" : 2,
+      "name" : "event_date",
+      "required" : false,
+      "type" : "date"
+    } ]
+  } ],
+  "version-log" : [ {
+    "timestamp-ms" : 1573518431292,
+    "version-id" : 1
+  } ]
+}
+```
+
+After a refresh operation, the storage table's snapshot summary contains the `refresh-state` property.
+The following is an example of the `refresh-state` JSON value stored in the snapshot summary of the storage table:
+
+```json
+{
+  "view-version-id" : 1,
+  "refresh-start-timestamp-ms" : 1573518435000,
+  "source-states" : [ {
+    "type" : "table",
+    "namespace" : [ "default" ],
+    "name" : "events",
+    "uuid" : "d4a10b5c-1e8a-4b72-9d67-3f4a8c9e1b2d",
+    "snapshot-id" : 6148331192489823102
+  } ]
+}
+```
+
+## Appendix B: Example strategies for selecting dependencies
+
+Producers may select different sets of dependencies to record in the refresh state. The strategies below illustrate common choices against the same shared query.
+
+### Shared query
+
+- `A` (the materialized view being refreshed): `SELECT ... FROM B JOIN C ON ...`
+- `B` (regular view): `SELECT ... FROM E JOIN D ON ...`
+- `C` (regular view or materialized view, varies by strategy): `SELECT ... FROM F JOIN G ON ...`
+- `D` (regular view or materialized view, varies by strategy): `SELECT ... FROM H WHERE ...`
+- `E`, `F`, `G`, `H`: base Iceberg tables
+
+### Strategy 1: Track all nested tables and views (no nested MVs)
+
+The view query reads only base tables and regular views. The refresh state tracks snapshot IDs of all deeply nested base tables and version IDs of all views traversed. Reuse of the storage table is sensitive to changes in any of them.
+
+`C` and `D` are regular views.
+
+```
+A [MV — being refreshed]
+├── B [VIEW]                            <-- recorded in A: version-id: 5
+│   ├── E [TABLE]                       <-- recorded in A: snapshot-id: 101
+│   └── D [VIEW]                        <-- recorded in A: version-id: 9
+│       └── H [TABLE]                   <-- recorded in A: snapshot-id: 104
+└── C [VIEW]                            <-- recorded in A: version-id: 7
+    ├── F [TABLE]                       <-- recorded in A: snapshot-id: 102
+    └── G [TABLE]                       <-- recorded in A: snapshot-id: 103
+```
+
+### Strategy 2: Treat nested materialized views as tables
+
+Same as Strategy 1, but the query reads from materialized views. The producer stops at each MV boundary and records the MV's storage table snapshot ID. No expansion beyond the MV.
+
+`C` and `D` are materialized views, treated as tables.
+
+```
+A [MV — being refreshed]
+├── B [VIEW]                            <-- recorded in A: version-id: 5
+│   ├── E [TABLE]                       <-- recorded in A: snapshot-id: 101
+│   └── D [MV]                          <-- recorded in A: storage-table snapshot-id: 14
+│       ┄┄┄┄┄┄ recursive boundary ┄┄┄┄┄┄
+│       └── H [TABLE]                   (D's dependency; verified via D's refresh-state)
+└── C [MV]                              <-- recorded in A: storage-table snapshot-id: 12
+    ┄┄┄┄┄┄ recursive boundary ┄┄┄┄┄┄
+    ├── F [TABLE]                       (C's dependency; verified via C's refresh-state)
+    └── G [TABLE]                       (C's dependency; verified via C's refresh-state)
+```
+
+`F`, `G`, and `H` do not appear in `A`'s `source-states`; they belong to `C` and `D`'s dependency graphs.
+
+### Strategy 3: Treat nested materialized views as views
+
+Same as Strategy 1, but the query reads from materialized views. The producer treats each materialized view as a regular view: expand through the MV's view definition and record the underlying tables and views. The MV's storage table snapshot ID is **not** recorded.
+
+`C` and `D` are materialized views, treated as views (expanded).
+
+```
+A [MV — being refreshed]
+├── B [VIEW]                            <-- recorded in A: version-id: 5
+│   ├── E [TABLE]                       <-- recorded in A: snapshot-id: 101
+│   └── D [MV — expanded as view]       <-- recorded in A: version-id: 9
+│       └── H [TABLE]                   <-- recorded in A: snapshot-id: 104
+└── C [MV — expanded as view]           <-- recorded in A: version-id: 7
+    ├── F [TABLE]                       <-- recorded in A: snapshot-id: 102
+    └── G [TABLE]                       <-- recorded in A: snapshot-id: 103
+```
+
+The recorded shape matches Strategy 1. The difference is semantic: `C` and `D` are materialized views whose view definitions were expanded; their storage tables are not part of the recorded state.
+
+### Strategy 4: Track only view versions
+
+The producer treats the storage table as reusable as long as the view definitions in the dependency chain are unchanged. Underlying table changes do not affect freshness. Only view version IDs are recorded.
+
+`C` and `D` are regular views.
+
+```
+A [MV — being refreshed]
+├── B [VIEW]                            <-- recorded in A: version-id: 5
+│   ├── E [TABLE]                       (not recorded)
+│   └── D [VIEW]                        <-- recorded in A: version-id: 9
+│       └── H [TABLE]                   (not recorded)
+└── C [VIEW]                            <-- recorded in A: version-id: 7
+    ├── F [TABLE]                       (not recorded)
+    └── G [TABLE]                       (not recorded)
+```
+
+Snapshots of `E`, `F`, `G`, `H` are not recorded. Reuse is sensitive to view-definition changes but insensitive to data changes in the underlying tables.
+
+### Strategy 5: Empty refresh state (recency only)
+
+The producer leaves `source-states` empty and relies entirely on `refresh-start-timestamp-ms`. Consumers reuse the storage table based on a recency policy alone.
+
+`A`'s refresh state:
+
+```json
+{
+  "view-version-id": 1,
+  "refresh-start-timestamp-ms": 1573518435000,
+  "source-states": []
+}
+```
+
+### Strategy 6: Skip non-Iceberg dependencies
+
+The producer records only Iceberg sources and omits non-Iceberg dependencies entirely. Useful when the view query reads from a mix of Iceberg and non-Iceberg sources and the producer chooses to track only the Iceberg side.
+
+Assume the query reads from base Iceberg tables `E`, `F`, `G`, `H` and an additional non-Iceberg table `N`.
+
+```
+A [MV — being refreshed]
+├── E [TABLE]                           <-- recorded in A: snapshot-id: 101
+├── F [TABLE]                           <-- recorded in A: snapshot-id: 102
+├── G [TABLE]                           <-- recorded in A: snapshot-id: 103
+├── H [TABLE]                           <-- recorded in A: snapshot-id: 104
+└── N [NON-ICEBERG TABLE]               (omitted; not tracked)
+```
+
+`N` is omitted. Consumers cannot verify `N`'s state from the refresh state alone.

Review Comment:
   nit: The consumer does know that the data queried from `N` cannot be older than `refresh-start-timestamp-ms`.
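
To make the nit concrete, here is a rough sketch of how a consumer might bound the staleness of an untracked source such as `N` from the refresh state alone. The helper names (`max_untracked_staleness_ms`, `within_recency_policy`) and the `max_staleness_ms` policy parameter are hypothetical and are not part of the spec or any Iceberg library. Only the `refresh-start-timestamp-ms` field comes from the example above.

```python
import json


# Hypothetical helper (assumption, not spec-defined): the refresh read N at or
# after refresh-start-timestamp-ms, so the data materialized from N is at most
# (now - refresh start) milliseconds old.
def max_untracked_staleness_ms(refresh_state_json: str, now_ms: int) -> int:
    state = json.loads(refresh_state_json)
    return now_ms - state["refresh-start-timestamp-ms"]


# Hypothetical recency policy: accept the storage table if the upper bound on
# staleness for untracked sources does not exceed the consumer's tolerance.
def within_recency_policy(refresh_state_json: str, now_ms: int,
                          max_staleness_ms: int) -> bool:
    return max_untracked_staleness_ms(refresh_state_json, now_ms) <= max_staleness_ms


# Refresh state taken from the Strategy 5 example in the diff above.
refresh_state = json.dumps({
    "view-version-id": 1,
    "refresh-start-timestamp-ms": 1573518435000,
    "source-states": [],
})
```

So even with `N` omitted from `source-states`, a consumer with a recency policy still has an upper bound to work with, which might be worth stating in the text.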



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

