stevenzwu commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r2628240131


##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following 
fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of every storage 
table snapshot to determine the freshness of the precomputed data of the 
storage table.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _required_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is considered fresh when its precomputed data is usable by 
consumers. As tables referenced by a materialized view change over time, the 
precomputed data may no longer accurately reflect the logical materialized view 
definition. When this occurs, the materialized view (storage table) is 
considered stale.
+
+Different systems interpret freshness differently, typically based on the 
objects referenced in the fully expanded query tree of the materialized view. 
Some systems consider only direct children, others only leaf nodes, and some 
the entire query tree. The specific interpretation is determined by the 
producer of the storage table.
+
+#### Refresh state
+
+The refresh state record captures the state of source tables, views, and 
materialized views at refresh time. It contains a list of directly or 
indirectly referenced source states that allow a consumer to determine the 
freshness of the materialized view.
+
+**Producer responsibilities:**
+- The producer of the storage table must provide a sufficient list of source 
states so that consumers can determine freshness according to the producer's 
interpretation.
+- The source states list may be empty if the source state cannot be determined 
for all objects (for example, for non-Iceberg tables).
+
+**Consumer evaluation:**
+- The consumer must at least perform a coarse-grained evaluation based on 
`refresh-start-timestamp-ms` and `max-staleness-ms`.
+- The consumer may additionally compare the `source-states` list against the 
states loaded from the catalog.
+- The consumer trusts that the producer has provided all states necessary to 
determine freshness.

Review Comment:
   I am not sure this bullet point about `trust` is needed. 
   
   During the meeting, we also discussed giving consumers more flexibility, something along these lines:
   ```
   - The consumer may parse the view definition to implement a more sophisticated policy.
   ```



##########
format/view-spec.md:
##########
@@ -42,12 +42,24 @@ An atomic swap of one view metadata file for another 
provides the basis for maki
 
 Writers create view metadata files optimistically, assuming that the current 
metadata location will not be changed before the writer's commit. Once a writer 
has created an update, it commits by swapping the view's metadata file pointer 
from the base location to the new location.
 
+### Materialized Views
+
+Materialized views are a type of view with precomputed results from the view 
query stored as a table.
+When queried, engines may return the precomputed data for the materialized 
views, shifting the cost of query execution to the precomputation step.
+
+Iceberg materialized views are implemented as a combination of an Iceberg view 
and an underlying Iceberg table, the "storage-table", which stores the 
precomputed data.
+Materialized View metadata is a superset of View metadata with an additional 
pointer to the storage table. The storage table is an Iceberg table with 
additional materialized view refresh state metadata.
+Refresh metadata contains information about the "source tables" and/or "source 
views", which are the tables/views referenced in the query definition of the 
materialized view.
+
 ## Specification
 
 ### Terms
 
 * **Schema** -- Names and types of fields in a view.
 * **Version** -- The state of a view at some point in time.
+* **Storage table** -- Iceberg table that stores the precomputed data of a 
materialized view.
+* **Source table** -- A table reference that occurs in the query definition of 
a materialized view. The materialized view depends on the data from the source 
tables.
+* **Source view** -- A view reference that occurs in the query definition of a 
materialized view. The materialized view depends on the definitions from the 
source views.

Review Comment:
   For consistency, maybe add:
   ```
   Source materialized view -- A materialized view reference that occurs in the 
query definition of a materialized view.
   ```
   
   I actually think we can remove the following sentence from the bullet points, as I don't see that it clarifies anything further:
   ```
   The materialized view depends on the data from the source tables.
   ```



##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following 
fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of every storage 
table snapshot to determine the freshness of the precomputed data of the 
storage table.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _required_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is considered fresh when its precomputed data is usable by 
consumers. As tables referenced by a materialized view change over time, the 
precomputed data may no longer accurately reflect the logical materialized view 
definition. When this occurs, the materialized view (storage table) is 
considered stale.
+
+Different systems interpret freshness differently, typically based on the 
objects referenced in the fully expanded query tree of the materialized view. 
Some systems consider only direct children, others only leaf nodes, and some 
the entire query tree. The specific interpretation is determined by the 
producer of the storage table.
+
+#### Refresh state
+
+The refresh state record captures the state of source tables, views, and 
materialized views at refresh time. It contains a list of directly or 
indirectly referenced source states that allow a consumer to determine the 
freshness of the materialized view.
+
+**Producer responsibilities:**
+- The producer of the storage table must provide a sufficient list of source 
states so that consumers can determine freshness according to the producer's 
interpretation.
+- The source states list may be empty if the source state cannot be determined 
for all objects (for example, for non-Iceberg tables).
+
+**Consumer evaluation:**
+- The consumer must at least perform a coarse-grained evaluation based on 
`refresh-start-timestamp-ms` and `max-staleness-ms`.

Review Comment:
   We probably need to clearly define the evaluation, for example:
   ```
   A materialized view is fresh if `refresh-start-timestamp-ms` is within the 
window `[now - max-staleness-ms, now]` 
   ```
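
   A minimal sketch of that window check, for illustration only. The function name and the `now_ms` parameter are hypothetical, and treating a refresh timestamp that lies in the future (for example, due to clock skew) as not fresh is an assumption that the proposed text does not address.

```python
import time
from typing import Optional


def is_fresh_by_time(refresh_start_timestamp_ms: int,
                     max_staleness_ms: int,
                     now_ms: Optional[int] = None) -> bool:
    """Coarse-grained check: is the refresh start inside [now - max_staleness_ms, now]?"""
    if now_ms is None:
        # Wall-clock time in milliseconds since the epoch.
        now_ms = int(time.time() * 1000)
    lower_bound_ms = now_ms - max_staleness_ms
    # Both ends of the window are inclusive, matching the bracket notation above.
    return lower_bound_ms <= refresh_start_timestamp_ms <= now_ms
```

   Passing `now_ms` explicitly keeps the check deterministic in tests; a real consumer would normally leave it unset.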



##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following 
fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of every storage 
table snapshot to determine the freshness of the precomputed data of the 
storage table.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _required_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is considered fresh when its precomputed data is usable by 
consumers. As tables referenced by a materialized view change over time, the 
precomputed data may no longer accurately reflect the logical materialized view 
definition. When this occurs, the materialized view (storage table) is 
considered stale.

Review Comment:
   > A materialized view is considered fresh when its precomputed data is 
usable by consumers.
   
   This definition doesn't capture what should be considered fresh. The later part on consumer responsibilities is what effectively defines what fresh means.



##########
format/view-spec.md:
##########
@@ -63,11 +75,13 @@ The view version metadata file has the following fields:
 | _required_  | `versions`           | A list of known [versions](#versions) 
of the view [1] |
 | _required_  | `version-log`        | A list of [version log](#version-log) 
entries with the timestamp and `version-id` for every change to 
`current-version-id` |
 | _optional_  | `properties`         | A string to string map of view 
properties [2] |
+| _optional_  | `max-staleness-ms`   | The maximum time interval in 
milliseconds during which changed source table snapshots are considered fresh 
enough to skip refreshing [3] |

Review Comment:
   The wording here isn't very clear about how this config should be used. It also doesn't capture the delayed view semantics that @igorbelianski-cyber mentioned.
   
   We should remove the `to skip refreshing` part, as consumers can have other fallback behaviors, such as failing or treating the MV as a logical view.



##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following 
fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of every storage 
table snapshot to determine the freshness of the precomputed data of the 
storage table.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _required_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness

Review Comment:
   We should add a paragraph describing possible consumer behavior if an MV is considered stale:
   ```
   - Can fail, or refresh inline, or treat MV as a logical view
   - Mustn’t consume from the storage table
   ```
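
   The two bullets above could be sketched roughly as follows. This is an illustration only: `StalePolicy`, `plan_read`, and the returned action labels are hypothetical names, not part of the spec or of any Iceberg API.

```python
from enum import Enum


class StalePolicy(Enum):
    """Hypothetical consumer-side choices for handling a stale MV."""
    FAIL = "fail"
    REFRESH_INLINE = "refresh-inline"
    LOGICAL_VIEW = "logical-view"


def plan_read(is_fresh: bool, policy: StalePolicy) -> str:
    """Return a label for the action a consumer takes when reading an MV."""
    if is_fresh:
        return "scan-storage-table"
    # Stale: the existing storage table data must not be consumed as-is.
    if policy is StalePolicy.FAIL:
        raise RuntimeError("materialized view is stale")
    if policy is StalePolicy.REFRESH_INLINE:
        # The storage table is only scanned after it has been refreshed.
        return "refresh-then-scan-storage-table"
    # Fall back to executing the view query against the source objects.
    return "execute-view-query"
```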
   



##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following 
fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of every storage 
table snapshot to determine the freshness of the precomputed data of the 
storage table.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _required_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is considered fresh when its precomputed data is usable by 
consumers. As tables referenced by a materialized view change over time, the 
precomputed data may no longer accurately reflect the logical materialized view 
definition. When this occurs, the materialized view (storage table) is 
considered stale.
+
+Different systems interpret freshness differently, typically based on the 
objects referenced in the fully expanded query tree of the materialized view. 
Some systems consider only direct children, others only leaf nodes, and some 
the entire query tree. The specific interpretation is determined by the 
producer of the storage table.
+
+#### Refresh state
+
+The refresh state record captures the state of source tables, views, and 
materialized views at refresh time. It contains a list of directly or 
indirectly referenced source states that allow a consumer to determine the 
freshness of the materialized view.
+
+**Producer responsibilities:**
+- The producer of the storage table must provide a sufficient list of source 
states so that consumers can determine freshness according to the producer's 
interpretation.
+- The source states list may be empty if the source state cannot be determined 
for all objects (for example, for non-Iceberg tables).
+
+**Consumer evaluation:**
+- The consumer must at least perform a coarse-grained evaluation based on 
`refresh-start-timestamp-ms` and `max-staleness-ms`.
+- The consumer may additionally compare the `source-states` list against the 
states loaded from the catalog.
+- The consumer trusts that the producer has provided all states necessary to 
determine freshness.
+
+The refresh state has the following fields:
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `view-version-id`         | The `version-id` of the 
materialized view when the refresh operation was performed  |
+| _required_  | `source-states`        | A list of [source 
states](#source-state) records |
+| _required_  | `refresh-start-timestamp-ms` | A timestamp of when the refresh 
operation was started |
+
+#### Source state
+
+Materialized views can reference source objects of different types, such as 
Iceberg tables and views. Source state records have a common field `type` that 
determines the form, which can be one of the following:
+
+* `table`: An Iceberg table
+* `view`: An Iceberg view

Review Comment:
   I am wondering if a source MV should be a single entry in this list, combining the view and storage table state. This is related to my question on how to load an MV in the REST catalog. Can we hide the storage table from consumers for access control, defining access control only at the view level so that the hidden storage table automatically inherits the view's access policy? That could be a problem for non-REST catalogs, though.
   
   As a single entry, a source MV state could have the following fields:
   
   - view-uuid
   - view-version-id
   - storage-table-uuid
   - storage-table-snapshot-id
   
   The `ref` field of the source table state is not applicable to the storage table.
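
   As a rough illustration of such a combined entry, assuming the four fields listed above: the class name, the Python value types, and the `materialized-view` value for `type` are hypothetical, since the proposed spec text only lists `table` and `view`.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMaterializedViewState:
    """Hypothetical single source-state entry covering both view and storage table."""
    view_uuid: str
    view_version_id: int
    storage_table_uuid: str
    storage_table_snapshot_id: int

    def to_json(self) -> str:
        # Field names follow the hyphenated style used elsewhere in the spec.
        return json.dumps({
            "type": "materialized-view",  # assumed type value, not in the spec
            "view-uuid": self.view_uuid,
            "view-version-id": self.view_version_id,
            "storage-table-uuid": self.storage_table_uuid,
            "storage-table-snapshot-id": self.storage_table_snapshot_id,
        })
```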



##########
format/view-spec.md:
##########
@@ -160,6 +177,89 @@ Each entry in `version-log` is a struct with the following 
fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of every storage 
table snapshot to determine the freshness of the precomputed data of the 
storage table.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _required_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is considered fresh when its precomputed data is usable by 
consumers. As tables referenced by a materialized view change over time, the 
precomputed data may no longer accurately reflect the logical materialized view 
definition. When this occurs, the materialized view (storage table) is 
considered stale.
+
+Different systems interpret freshness differently, typically based on the 
objects referenced in the fully expanded query tree of the materialized view. 
Some systems consider only direct children, others only leaf nodes, and some 
the entire query tree. The specific interpretation is determined by the 
producer of the storage table.
+
+#### Refresh state
+
+The refresh state record captures the state of source tables, views, and 
materialized views at refresh time. It contains a list of directly or 
indirectly referenced source states that allow a consumer to determine the 
freshness of the materialized view.
+
+**Producer responsibilities:**
+- The producer of the storage table must provide a sufficient list of source 
states so that consumers can determine freshness according to the producer's 
interpretation.
+- The source states list may be empty if the source state cannot be determined 
for all objects (for example, for non-Iceberg tables).
+
+**Consumer evaluation:**
+- The consumer must at least perform a coarse-grained evaluation based on 
`refresh-start-timestamp-ms` and `max-staleness-ms`.
+- The consumer may additionally compare the `source-states` list against the 
states loaded from the catalog.

Review Comment:
   We probably lack some detail here. If the coarse-grained evaluation considers the MV stale and this evaluation considers it fresh, the MV should be considered fresh. But that priority is not expressed here.
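
   One way the priority could be expressed, as a sketch only. The function and parameter names are hypothetical, and treating the MV as fresh whenever it is inside the staleness window, even if source states have changed, is an interpretation of `max-staleness-ms` rather than something the spec text states.

```python
from typing import Optional


def is_fresh(within_staleness_window: bool,
             source_states_match: Optional[bool] = None) -> bool:
    """Combine the coarse-grained result with the optional source-state comparison.

    source_states_match is None when the consumer skipped the comparison.
    """
    if source_states_match:
        # Unchanged sources take priority over an expired staleness window.
        return True
    # Otherwise fall back to the coarse-grained timestamp evaluation.
    return within_staleness_window
```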



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
