stevenzwu commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r2586956236


##########
format/view-spec.md:
##########
@@ -63,11 +79,13 @@ The view version metadata file has the following fields:
 | _required_  | `versions`           | A list of known [versions](#versions) 
of the view [1] |
 | _required_  | `version-log`        | A list of [version log](#version-log) 
entries with the timestamp and `version-id` for every change to 
`current-version-id` |
 | _optional_  | `properties`         | A string to string map of view 
properties [2] |
+| _optional_  | `max-staleness-ms`   | The maximum time interval in 
milliseconds after a refresh operation during which the materialized view's 
data is considered fresh [3] |
 
 Notes:
 
 1. The number of versions to retain is controlled by the view property: 
`version.history.num-entries`.
 2. Properties are used for metadata such as `comment` and for settings that 
affect view maintenance. This is not intended to be used for arbitrary metadata.
+3. The `max-staleness-ms` field only applies to materialized views and must be 
set to `null` for common views. If `max-staleness-ms` is not `null` and the 
time elapsed since the last refresh operation is less than `max-staleness-ms`, 
the query engine may return data directly from the `storage-table` without 
evaluating freshness based on the source tables and views. If 
`max-staleness-ms` is `null` for a materialized view, the data in the 
`storage-table` is always considered fresh.

Review Comment:
   > the time elapsed since the last refresh operation
   
   I think this is not aligned with what we discussed in the meeting. The 
consensus was the delayed view semantics that Igor brought up.
   
   ```
   A Materialized View (MV) is considered fresh if and only if the precomputed result from the storage table is equivalent to what would have been obtained by running the MV's defining query at some point in time within the interval [CurrentTime - MaxStaleness, CurrentTime]
   ```
   
   Think about the example from the discussion
   ```
   How should the staleness time be calculated?
   1. Start from the time when the last refresh operation was started.
   2. Start from the earliest snapshot timestamp from any source table with 
snapshots added after the last MV refresh. 
   
   E.g., let’s assume the config is 60 mins.
   10:00 is when the last refresh operation was started.
   10:45 is the time when a source table has a new snapshot added.
   At 11:15, is the MV still fresh? With option 1, it is stale. With option 2, 
it is fresh.
   ```
   
   The current wording is option 1. The delayed view semantics is essentially option 2.
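
To make the difference concrete, here is a minimal sketch of the two options in Python. The function names, parameter names, and the calendar date are hypothetical and are not part of the spec or the PR. The strict `<` comparison follows the "less than `max-staleness-ms`" wording in the diff; applying the same boundary to option 2 is an assumption.

```python
from datetime import datetime, timedelta
from typing import Iterable


def is_fresh_option_1(now: datetime, last_refresh_start: datetime,
                      max_staleness: timedelta) -> bool:
    """Option 1: the staleness clock starts when the last refresh started."""
    return now - last_refresh_start < max_staleness


def is_fresh_option_2(now: datetime,
                      source_snapshots_after_refresh: Iterable[datetime],
                      max_staleness: timedelta) -> bool:
    """Option 2: the clock starts at the earliest source snapshot added
    after the last MV refresh (assumed reading of delayed view semantics)."""
    timestamps = list(source_snapshots_after_refresh)
    if not timestamps:
        # No source table changed since the refresh, so the stored result
        # still matches the defining query.
        return True
    return now - min(timestamps) < max_staleness


# The example from the discussion; the calendar date is arbitrary.
max_staleness = timedelta(minutes=60)
last_refresh_start = datetime(2025, 1, 1, 10, 0)
source_snapshots = [datetime(2025, 1, 1, 10, 45)]
now = datetime(2025, 1, 1, 11, 15)

option_1 = is_fresh_option_1(now, last_refresh_start, max_staleness)
option_2 = is_fresh_option_2(now, source_snapshots, max_staleness)
print(option_1, option_2)
```

With the numbers from the example, option 1 sees 75 minutes since the refresh started and reports stale, while option 2 sees 30 minutes since the first unreflected source snapshot and reports fresh.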
   



##########
format/view-spec.md:
##########
@@ -42,12 +42,28 @@ An atomic swap of one view metadata file for another 
provides the basis for maki
 
 Writers create view metadata files optimistically, assuming that the current 
metadata location will not be changed before the writer's commit. Once a writer 
has created an update, it commits by swapping the view's metadata file pointer 
from the base location to the new location.
 
+### Materialized Views
+
+Materialized views are a type of view with precomputed results from the view 
query stored as a table.
+When queried, engines may return the precomputed data for the materialized 
views, shifting the cost of query execution to the precomputation step.
+
+Iceberg materialized views are implemented as a combination of an Iceberg view 
and an underlying Iceberg table, the "storage-table", which stores the 
precomputed data.
+Materialized View metadata is a superset of View metadata with an additional 
pointer to the storage table. The storage table is an Iceberg table with 
additional materialized view refresh state metadata.
+Refresh metadata contains information about the "source tables" and/or "source 
views", which are the tables/views referenced in the query definition of the 
materialized view.
+During read time, a materialized view (storage table) can be interpreted as 
"fresh", "stale" or "invalid", depending on the following situations:
+* **fresh** -- The `snapshot_id`s of the last refresh operation match the 
current `snapshot_id`s of all the source tables.
+* **stale** -- The `snapshot_id`s do not match for at-least one source table, 
indicating that a refresh operation needs to be performed to capture the latest 
source table changes.

Review Comment:
   This sentence is no longer accurate with the `max-staleness-ms` config.
   
   Should we create a dedicated section (e.g., `status interpretation`) at the end that defines the fresh, stale, and invalid states, after all the concepts (including refresh-state) have been introduced?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

