Re: [DISCUSS] Materialized Views: Lineage and State information

Jan Kaul Thu, 15 Aug 2024 00:22:32 -0700

Hi all,

I would like to reemphasize the purpose of the refresh-state for materialized views. The purpose is to determine if the precomputed data is fresh, stale or invalid. For that, the current snapshot-id of every table in the query tree has to be fetched from the catalog by using its full identifier and ref. Additionally, the refresh state stores the snapshot-id of the last refresh.

To summarize: *To determine the freshness of the precomputed data we require the full identifier + ref and snapshot-id of the last refresh for every table in the fully expanded query tree*
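
To make the requirement concrete, here is a minimal sketch of such a freshness check. It is not part of any spec: the names (`check_freshness`, `Freshness`, the `current_snapshot_id` callback) are hypothetical, the catalog is an in-memory stand-in, and treating an unresolvable table as "invalid" is my assumption, since the exact meaning of "invalid" is not defined here.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple

# (full identifier, ref); the shape is illustrative only, not a spec proposal.
TableRef = Tuple[str, str]


class Freshness(Enum):
    FRESH = "fresh"
    STALE = "stale"
    INVALID = "invalid"


def check_freshness(
    refresh_state: Dict[TableRef, int],
    current_snapshot_id: Callable[[str, str], Optional[int]],
) -> Freshness:
    """Compare the snapshot-ids recorded at the last refresh with the catalog."""
    result = Freshness.FRESH
    for (identifier, ref), refreshed_id in refresh_state.items():
        current_id = current_snapshot_id(identifier, ref)
        if current_id is None:
            # Assumption: a table that can no longer be resolved is "invalid".
            return Freshness.INVALID
        if current_id != refreshed_id:
            result = Freshness.STALE
    return result


# Usage with an in-memory stand-in for the catalog.
catalog = {("db.orders", "main"): 11, ("db.customers", "main"): 20}
state = {("db.orders", "main"): 10, ("db.customers", "main"): 20}
print(check_freshness(state, lambda i, r: catalog.get((i, r))).value)  # stale
```

The point of the sketch is only that the check needs one (identifier, ref, snapshot-id) entry per table, however those entries end up being stored.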

This requirement follows from how the catalog works and is independent of how we design the lineage/refresh state. Additionally, we previously agreed that we should be able to obtain the full list of identifiers without needing to parse the SQL definition.

Now we are having a discussion about how to store and obtain the fully expanded list of table identifiers and snapshot-ids. To move the discussion forward I think it would be valuable to answer the following 3 questions:

1. Should we move the identifiers out of the refresh-state into a new lineage record that is stored as part of the view metadata?


2. If yes, should the lineage in the view be fully expanded?

3. What should be used as an identifier in the lineage to reference entries in the refresh-state?


1. Question:

We already agreed that this would be a good idea because we wouldn't introduce the identifier concept to the table metadata. However, looking at the complexity that comes with the alternatives, I would like to keep this question open.


2. Question:

I'm against using a lineage that is not fully expanded in the view struct. To recall, we require every identifier in the fully expanded query tree to determine the freshness. Not storing all identifiers in the lineage would mean recursively calling the catalog and expanding the query tree at read time. This can lead to a large overhead for determining the refresh state compared to expanding the query tree once at creation time and then storing the fully expanded lineage.
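
As a rough illustration of the read-time cost, the sketch below expands a small query tree recursively and counts the catalog lookups. Everything in it is hypothetical: the `CATALOG` dictionary stands in for a real catalog, the identifiers are made up, and the tree is assumed to be acyclic.

```python
from typing import Dict, List, Set

# Hypothetical in-memory catalog: a view maps to the identifiers it reads from,
# a base table maps to an empty list.
CATALOG: Dict[str, List[str]] = {
    "db.mv": ["db.v1", "db.t3"],
    "db.v1": ["db.t1", "db.t2"],
    "db.t1": [],
    "db.t2": [],
    "db.t3": [],
}


def expand(identifier: str, catalog: Dict[str, List[str]], calls: List[str]) -> Set[str]:
    """Return all base tables reachable from `identifier`, recording each lookup."""
    calls.append(identifier)  # one catalog round trip per node in the query tree
    children = catalog[identifier]
    if not children:
        return {identifier}
    tables: Set[str] = set()
    for child in children:
        tables |= expand(child, catalog, calls)
    return tables


calls: List[str] = []
tables = expand("db.mv", CATALOG, calls)
print(sorted(tables))  # ['db.t1', 'db.t2', 'db.t3']
print(len(calls))      # 5
```

With a partially expanded lineage, lookups like these for the intermediate views would be repeated on every read before the snapshot-ids can even be compared; with a fully expanded lineage the expansion happens once and only the base tables have to be checked.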


3. Question:

This depends on Question 2.

For a lineage that is not fully expanded, the only options would be uuids or catalog identifiers.

For a fully expanded lineage the question isn't all that relevant. The current design specifies that the lineage is a map from an identifier to an id and the refresh-state is a map from such an id to a snapshot-id. For this to work we don't have to specify which kind of id has to be used. One query engine could use uuids, another engine sequence-ids. The important assumption we are making is that every id that is used in the refresh-state has to be defined in the lineage. So the question about using uuids is rather: can the query engine trust that the id defined in the lineage is the uuid of the table?
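
The two-map design and the assumption that goes with it can be sketched as follows. This is only an illustration: `resolve_refresh_state` is a hypothetical helper, and the ids are arbitrary strings to show that the kind of id does not matter.

```python
from typing import Dict


def resolve_refresh_state(
    lineage: Dict[str, str],        # identifier -> id (uuid, sequence-id, ...)
    refresh_state: Dict[str, int],  # id -> snapshot-id of the last refresh
) -> Dict[str, int]:
    """Join the two maps into identifier -> snapshot-id."""
    undefined = set(refresh_state) - set(lineage.values())
    if undefined:
        # The assumption from above: every id in the refresh-state
        # must be defined in the lineage.
        raise ValueError(f"ids not defined in lineage: {sorted(undefined)}")
    return {
        identifier: refresh_state[id_]
        for identifier, id_ in lineage.items()
        if id_ in refresh_state
    }


lineage = {"db.orders": "1", "db.customers": "2"}
refresh_state = {"1": 10, "2": 20}
print(resolve_refresh_state(lineage, refresh_state))
# {'db.orders': 10, 'db.customers': 20}
```

Replacing "1" and "2" with uuids changes nothing in the join, which is why the choice of id only matters if an engine wants to rely on it being the table's actual uuid.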

Regarding the complexity that comes from introducing the lineage in the view, I would like to revisit question 1. Introducing the lineage in the view metadata opens up the question of when the lineage should be fully expanded. I see 3 options:


1. Not fully expanded lineage -> Expansion at read time

2. Fully expanded lineage -> Expansion at creation time

3. No lineage (use identifiers in refresh-state) -> Expansion at refresh time

As reading is expected to be the most frequent operation, I see option 1 as not favorable. As the query engine has to fully expand the query tree for a refresh anyway, I see option 3 as the most natural. For a refresh operation the query engine must understand the SQL dialects of all views in the query tree and is therefore guaranteed to successfully expand the lineage. This might not be the case at creation time, which makes option 2 less favorable.

As can be seen, I'm in favor of just storing the refresh-state as a map from identifier to snapshot-id and not using the lineage. I know that this introduces the concept of catalog identifiers to the table metadata spec, but in my opinion it is by far the simplest option.


I'm interested in your opinions.

Best wishes,

Jan

On 14.08.24 22:24, Walaa Eldin Moustafa wrote:

Thanks Benny. For refs, I am +1 to represent them as UUID + optional ref, although we can iterate on the exact JSON structure (e.g., another option is splitting the (UUID) state from the (UUID + ref) state into two separate higher-level fields).
I generally agree that the REFRESH VIEW strategy could be up to the engine, but it seems like an area where Iceberg could have an opinion/spec. I will start a separate thread for that.
Thanks,
Walaa.
