Hi all,

I would like to reemphasize the purpose of the refresh-state for materialized views. The purpose is to determine if the precomputed data is fresh, stale or invalid. For that the current snapshot-id of every table in the query tree has to be fetched from the catalog by using its full identifier and ref. Additionally the refresh state stores the snapshot-id of the last refresh.

To summarize: *To determine the freshness of the precomputed data we require the full identifier + ref and snapshot-id of the last refresh for every table in the fully expanded query tree*

This is a requirement from how the catalog works and independent from how we design the lineage/refresh state. Additionally we previously agreed that we should be able to obtain the full list of identifiers without needing to parse the SQL definition.

Now we are having a discussion in how to store and obtain the fully expanded list of table identifiers and snapshot-ids. To move the discussion forward I think it would be valuable to answer the following 3 questions:

1. Should we move the identifiers out of the refresh-state into a new lineage record that is stored as part of the view metadata?

2. If yes, should the lineage in the view be fully expanded?

3. What should be used as an identifier in the lineage to reference entries in the refresh-state?

1. Question:

We already agreed that this would be a good idea because we wouldn't introduce the identifier concept to the table metadata. However, looking at the complexity that comes with the alternatives, I would like to keep this question open.

2. Question:

I'm against using a not fully expanded lineage in the view struct. To recall we require every identifier in the fully expanded query tree to determine the freshness. Not storing all identifiers in the lineage would mean to recursively call the catalog and expand the query tree at read time. This can lead to a large overhead for determining the refresh state compared to expanding the query tree once at creation time and then storing the fully expanded lineage.

3. Question:

This depends on Question 2.

For a not fully expanded lineage, the only options would be uuids or catalog identifiers.

For a fully expanded lineage the question isn't all that relevant. The current design specifies that the lineage is a map from an identifier to an id and the refresh-state is a map from such id to a snapshot-id. For this to work we don't have to specify which kind of identifier has to be used. One query engine could use uuids, the other engine sequence-ids. The important assumption we are making is that every id that is used in the refresh-state has to be defined in the lineage. So the question about using uuids is rather, can the query engine trust that the id defined in the lineage is the uuid of the table.


Regarding the complexity that comes from introducing the lineage in the view I would like to revisit question 1. Introducing the lineage in the view metadata opens up the question of when should the lineage be fully expanded. We see that we have 3 options:

1. Not fully expanded lineage -> Expansion at read time

2. Fully expanded lineage -> Expansion at creation time

3. No lineage (use identifiers in refresh-state) -> Expansion at refresh time

As reading is expected to be the most frequent operation I see option 1 as not favorable. As the query engine has to fully expand the query tree for a refresh anyway, I see option 3 as the most natural. For a refresh operation the query engine must understand the SQL dialects of all views in the query tree and therefore is guaranteed to successfully expand the lineage. This might not be the case at creation time, which makes option 2 less favorable.

As can be seen, I'm in favor of just storing the refresh-state as a map from identifier to snapshot-id and not using the lineage. I know that this introduces the concept of a catalog identifiers to the table metadata spec, but in my opinion it is by far the simplest option.

I'm interested in your opinions.

Best wishes,

Jan

On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
Thanks Benny. For refs, I am +1 to represent them as UUID + optional ref, although we can iterate ohe exact JSON structure (e.g., another option is splitting for (UUID) state from (UUID + ref) state into two separate higher-level fields).

Generally agree on REFRESH VIEW strategy could be up to the engine, but it seems like an area where Iceberg could have an opinion/spec on. I will start a separate thread for that.

Thanks,
Walaa.

Reply via email to