Hi all,
I would like to reemphasize the purpose of the refresh-state for
materialized views. The purpose is to determine if the precomputed data
is fresh, stale or invalid. For that the current snapshot-id of every
table in the query tree has to be fetched from the catalog by using its
full identifier and ref. Additionally the refresh state stores the
snapshot-id of the last refresh.
To summarize: *To determine the freshness of the precomputed data we
require the full identifier + ref and snapshot-id of the last refresh
for every table in the fully expanded query tree*
This is a requirement from how the catalog works and independent from
how we design the lineage/refresh state. Additionally we previously
agreed that we should be able to obtain the full list of identifiers
without needing to parse the SQL definition.
Now we are having a discussion in how to store and obtain the fully
expanded list of table identifiers and snapshot-ids. To move the
discussion forward I think it would be valuable to answer the following
3 questions:
1. Should we move the identifiers out of the refresh-state into a new
lineage record that is stored as part of the view metadata?
2. If yes, should the lineage in the view be fully expanded?
3. What should be used as an identifier in the lineage to reference
entries in the refresh-state?
1. Question:
We already agreed that this would be a good idea because we wouldn't
introduce the identifier concept to the table metadata. However, looking
at the complexity that comes with the alternatives, I would like to keep
this question open.
2. Question:
I'm against using a not fully expanded lineage in the view struct. To
recall we require every identifier in the fully expanded query tree to
determine the freshness. Not storing all identifiers in the lineage
would mean to recursively call the catalog and expand the query tree at
read time. This can lead to a large overhead for determining the refresh
state compared to expanding the query tree once at creation time and
then storing the fully expanded lineage.
3. Question:
This depends on Question 2.
For a not fully expanded lineage, the only options would be uuids or
catalog identifiers.
For a fully expanded lineage the question isn't all that relevant. The
current design specifies that the lineage is a map from an identifier to
an id and the refresh-state is a map from such id to a snapshot-id. For
this to work we don't have to specify which kind of identifier has to be
used. One query engine could use uuids, the other engine sequence-ids.
The important assumption we are making is that every id that is used in
the refresh-state has to be defined in the lineage.
So the question about using uuids is rather, can the query engine trust
that the id defined in the lineage is the uuid of the table.
Regarding the complexity that comes from introducing the lineage in the
view I would like to revisit question 1. Introducing the lineage in the
view metadata opens up the question of when should the lineage be fully
expanded. We see that we have 3 options:
1. Not fully expanded lineage -> Expansion at read time
2. Fully expanded lineage -> Expansion at creation time
3. No lineage (use identifiers in refresh-state) -> Expansion at refresh
time
As reading is expected to be the most frequent operation I see option 1
as not favorable. As the query engine has to fully expand the query tree
for a refresh anyway, I see option 3 as the most natural. For a refresh
operation the query engine must understand the SQL dialects of all views
in the query tree and therefore is guaranteed to successfully expand the
lineage. This might not be the case at creation time, which makes option
2 less favorable.
As can be seen, I'm in favor of just storing the refresh-state as a map
from identifier to snapshot-id and not using the lineage. I know that
this introduces the concept of a catalog identifiers to the table
metadata spec, but in my opinion it is by far the simplest option.
I'm interested in your opinions.
Best wishes,
Jan
On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
Thanks Benny. For refs, I am +1 to represent them as UUID + optional
ref, although we can iterate ohe exact JSON structure (e.g., another
option is splitting for (UUID) state from (UUID + ref) state into two
separate higher-level fields).
Generally agree on REFRESH VIEW strategy could be up to the engine,
but it seems like an area where Iceberg could have an opinion/spec on.
I will start a separate thread for that.
Thanks,
Walaa.