The option of using catalog identifiers in the state map still
requires keeping lineage information in the view because REFRESH MV
needs the latest fully expanded children (which could have changed
from the set of children currently in the state map),
without reparsing the view tree. Therefore, catalog identifiers in
the state map, does not eliminate the need for tracking children in
the form of catalog identifiers in the lineage side (but in this case
lineage will be a set instead of just a map).
Hence, my concerns with using catalog identifiers (as opposed to
UUIDs) are:
* The fundamental issue where the table spec depends on/refers to the
view spec (because such catalog identifiers are not defined in the
table spec and the only place they have a meaning is in the view spec
lineage information).
* (less fundamental) The denormalization introduced by this
arrangement, where each identifier is 3-parts and all of them repeat
in both lineage info and state map.
I am not very concerned with recursive expansion (through multiple
calls), as it is always the case with views.
On a positive note, looks like we agree to move past sequence numbers :)
Thanks,
Walaa.
On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield
<emkornfi...@gmail.com> wrote:
I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state. If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state. If
we use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map. It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).
I'm +1 on using catalog identifiers as the key. As you point out
this is inline with #2 (try to minimize serial catalog lookups)
in addition to supporting requirement #3.
On Thu, Aug 15, 2024 at 3:27 PM Benny Chow <btc...@gmail.com> wrote:
I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state. If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state. If
we use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map. It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).
Thanks
Benny
On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Thanks Jan, Micah, and Karuppayya for chiming in.
I do not think 3 and 4 are at odds with each other (for
example maintaining both lineage map and state map
through UUID can achieve both). Also, I do not think we
can drop the lineage map since in many catalogs, the only
lookup method is by the catalog identifier, and not the UUID.
I think if we go with UUIDs in the state, we should have
a lineage map (from identifiers to UUIDs) to go with it.
Thanks,
Walaa.
On Thu, Aug 15, 2024 at 1:45 PM karuppayya
<karuppayya1...@gmail.com> wrote:
+1 to storing the refresh state as a map of UUIDs to
snapshot IDs, and deferring the inclusion of lineage
to a future iteration.(like Micha mentioned)
This would greatly simplify the current design.
Also in terms of identifiers to use(UUID or catalog
identifier) for the refresh state
We will not be able to fetch the table/View using the
UUID alone, for example from Hive based catalog.
We do not have the direct mapping between UUID and
table/view.
Which leaves us only with the catalog identifiers?
Thanks & Regards
Karuppayya
On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield
<emkornfi...@gmail.com> wrote:
I think it might be worth restating perceived
requirements and making sure there is alignment
on them.
If I am reading correctly, I think the following
are perceived requirements:
1. An engine must be able to unambiguously detect
that an underlying queried entity has changed or
not via metadata to decide if materialized table
data can be used.
2. The number of sequential catalog reads an
engine needs to make to make use of a
materialized table state at read time is minimized.
3. Engines that don't understand a SQL dialect
can still use MV information if it is not stale.
4. Table refs (catalog identifiers) should not
appear in the materialized table metadata (i.e.
state).
5. The view part of the MV definition should not
need a new revision for any changes to objects it
queries as long as their schemas stay compatible
(only state information on the materialized table
need to change).
In my mind, requirement 1, is the only true
requirement. I think this necessitates having
UUID + snapshot ID as part of the state
information (not necessarily part of the
Lineage). I think it also necessitates having a
denormalized view of all entities that are inputs
into the MV in the state information (a view
object might not change but its underlying tables
or views could change and that must be detected).
Requirements 2 and 5 are somewhat at odds with
each other. If information is denormalized
(fully expanded) in Lineage, it means if table
information is somehow dropped from an
intermediate view, one would need to update the
view (or make excess calls to the catalog). In my
mind, this argues for normalization of the
lineage stored on the view (with the cost of
potentially 1 additional serial catalog lookup
once the state information is retrieved).
I think #3 is at odds with #4. I think #3 is
more worthwhile, then keeping #4 (and as Jan
noted #4 adds complexity).
I think the last remaining question is if lineage
serves any purpose. I think it is useful for the
following reasons:
a) When there are no intermediate views queried,
it allows for fully parallelized lookup calls to
the catalog without having to parse the SQL
statement first
b) Allows tools that don't need to lookup state
information or parse SQL but still navigate
MV/view trees.
Both of these seem relatively minor, so lineage
could perhaps be left out in the first iteration.
As it applies to Jan's questions:
1. Should we move the identifiers out of the
refresh-state into a new lineage record that
is stored as part of the view metadata?
No, I don't think so, I think #5 is a reasonable
requirement and I think this violates it.
2. If yes, should the lineage in the view be
fully expanded?
No, I think only the state should be fully
expanded (for reasons mentioned above, it
potentially requires more updates to the view
then necessary).
3. What should be used as an identifier in
the lineage to reference entries in the
refresh-state?
Catalog identifiers make sense to me. If we
agree requirement #3 is not a requirement then it
seems like this could also be UUIDs.
Thanks,
Micah
On Thu, Aug 15, 2024 at 7:57 AM Benny Chow
<btc...@gmail.com> wrote:
If we go with either UUID or Table
Identifier + VersionID/SnapshotId in the
refresh state, then this list is fully
expanded already. So, to validate the
freshness of a materialization, the engine
doesn't even need to look at the view
lineage. IMO, the view lineage is nice to
have but not a necessary requirement for
MVs. The view lineage makes sharing of views
between engines without common SQL dialects
possible.
Benny
On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Hi all,
I would like to reemphasize the purpose
of the refresh-state for materialized
views. The purpose is to determine if the
precomputed data is fresh, stale or
invalid. For that the current snapshot-id
of every table in the query tree has to
be fetched from the catalog by using its
full identifier and ref. Additionally the
refresh state stores the snapshot-id of
the last refresh.
To summarize: *To determine the freshness
of the precomputed data we require the
full identifier + ref and snapshot-id of
the last refresh for every table in the
fully expanded query tree*
This is a requirement from how the
catalog works and independent from how we
design the lineage/refresh state.
Additionally we previously agreed that we
should be able to obtain the full list of
identifiers without needing to parse the
SQL definition.
Now we are having a discussion in how to
store and obtain the fully expanded list
of table identifiers and snapshot-ids. To
move the discussion forward I think it
would be valuable to answer the following
3 questions:
1. Should we move the identifiers out of
the refresh-state into a new lineage
record that is stored as part of the view
metadata?
2. If yes, should the lineage in the view
be fully expanded?
3. What should be used as an identifier
in the lineage to reference entries in
the refresh-state?
1. Question:
We already agreed that this would be a
good idea because we wouldn't introduce
the identifier concept to the table
metadata. However, looking at the
complexity that comes with the
alternatives, I would like to keep this
question open.
2. Question:
I'm against using a not fully expanded
lineage in the view struct. To recall we
require every identifier in the fully
expanded query tree to determine the
freshness. Not storing all identifiers in
the lineage would mean to recursively
call the catalog and expand the query
tree at read time. This can lead to a
large overhead for determining the
refresh state compared to expanding the
query tree once at creation time and then
storing the fully expanded lineage.
3. Question:
This depends on Question 2.
For a not fully expanded lineage, the
only options would be uuids or catalog
identifiers.
For a fully expanded lineage the question
isn't all that relevant. The current
design specifies that the lineage is a
map from an identifier to an id and the
refresh-state is a map from such id to a
snapshot-id. For this to work we don't
have to specify which kind of identifier
has to be used. One query engine could
use uuids, the other engine sequence-ids.
The important assumption we are making is
that every id that is used in the
refresh-state has to be defined in the
lineage.
So the question about using uuids is
rather, can the query engine trust that
the id defined in the lineage is the uuid
of the table.
Regarding the complexity that comes from
introducing the lineage in the view I
would like to revisit question 1.
Introducing the lineage in the view
metadata opens up the question of when
should the lineage be fully expanded. We
see that we have 3 options:
1. Not fully expanded lineage ->
Expansion at read time
2. Fully expanded lineage -> Expansion at
creation time
3. No lineage (use identifiers in
refresh-state) -> Expansion at refresh time
As reading is expected to be the most
frequent operation I see option 1 as not
favorable. As the query engine has to
fully expand the query tree for a refresh
anyway, I see option 3 as the most
natural. For a refresh operation the
query engine must understand the SQL
dialects of all views in the query tree
and therefore is guaranteed to
successfully expand the lineage. This
might not be the case at creation time,
which makes option 2 less favorable.
As can be seen, I'm in favor of just
storing the refresh-state as a map from
identifier to snapshot-id and not using
the lineage. I know that this introduces
the concept of a catalog identifiers to
the table metadata spec, but in my
opinion it is by far the simplest option.
I'm interested in your opinions.
Best wishes,
Jan
On 14.08.24 22:24, Walaa Eldin Moustafa
wrote:
Thanks Benny. For refs, I am +1 to
represent them as UUID + optional ref,
although we can iterate ohe exact JSON
structure (e.g., another option is
splitting for (UUID) state from (UUID +
ref) state into two separate
higher-level fields).
Generally agree on REFRESH VIEW strategy
could be up to the engine, but it seems
like an area where Iceberg could have an
opinion/spec on. I will start a separate
thread for that.
Thanks,
Walaa.