The option of using catalog identifiers in the state map still
requires keeping lineage information in the view because REFRESH MV
needs the latest fully expanded children (which could have changed
from the set of children currently in the state map),
without reparsing the view tree. Therefore, catalog identifiers in the
state map, does not eliminate the need for tracking children in the
form of catalog identifiers in the lineage side (but in this case
lineage will be a set instead of just a map).
Hence, my concerns with using catalog identifiers (as opposed to
UUIDs) are:
* The fundamental issue where the table spec depends on/refers to the
view spec (because such catalog identifiers are not defined in the
table spec and the only place they have a meaning is in the view spec
lineage information).
* (less fundamental) The denormalization introduced by this
arrangement, where each identifier is 3-parts and all of them repeat
in both lineage info and state map.
I am not very concerned with recursive expansion (through multiple
calls), as it is always the case with views.
On a positive note, looks like we agree to move past sequence numbers :)
Thanks,
Walaa.
On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield
<emkornfi...@gmail.com> wrote:
I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state. If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state. If we
use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map. It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).
I'm +1 on using catalog identifiers as the key. As you point out
this is inline with #2 (try to minimize serial catalog lookups) in
addition to supporting requirement #3.
On Thu, Aug 15, 2024 at 3:27 PM Benny Chow <btc...@gmail.com> wrote:
I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state. If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state. If
we use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map. It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).
Thanks
Benny
On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa
<wa.moust...@gmail.com> wrote:
Thanks Jan, Micah, and Karuppayya for chiming in.
I do not think 3 and 4 are at odds with each other (for
example maintaining both lineage map and state map through
UUID can achieve both). Also, I do not think we can drop
the lineage map since in many catalogs, the only lookup
method is by the catalog identifier, and not the UUID.
I think if we go with UUIDs in the state, we should have a
lineage map (from identifiers to UUIDs) to go with it.
Thanks,
Walaa.
On Thu, Aug 15, 2024 at 1:45 PM karuppayya
<karuppayya1...@gmail.com> wrote:
+1 to storing the refresh state as a map of UUIDs to
snapshot IDs, and deferring the inclusion of lineage
to a future iteration.(like Micha mentioned)
This would greatly simplify the current design.
Also in terms of identifiers to use(UUID or catalog
identifier) for the refresh state
We will not be able to fetch the table/View using the
UUID alone, for example from Hive based catalog.
We do not have the direct mapping between UUID and
table/view.
Which leaves us only with the catalog identifiers?
Thanks & Regards
Karuppayya
On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield
<emkornfi...@gmail.com> wrote:
I think it might be worth restating perceived
requirements and making sure there is alignment on
them.
If I am reading correctly, I think the following
are perceived requirements:
1. An engine must be able to unambiguously detect
that an underlying queried entity has changed or
not via metadata to decide if materialized table
data can be used.
2. The number of sequential catalog reads an
engine needs to make to make use of a materialized
table state at read time is minimized.
3. Engines that don't understand a SQL dialect can
still use MV information if it is not stale.
4. Table refs (catalog identifiers) should not
appear in the materialized table metadata (i.e.
state).
5. The view part of the MV definition should not
need a new revision for any changes to objects it
queries as long as their schemas stay compatible
(only state information on the materialized table
need to change).
In my mind, requirement 1, is the only true
requirement. I think this necessitates having
UUID + snapshot ID as part of the state
information (not necessarily part of the
Lineage). I think it also necessitates having a
denormalized view of all entities that are inputs
into the MV in the state information (a view
object might not change but its underlying tables
or views could change and that must be detected).
Requirements 2 and 5 are somewhat at odds with
each other. If information is denormalized (fully
expanded) in Lineage, it means if table
information is somehow dropped from an
intermediate view, one would need to update the
view (or make excess calls to the catalog). In my
mind, this argues for normalization of the lineage
stored on the view (with the cost of potentially 1
additional serial catalog lookup once the state
information is retrieved).
I think #3 is at odds with #4. I think #3 is more
worthwhile, then keeping #4 (and as Jan noted #4
adds complexity).
I think the last remaining question is if lineage
serves any purpose. I think it is useful for the
following reasons:
a) When there are no intermediate views queried,
it allows for fully parallelized lookup calls to
the catalog without having to parse the SQL
statement first
b) Allows tools that don't need to lookup state
information or parse SQL but still navigate
MV/view trees.
Both of these seem relatively minor, so lineage
could perhaps be left out in the first iteration.
As it applies to Jan's questions:
1. Should we move the identifiers out of the
refresh-state into a new lineage record that
is stored as part of the view metadata?
No, I don't think so, I think #5 is a reasonable
requirement and I think this violates it.
2. If yes, should the lineage in the view be
fully expanded?
No, I think only the state should be fully
expanded (for reasons mentioned above, it
potentially requires more updates to the view then
necessary).
3. What should be used as an identifier in the
lineage to reference entries in the refresh-state?
Catalog identifiers make sense to me. If we agree
requirement #3 is not a requirement then it seems
like this could also be UUIDs.
Thanks,
Micah
On Thu, Aug 15, 2024 at 7:57 AM Benny Chow
<btc...@gmail.com> wrote:
If we go with either UUID or Table
Identifier + VersionID/SnapshotId in the
refresh state, then this list is fully
expanded already. So, to validate the
freshness of a materialization, the engine
doesn't even need to look at the view
lineage. IMO, the view lineage is nice to
have but not a necessary requirement for MVs.
The view lineage makes sharing of views
between engines without common SQL dialects
possible.
Benny
On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Hi all,
I would like to reemphasize the purpose of
the refresh-state for materialized views.
The purpose is to determine if the
precomputed data is fresh, stale or
invalid. For that the current snapshot-id
of every table in the query tree has to be
fetched from the catalog by using its full
identifier and ref. Additionally the
refresh state stores the snapshot-id of
the last refresh.
To summarize: *To determine the freshness
of the precomputed data we require the
full identifier + ref and snapshot-id of
the last refresh for every table in the
fully expanded query tree*
This is a requirement from how the catalog
works and independent from how we design
the lineage/refresh state. Additionally we
previously agreed that we should be able
to obtain the full list of identifiers
without needing to parse the SQL definition.
Now we are having a discussion in how to
store and obtain the fully expanded list
of table identifiers and snapshot-ids. To
move the discussion forward I think it
would be valuable to answer the following
3 questions:
1. Should we move the identifiers out of
the refresh-state into a new lineage
record that is stored as part of the view
metadata?
2. If yes, should the lineage in the view
be fully expanded?
3. What should be used as an identifier in
the lineage to reference entries in the
refresh-state?
1. Question:
We already agreed that this would be a
good idea because we wouldn't introduce
the identifier concept to the table
metadata. However, looking at the
complexity that comes with the
alternatives, I would like to keep this
question open.
2. Question:
I'm against using a not fully expanded
lineage in the view struct. To recall we
require every identifier in the fully
expanded query tree to determine the
freshness. Not storing all identifiers in
the lineage would mean to recursively call
the catalog and expand the query tree at
read time. This can lead to a large
overhead for determining the refresh state
compared to expanding the query tree once
at creation time and then storing the
fully expanded lineage.
3. Question:
This depends on Question 2.
For a not fully expanded lineage, the only
options would be uuids or catalog identifiers.
For a fully expanded lineage the question
isn't all that relevant. The current
design specifies that the lineage is a map
from an identifier to an id and the
refresh-state is a map from such id to a
snapshot-id. For this to work we don't
have to specify which kind of identifier
has to be used. One query engine could use
uuids, the other engine sequence-ids. The
important assumption we are making is that
every id that is used in the refresh-state
has to be defined in the lineage.
So the question about using uuids is
rather, can the query engine trust that
the id defined in the lineage is the uuid
of the table.
Regarding the complexity that comes from
introducing the lineage in the view I
would like to revisit question 1.
Introducing the lineage in the view
metadata opens up the question of when
should the lineage be fully expanded. We
see that we have 3 options:
1. Not fully expanded lineage -> Expansion
at read time
2. Fully expanded lineage -> Expansion at
creation time
3. No lineage (use identifiers in
refresh-state) -> Expansion at refresh time
As reading is expected to be the most
frequent operation I see option 1 as not
favorable. As the query engine has to
fully expand the query tree for a refresh
anyway, I see option 3 as the most
natural. For a refresh operation the query
engine must understand the SQL dialects of
all views in the query tree and therefore
is guaranteed to successfully expand the
lineage. This might not be the case at
creation time, which makes option 2 less
favorable.
As can be seen, I'm in favor of just
storing the refresh-state as a map from
identifier to snapshot-id and not using
the lineage. I know that this introduces
the concept of a catalog identifiers to
the table metadata spec, but in my opinion
it is by far the simplest option.
I'm interested in your opinions.
Best wishes,
Jan
On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
Thanks Benny. For refs, I am +1 to
represent them as UUID + optional ref,
although we can iterate ohe exact JSON
structure (e.g., another option is
splitting for (UUID) state from (UUID +
ref) state into two separate higher-level
fields).
Generally agree on REFRESH VIEW strategy
could be up to the engine, but it seems
like an area where Iceberg could have an
opinion/spec on. I will start a separate
thread for that.
Thanks,
Walaa.