Re: [DISCUSS] Materialized Views: Lineage and State information

Walaa Eldin Moustafa Fri, 16 Aug 2024 16:56:04 -0700

Thanks Micah, for the latter, I meant the type of denormalization of
repeating a 3-part name as opposed to using an ID.


On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> However, this still does not address the semantic issue which is more
>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>> table identifiers and this use will be the first break of this abstraction.
>
>
> IIUC, based on Jan's comments, we are not going to modify the table
> specification.  I thought the state map was effectively opaque metadata
> from the table specification perspective?  If this is the case I feel like
> that is OK and not a blocker, I think by their nature as already discussed
> MVs need catalog information to function properly and the choice to put
> catalog information into the table metadata is pragmatic and preserves
> other desirable properties.  It might be a more important point if we want
> to update the table specification (IMO, I still think it would probably be
> OK).
>
> On a side note, it does not address the denormalization issue either if we
>> ever want to introduce the lineage in the view as a nice-to-have.
>
>
> I think if lineage is introduced to the View metadata, it should only hold
> direct dependencies for the reasons already discussed. IMO, I think the
> potential overlap is OK as they serve two different purposes.
>
> Cheers,
> Micah
>
>
>
>
>
> On Fri, Aug 16, 2024 at 4:27 PM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> That is right. I agree that in the case of using catalog identifiers in
>> state information, using them in lineage information would be a
>> nice-to-have and not a requirement.
>>
>> However, this still does not address the semantic issue which is more
>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>> table identifiers and this use will be the first break of this abstraction.
>>
>> On a side note, it does not address the denormalization issue either if
>> we ever want to introduce the lineage in the view as a nice-to-have.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul <jank...@mailbox.org.invalid>
>> wrote:
>>
>>> Hi Walaa,
>>>
>>> I would argue that for the refresh operation the query engine has to
>>> parse the query and then somehow execute it. For a full refresh it will
>>> directly execute the query and for a incremental refresh it will execute a
>>> modified version. Therefore it has to fully expand the query tree.
>>>
>>> Best wishes,
>>>
>>> Jan
>>>
>>> Am 16.08.2024 18:13 schrieb Walaa Eldin Moustafa <wa.moust...@gmail.com
>>> >:
>>>
>>> Thanks Jan for the summary.
>>>
>>> For this point:
>>>
>>> > For a refresh operation the query engine has to parse the SQL and
>>> fully expand the lineage with it's children anyway.  So the lineage is not
>>> strictly required.
>>>
>>> If the lineage is provided at creation time by the respective engine,
>>> the refresh operation does not need to parse the SQL, correct?
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul <jank...@mailbox.org.invalid>
>>> wrote:
>>>
>>> As the table I created is not properly shown in the mailing list I'll
>>> reformat the summary of the different drawbacks again:
>>>
>>> Drawbacks of (no lineage, refresh-state key = identifier):
>>>
>>> - introduces catalog identifiers into table metadata (#4)
>>> - query engine has to expand lineage at refresh time (happens anyway)
>>>
>>> Drawbacks of (normalized lineage, refresh-state key = uuid):
>>>
>>> - recursive calls to catalog to expand lineage at read time (#2)
>>> - fragile by requiring child views to have lineage field
>>>
>>> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>>>
>>> - update of materialized view version required if child view is updated
>>> (#5)
>>> On 16.08.24 09:17, Jan Kaul wrote:
>>>
>>> Hi,
>>>
>>> Thanks Micah for clearly stating the requirements. I think this gives
>>> better clarity for the discussion.
>>>
>>> It seems like we don't have a solution that satisfies all requirements
>>> at once. So we would need to choose which has the fewest drawbacks.
>>>
>>> I would like to summarize the different drawbacks that came up in the
>>> discussion.
>>> no lineage
>>> + refresh-state key = identifier
>>> normalized lineage
>>> + refresh-state key = uuid
>>> denormalized lineage
>>> + refresh-state key = uuid
>>> - introduces catalog identifiers into table metadata (#4)
>>> - query engine has to expand lineage at refresh time (happens anyway)
>>> - recursive calls to catalog to expand lineage at read time (#2)
>>> - fragile by requiring child views to have lineage field
>>> - update of materialized view version required if child view is updated
>>> (#5)
>>>
>>> With identifiers as the refresh-state keys, the lineage is not strictly
>>> required and becomes an orthogonal proposal. That's why I left it out if
>>> the comparison.
>>>
>>> In my opinion introducing catalog identifiers into the table metadata
>>> (requirement #4) is the least significant drawback as it is not a technical
>>> reason but more about semantics. Especially as the identifiers are not
>>> introduced into the table spec but are rather stored in the snapshot
>>> summary. That's why I'm in favor of using the catalog identifiers as the
>>> refresh-state keys.
>>>
>>> Regarding your last point Walaa:
>>>
>>> The option of using catalog identifiers in the state map still requires
>>> keeping lineage information in the view because REFRESH MV needs the latest
>>> fully expanded children (which could have changed from the set of children
>>> currently in the state map), without reparsing the view tree.
>>>
>>> For a refresh operation the query engine has to parse the SQL and fully
>>> expand the lineage with it's children anyway.  So the lineage is not
>>> strictly required.
>>>
>>> If I understand correctly, most of you are also in favor of using
>>> catalog identifiers + ref as the refresh-state keys and postponing the
>>> lineage proposal.
>>>
>>> I hope that we can move the discussion forward.
>>>
>>> Jan
>>> On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
>>>
>>> The option of using catalog identifiers in the state map still requires
>>> keeping lineage information in the view because REFRESH MV needs the latest
>>> fully expanded children (which could have changed from the set of children
>>> currently in the state map), without reparsing the view tree. Therefore,
>>> catalog identifiers in the state map, does not eliminate the need for
>>> tracking children in the form of catalog identifiers in the lineage side
>>> (but in this case lineage will be a set instead of just a map).
>>>
>>> Hence, my concerns with using catalog identifiers (as opposed to UUIDs)
>>> are:
>>> * The fundamental issue where the table spec depends on/refers to the
>>> view spec (because such catalog identifiers are not defined in the table
>>> spec and the only place they have a meaning is in the view spec lineage
>>> information).
>>> * (less fundamental) The denormalization introduced by this arrangement,
>>> where each identifier is 3-parts and all of them repeat in both lineage
>>> info and state map.
>>>
>>> I am not very concerned with recursive expansion (through multiple
>>> calls), as it is always the case with views.
>>>
>>> On a positive note, looks like we agree to move past sequence numbers :)
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>>
>>>
>>> On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>> I think given the constraint that catalog lookup has to be by identifier
>>> and not UUID, I'd prefer using identifier in the refresh state.  If we use
>>> identifiers, we can directly parallelize the catalog calls to fetch the
>>> latest state.  If we use UUID, the engine has to go back to the MV and
>>> possibly additional views to reconstruct the lineage map.  It's just a lot
>>> slower and more work for the engine when there is a MV that references a
>>> lot of views (and those views reference additional views).
>>>
>>>
>>> I'm +1 on using catalog identifiers as the key.  As you point out this
>>> is inline with #2 (try to minimize serial catalog lookups) in addition to
>>> supporting requirement #3.
>>>
>>> On Thu, Aug 15, 2024 at 3:27 PM Benny Chow <btc...@gmail.com> wrote:
>>>
>>> I think given the constraint that catalog lookup has to be by identifier
>>> and not UUID, I'd prefer using identifier in the refresh state.  If we use
>>> identifiers, we can directly parallelize the catalog calls to fetch the
>>> latest state.  If we use UUID, the engine has to go back to the MV and
>>> possibly additional views to reconstruct the lineage map.  It's just a lot
>>> slower and more work for the engine when there is a MV that references a
>>> lot of views (and those views reference additional views).
>>>
>>> Thanks
>>> Benny
>>>
>>>
>>> On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
>>> Thanks Jan, Micah, and Karuppayya for chiming in.
>>>
>>> I do not think 3 and 4 are at odds with each other (for example
>>> maintaining both lineage map and state map through UUID can achieve both).
>>> Also, I do not think we can drop the lineage map since in many catalogs,
>>> the only lookup method is by the catalog identifier, and not the UUID.
>>>
>>> I think if we go with UUIDs in the state, we should have a lineage map
>>> (from identifiers to UUIDs) to go with it.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Thu, Aug 15, 2024 at 1:45 PM karuppayya <karuppayya1...@gmail.com>
>>> wrote:
>>>
>>> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
>>> deferring the inclusion of lineage to a future iteration.(like Micha
>>> mentioned)
>>> This would greatly simplify the current design.
>>>
>>> Also in terms of identifiers to use(UUID or catalog identifier) for the
>>> refresh state
>>> We will not be able to fetch the table/View using the UUID alone, for
>>> example from Hive based catalog.
>>> We do not have the direct mapping between UUID and table/view.
>>> Which leaves us only with the catalog identifiers?
>>>
>>> Thanks & Regards
>>> Karuppayya
>>>
>>>
>>> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>> I think it might be worth restating perceived requirements and making
>>> sure there is alignment on them.
>>>
>>> If I am reading correctly, I think the following are perceived
>>> requirements:
>>> 1. An engine must be able to unambiguously detect that an underlying
>>> queried entity has changed or not via metadata to decide if materialized
>>> table data can be used.
>>> 2. The number of sequential catalog reads an engine needs to make to
>>> make use of a materialized table state at read time is minimized.
>>> 3. Engines that don't understand a SQL dialect can still use MV
>>> information if it is not stale.
>>> 4. Table refs (catalog identifiers) should not appear in the
>>> materialized table metadata (i.e. state).
>>> 5. The view part of the MV definition should not need a new revision for
>>> any changes to objects it queries as long as their schemas stay compatible
>>> (only state information on the materialized table need to change).
>>>
>>> In my mind, requirement 1, is the only true requirement.  I think this
>>> necessitates having UUID + snapshot ID as part of the state information
>>> (not necessarily part of the Lineage).  I think it also necessitates having
>>> a denormalized view of all entities that are inputs into the MV in the
>>> state information (a view object might not change but its underlying tables
>>> or views could change and that must be detected).
>>>
>>> Requirements 2 and 5 are somewhat at odds with each other.  If
>>> information is denormalized (fully expanded) in Lineage, it means if table
>>> information is somehow dropped from an intermediate view, one would need to
>>> update the view (or make excess calls to the catalog). In my mind, this
>>> argues for normalization of the lineage stored on the view (with the cost
>>> of potentially 1 additional serial catalog lookup once the state
>>> information is retrieved).
>>>
>>> I think #3 is at odds with #4.  I think #3 is more worthwhile, then
>>> keeping #4 (and as Jan noted #4 adds complexity).
>>>
>>> I think the last remaining question is if lineage serves any purpose.  I
>>> think it is useful for the following reasons:
>>> a)  When there are no intermediate views queried, it allows for fully
>>> parallelized lookup calls to the catalog without having to parse the SQL
>>> statement first
>>> b)  Allows tools that don't need to lookup state information  or parse
>>> SQL but still navigate MV/view trees.
>>>
>>> Both of these seem relatively minor, so lineage could perhaps be left
>>> out in the first iteration.
>>>
>>> As it applies to Jan's questions:
>>>
>>> 1. Should we move the identifiers out of the refresh-state into a new
>>> lineage record that is stored as part of the view metadata?
>>>
>>> No, I don't think so, I think #5 is a reasonable requirement and I think
>>> this violates it.
>>>
>>>
>>> 2. If yes, should the lineage in the view be fully expanded?
>>>
>>> No, I think only the state should be fully expanded (for reasons
>>> mentioned above, it potentially requires more updates to the view then
>>> necessary).
>>>
>>>
>>> 3. What should be used as an identifier in the lineage to reference
>>> entries in the refresh-state?
>>>
>>>
>>> Catalog identifiers make sense to me.  If we agree requirement #3 is not
>>> a requirement then it seems like this could also be UUIDs.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Thu, Aug 15, 2024 at 7:57 AM Benny Chow <btc...@gmail.com> wrote:
>>>
>>> If we go with either UUID or Table Identifier + VersionID/SnapshotId in
>>> the refresh state, then this list is fully expanded already.  So, to
>>> validate the freshness of a materialization, the engine doesn't even need
>>> to look at the view lineage.  IMO, the view lineage is nice to have but not
>>> a necessary requirement for MVs.  The view lineage makes sharing of views
>>> between engines without common SQL dialects possible.
>>>
>>> Benny
>>>
>>> On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul <jank...@mailbox.org.invalid>
>>> <jank...@mailbox.org.invalid> wrote:
>>>
>>> Hi all,
>>>
>>> I would like to reemphasize the purpose of the refresh-state for
>>> materialized views. The purpose is to determine if the precomputed data is
>>> fresh, stale or invalid. For that the current snapshot-id of every table in
>>> the query tree has to be fetched from the catalog by using its full
>>> identifier and ref. Additionally the refresh state stores the snapshot-id
>>> of the last refresh.
>>>
>>> To summarize: *To determine the freshness of the precomputed data we
>>> require the full identifier + ref and snapshot-id of the last refresh for
>>> every table in the fully expanded query tree*
>>>
>>> This is a requirement from how the catalog works and independent from
>>> how we design the lineage/refresh state. Additionally we previously agreed
>>> that we should be able to obtain the full list of identifiers without
>>> needing to parse the SQL definition.
>>>
>>> Now we are having a discussion in how to store and obtain the fully
>>> expanded list of table identifiers and snapshot-ids. To move the discussion
>>> forward I think it would be valuable to answer the following 3 questions:
>>>
>>> 1. Should we move the identifiers out of the refresh-state into a new
>>> lineage record that is stored as part of the view metadata?
>>>
>>> 2. If yes, should the lineage in the view be fully expanded?
>>>
>>> 3. What should be used as an identifier in the lineage to reference
>>> entries in the refresh-state?
>>>
>>> 1. Question:
>>>
>>> We already agreed that this would be a good idea because we wouldn't
>>> introduce the identifier concept to the table metadata. However, looking at
>>> the complexity that comes with the alternatives, I would like to keep this
>>> question open.
>>>
>>> 2. Question:
>>>
>>> I'm against using a not fully expanded lineage in the view struct. To
>>> recall we require every identifier in the fully expanded query tree to
>>> determine the freshness. Not storing all identifiers in the lineage would
>>> mean to recursively call the catalog and expand the query tree at read
>>> time. This can lead to a large overhead for determining the refresh state
>>> compared to expanding the query tree once at creation time and then storing
>>> the fully expanded lineage.
>>>
>>> 3. Question:
>>>
>>> This depends on Question 2.
>>>
>>> For a not fully expanded lineage, the only options would be uuids or
>>> catalog identifiers.
>>>
>>> For a fully expanded lineage the question isn't all that relevant. The
>>> current design specifies that the lineage is a map from an identifier to an
>>> id and the refresh-state is a map from such id to a snapshot-id. For this
>>> to work we don't have to specify which kind of identifier has to be used.
>>> One query engine could use uuids, the other engine sequence-ids. The
>>> important assumption we are making is that every id that is used in the
>>> refresh-state has to be defined in the lineage.
>>> So the question about using uuids is rather, can the query engine trust
>>> that the id defined in the lineage is the uuid of the table.
>>>
>>>
>>> Regarding the complexity that comes from introducing the lineage in the
>>> view I would like to revisit question 1. Introducing the lineage in the
>>> view metadata opens up the question of when should the lineage be fully
>>> expanded. We see that we have 3 options:
>>>
>>> 1. Not fully expanded lineage -> Expansion at read time
>>>
>>> 2. Fully expanded lineage -> Expansion at creation time
>>>
>>> 3. No lineage (use identifiers in refresh-state) -> Expansion at refresh
>>> time
>>>
>>> As reading is expected to be the most frequent operation I see option 1
>>> as not favorable. As the query engine has to fully expand the query tree
>>> for a refresh anyway, I see option 3 as the most natural. For a refresh
>>> operation the query engine must understand the SQL dialects of all views in
>>> the query tree and therefore is guaranteed to successfully expand the
>>> lineage. This might not be the case at creation time, which makes option 2
>>> less favorable.
>>>
>>> As can be seen, I'm in favor of just storing the refresh-state as a map
>>> from identifier to snapshot-id and not using the lineage. I know that this
>>> introduces the concept of a catalog identifiers to the table metadata spec,
>>> but in my opinion it is by far the simplest option.
>>>
>>> I'm interested in your opinions.
>>>
>>> Best wishes,
>>>
>>> Jan
>>> On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
>>>
>>> Thanks Benny. For refs, I am +1 to represent them as UUID + optional
>>> ref, although we can iterate ohe exact JSON structure (e.g., another option
>>> is splitting for (UUID) state from (UUID + ref) state into two separate
>>> higher-level fields).
>>>
>>> Generally agree on REFRESH VIEW strategy could be up to the engine, but
>>> it seems like an area where Iceberg could have an opinion/spec on. I will
>>> start a separate thread for that.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>>

Re: [DISCUSS] Materialized Views: Lineage and State information

Reply via email to