Re: [DISCUSS] Materialized Views: Lineage and State information

Jan Kaul Fri, 16 Aug 2024 00:24:05 -0700

As the table I created is not properly shown in the mailing list I'llreformat the summary of the different drawbacks again:


Drawbacks of (no lineage, refresh-state key = identifier):


- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

Drawbacks of (normalized lineage, refresh-state key = uuid):

- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

Drawbacks of (denormalized lineage, refresh-state key = uuid):

- update of materialized view version required if child view is updated (#5)

On 16.08.24 09:17, Jan Kaul wrote:

Hi,

Thanks Micah for clearly stating the requirements. I think this givesbetter clarity for the discussion.

It seems like we don't have a solution that satisfies all requirementsat once. So we would need to choose which has the fewest drawbacks.

I would like to summarize the different drawbacks that came up in thediscussion.


no lineage
+ refresh-state key = identifier
        normalized lineage
+ refresh-state key = uuid
        denormalized lineage
+ refresh-state key = uuid
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)
        - recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

- update of materialized view version required if child view isupdated (#5)

With identifiers as the refresh-state keys, the lineage is notstrictly required and becomes an orthogonal proposal. That's why Ileft it out if the comparison.

In my opinion introducing catalog identifiers into the table metadata(requirement #4) is the least significant drawback as it is not atechnical reason but more about semantics. Especially as theidentifiers are not introduced into the table spec but are ratherstored in the snapshot summary. That's why I'm in favor of using thecatalog identifiers as the refresh-state keys.


Regarding your last point Walaa:

The option of using catalog identifiers in the state map stillrequires keeping lineage information in the view because REFRESH MVneeds the latest fully expanded children (which could have changedfrom the set of children currently in the state map),without reparsing the view tree.

For a refresh operation the query engine has to parse the SQL andfully expand the lineage with it's children anyway. So the lineage isnot strictly required.

If I understand correctly, most of you are also in favor of usingcatalog identifiers + ref as the refresh-state keys and postponing thelineage proposal.


I hope that we can move the discussion forward.

Jan

On 16.08.24 08:07, Walaa Eldin Moustafa wrote:

The option of using catalog identifiers in the state map stillrequires keeping lineage information in the view because REFRESH MVneeds the latest fully expanded children (which could have changedfrom the set of children currently in the state map),without reparsing the view tree. Therefore, catalog identifiers inthe state map, does not eliminate the need for tracking children inthe form of catalog identifiers in the lineage side (but in this caselineage will be a set instead of just a map).

Hence, my concerns with using catalog identifiers (as opposed toUUIDs) are:* The fundamental issue where the table spec depends on/refers to theview spec (because such catalog identifiers are not defined in thetable spec and the only place they have a meaning is in the view speclineage information).* (less fundamental) The denormalization introduced by thisarrangement, where each identifier is 3-parts and all of them repeatin both lineage info and state map.

I am not very concerned with recursive expansion (through multiplecalls), as it is always the case with views.


On a positive note, looks like we agree to move past sequence numbers :)

Thanks,
Walaa.

On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield<[email protected]> wrote:


        I think given the constraint that catalog lookup has to be by
        identifier and not UUID, I'd prefer using identifier in the
        refresh state.  If we use identifiers, we can directly
        parallelize the catalog calls to fetch the latest state.  If
        we use UUID, the engine has to go back to the MV and possibly
        additional views to reconstruct the lineage map.  It's just a
        lot slower and more work for the engine when there is a MV
        that references a lot of views (and those views reference
        additional views).


    I'm +1 on using catalog identifiers as the key.  As you point out
    this is inline with #2 (try to minimize serial catalog lookups)
    in addition to supporting requirement #3.

    On Thu, Aug 15, 2024 at 3:27 PM Benny Chow <[email protected]> wrote:

        I think given the constraint that catalog lookup has to be by
        identifier and not UUID, I'd prefer using identifier in the
        refresh state.  If we use identifiers, we can directly
        parallelize the catalog calls to fetch the latest state.  If
        we use UUID, the engine has to go back to the MV and possibly
        additional views to reconstruct the lineage map.  It's just a
        lot slower and more work for the engine when there is a MV
        that references a lot of views (and those views reference
        additional views).

        Thanks
        Benny


        On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa
        <[email protected]> wrote:

            Thanks Jan, Micah, and Karuppayya for chiming in.

            I do not think 3 and 4 are at odds with each other (for
            example maintaining both lineage map and state map
            through UUID can achieve both). Also, I do not think we
            can drop the lineage map since in many catalogs, the only
            lookup method is by the catalog identifier, and not the UUID.

            I think if we go with UUIDs in the state, we should have
            a lineage map (from identifiers to UUIDs) to go with it.

            Thanks,
            Walaa.


            On Thu, Aug 15, 2024 at 1:45 PM karuppayya
            <[email protected]> wrote:

                +1 to storing the refresh state as a map of UUIDs to
                snapshot IDs, and deferring the inclusion of lineage
                to a future iteration.(like Micha mentioned)
                This would greatly simplify the current design.

                Also in terms of identifiers to use(UUID or catalog
                identifier) for the refresh state
                We will not be able to fetch the table/View using the
                UUID alone, for example from Hive based catalog.
                We do not have the direct mapping between UUID and
                table/view.
                Which leaves us only with the catalog identifiers?

                Thanks & Regards
                Karuppayya


                On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield
                <[email protected]> wrote:

                    I think it might be worth restating perceived
                    requirements and making sure there is alignment
                    on them.

                    If I am reading correctly, I think the following
                    are perceived requirements:
                    1. An engine must be able to unambiguously detect
                    that an underlying queried entity has changed or
                    not via metadata to decide if materialized table
                    data can be used.
                    2. The number of sequential catalog reads an
                    engine needs to make to make use of a
                    materialized table state at read time is minimized.
                    3. Engines that don't understand a SQL dialect
                    can still use MV information if it is not stale.
                    4. Table refs (catalog identifiers) should not
                    appear in the materialized table metadata (i.e.
                    state).
                    5. The view part of the MV definition should not
                    need a new revision for any changes to objects it
                    queries as long as their schemas stay compatible
                    (only state information on the materialized table
                    need to change).

                    In my mind, requirement 1, is the only true
                    requirement.  I think this necessitates having
                    UUID + snapshot ID as part of the state
                    information (not necessarily part of the
                    Lineage).  I think it also necessitates having a
                    denormalized view of all entities that are inputs
                    into the MV in the state information (a view
                    object might not change but its underlying tables
                    or views could change and that must be detected).

                    Requirements 2 and 5 are somewhat at odds with
                    each other.  If information is denormalized
                    (fully expanded) in Lineage, it means if table
                    information is somehow dropped from an
                    intermediate view, one would need to update the
                    view (or make excess calls to the catalog). In my
                    mind, this argues for normalization of the
                    lineage stored on the view (with the cost of
                    potentially 1 additional serial catalog lookup
                    once the state information is retrieved).

                    I think #3 is at odds with #4.  I think #3 is
                    more worthwhile, then keeping #4 (and as Jan
                    noted #4 adds complexity).

                    I think the last remaining question is if lineage
                    serves any purpose.  I think it is useful for the
                    following reasons:
                    a)  When there are no intermediate views queried,
                    it allows for fully parallelized lookup calls to
                    the catalog without having to parse the SQL
                    statement first
                    b)  Allows tools that don't need to lookup state
                    information  or parse SQL but still navigate
                    MV/view trees.

                    Both of these seem relatively minor, so lineage
                    could perhaps be left out in the first iteration.

                    As it applies to Jan's questions:

                        1. Should we move the identifiers out of the
                        refresh-state into a new lineage record that
                        is stored as part of the view metadata?

                    No, I don't think so, I think #5 is a reasonable
                    requirement and I think this violates it.

                        2. If yes, should the lineage in the view be
                        fully expanded?

                    No, I think only the state should be fully
                    expanded (for reasons mentioned above, it
                    potentially requires more updates to the view
                    then necessary).

                        3. What should be used as an identifier in
                        the lineage to reference entries in the
                        refresh-state?


                    Catalog identifiers make sense to me.  If we
                    agree requirement #3 is not a requirement then it
                    seems like this could also be UUIDs.

                    Thanks,
                    Micah

                    On Thu, Aug 15, 2024 at 7:57 AM Benny Chow
                    <[email protected]> wrote:

                        If we go with either UUID or Table
                        Identifier + VersionID/SnapshotId in the
                        refresh state, then this list is fully
                        expanded already.  So, to validate the
                        freshness of a materialization, the engine
                        doesn't even need to look at the view
                        lineage.  IMO, the view lineage is nice to
                        have but not a necessary requirement for
                        MVs.  The view lineage makes sharing of views
                        between engines without common SQL dialects
                        possible.

                        Benny

                        On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul
                        <[email protected]> wrote:

                            Hi all,

                            I would like to reemphasize the purpose
                            of the refresh-state for materialized
                            views. The purpose is to determine if the
                            precomputed data is fresh, stale or
                            invalid. For that the current snapshot-id
                            of every table in the query tree has to
                            be fetched from the catalog by using its
                            full identifier and ref. Additionally the
                            refresh state stores the snapshot-id of
                            the last refresh.

                            To summarize: *To determine the freshness
                            of the precomputed data we require the
                            full identifier + ref and snapshot-id of
                            the last refresh for every table in the
                            fully expanded query tree*

                            This is a requirement from how the
                            catalog works and independent from how we
                            design the lineage/refresh state.
                            Additionally we previously agreed that we
                            should be able to obtain the full list of
                            identifiers without needing to parse the
                            SQL definition.

                            Now we are having a discussion in how to
                            store and obtain the fully expanded list
                            of table identifiers and snapshot-ids. To
                            move the discussion forward I think it
                            would be valuable to answer the following
                            3 questions:

                            1. Should we move the identifiers out of
                            the refresh-state into a new lineage
                            record that is stored as part of the view
                            metadata?

                            2. If yes, should the lineage in the view
                            be fully expanded?

                            3. What should be used as an identifier
                            in the lineage to reference entries in
                            the refresh-state?

                            1. Question:

                            We already agreed that this would be a
                            good idea because we wouldn't introduce
                            the identifier concept to the table
                            metadata. However, looking at the
                            complexity that comes with the
                            alternatives, I would like to keep this
                            question open.

                            2. Question:

                            I'm against using a not fully expanded
                            lineage in the view struct. To recall we
                            require every identifier in the fully
                            expanded query tree to determine the
                            freshness. Not storing all identifiers in
                            the lineage would mean to recursively
                            call the catalog and expand the query
                            tree at read time. This can lead to a
                            large overhead for determining the
                            refresh state compared to expanding the
                            query tree once at creation time and then
                            storing the fully expanded lineage.

                            3. Question:

                            This depends on Question 2.

                            For a not fully expanded lineage, the
                            only options would be uuids or catalog
                            identifiers.

                            For a fully expanded lineage the question
                            isn't all that relevant. The current
                            design specifies that the lineage is a
                            map from an identifier to an id and the
                            refresh-state is a map from such id to a
                            snapshot-id. For this to work we don't
                            have to specify which kind of identifier
                            has to be used. One query engine could
                            use uuids, the other engine sequence-ids.
                            The important assumption we are making is
                            that every id that is used in the
                            refresh-state has to be defined in the
                            lineage.
                            So the question about using uuids is
                            rather, can the query engine trust that
                            the id defined in the lineage is the uuid
                            of the table.


                            Regarding the complexity that comes from
                            introducing the lineage in the view I
                            would like to revisit question 1.
                            Introducing the lineage in the view
                            metadata opens up the question of when
                            should the lineage be fully expanded. We
                            see that we have 3 options:

                            1. Not fully expanded lineage ->
                            Expansion at read time

                            2. Fully expanded lineage -> Expansion at
                            creation time

                            3. No lineage (use identifiers in
                            refresh-state) -> Expansion at refresh time

                            As reading is expected to be the most
                            frequent operation I see option 1 as not
                            favorable. As the query engine has to
                            fully expand the query tree for a refresh
                            anyway, I see option 3 as the most
                            natural. For a refresh operation the
                            query engine must understand the SQL
                            dialects of all views in the query tree
                            and therefore is guaranteed to
                            successfully expand the lineage. This
                            might not be the case at creation time,
                            which makes option 2 less favorable.

                            As can be seen, I'm in favor of just
                            storing the refresh-state as a map from
                            identifier to snapshot-id and not using
                            the lineage. I know that this introduces
                            the concept of a catalog identifiers to
                            the table metadata spec, but in my
                            opinion it is by far the simplest option.

                            I'm interested in your opinions.

                            Best wishes,

                            Jan

                            On 14.08.24 22:24, Walaa Eldin Moustafa
                            wrote:

                            Thanks Benny. For refs, I am +1 to
                            represent them as UUID + optional ref,
                            although we can iterate ohe exact JSON
                            structure (e.g., another option is
                            splitting for (UUID) state from (UUID +
                            ref) state into two separate
                            higher-level fields).

                            Generally agree on REFRESH VIEW strategy
                            could be up to the engine, but it seems
                            like an area where Iceberg could have an
                            opinion/spec on. I will start a separate
                            thread for that.

                            Thanks,
                            Walaa.

Re: [DISCUSS] Materialized Views: Lineage and State information

Reply via email to