Thanks Micah, for the latter, I meant the type of denormalization of repeating a 3-part name as opposed to using an ID.
On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > However, this still does not address the semantic issue which is more >> fundamental in my opinion. The Iceberg table spec is not aware of catalog >> table identifiers and this use will be the first break of this abstraction. > > > IIUC, based on Jan's comments, we are not going to modify the table > specification. I thought the state map was effectively opaque metadata > from the table specification perspective? If this is the case I feel like > that is OK and not a blocker, I think by their nature as already discussed > MVs need catalog information to function properly and the choice to put > catalog information into the table metadata is pragmatic and preserves > other desirable properties. It might be a more important point if we want > to update the table specification (IMO, I still think it would probably be > OK). > > On a side note, it does not address the denormalization issue either if we >> ever want to introduce the lineage in the view as a nice-to-have. > > > I think if lineage is introduced to the View metadata, it should only hold > direct dependencies for the reasons already discussed. IMO, I think the > potential overlap is OK as they serve two different purposes. > > Cheers, > Micah > > > > > > On Fri, Aug 16, 2024 at 4:27 PM Walaa Eldin Moustafa < > wa.moust...@gmail.com> wrote: > >> That is right. I agree that in the case of using catalog identifiers in >> state information, using them in lineage information would be a >> nice-to-have and not a requirement. >> >> However, this still does not address the semantic issue which is more >> fundamental in my opinion. The Iceberg table spec is not aware of catalog >> table identifiers and this use will be the first break of this abstraction. >> >> On a side note, it does not address the denormalization issue either if >> we ever want to introduce the lineage in the view as a nice-to-have. >> >> Thanks, >> Walaa. >> >> >> On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul <jank...@mailbox.org.invalid> >> wrote: >> >>> Hi Walaa, >>> >>> I would argue that for the refresh operation the query engine has to >>> parse the query and then somehow execute it. For a full refresh it will >>> directly execute the query and for a incremental refresh it will execute a >>> modified version. Therefore it has to fully expand the query tree. >>> >>> Best wishes, >>> >>> Jan >>> >>> Am 16.08.2024 18:13 schrieb Walaa Eldin Moustafa <wa.moust...@gmail.com >>> >: >>> >>> Thanks Jan for the summary. >>> >>> For this point: >>> >>> > For a refresh operation the query engine has to parse the SQL and >>> fully expand the lineage with it's children anyway. So the lineage is not >>> strictly required. >>> >>> If the lineage is provided at creation time by the respective engine, >>> the refresh operation does not need to parse the SQL, correct? >>> >>> Thanks, >>> Walaa. >>> >>> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul <jank...@mailbox.org.invalid> >>> wrote: >>> >>> As the table I created is not properly shown in the mailing list I'll >>> reformat the summary of the different drawbacks again: >>> >>> Drawbacks of (no lineage, refresh-state key = identifier): >>> >>> - introduces catalog identifiers into table metadata (#4) >>> - query engine has to expand lineage at refresh time (happens anyway) >>> >>> Drawbacks of (normalized lineage, refresh-state key = uuid): >>> >>> - recursive calls to catalog to expand lineage at read time (#2) >>> - fragile by requiring child views to have lineage field >>> >>> Drawbacks of (denormalized lineage, refresh-state key = uuid): >>> >>> - update of materialized view version required if child view is updated >>> (#5) >>> On 16.08.24 09:17, Jan Kaul wrote: >>> >>> Hi, >>> >>> Thanks Micah for clearly stating the requirements. I think this gives >>> better clarity for the discussion. >>> >>> It seems like we don't have a solution that satisfies all requirements >>> at once. So we would need to choose which has the fewest drawbacks. >>> >>> I would like to summarize the different drawbacks that came up in the >>> discussion. >>> no lineage >>> + refresh-state key = identifier >>> normalized lineage >>> + refresh-state key = uuid >>> denormalized lineage >>> + refresh-state key = uuid >>> - introduces catalog identifiers into table metadata (#4) >>> - query engine has to expand lineage at refresh time (happens anyway) >>> - recursive calls to catalog to expand lineage at read time (#2) >>> - fragile by requiring child views to have lineage field >>> - update of materialized view version required if child view is updated >>> (#5) >>> >>> With identifiers as the refresh-state keys, the lineage is not strictly >>> required and becomes an orthogonal proposal. That's why I left it out if >>> the comparison. >>> >>> In my opinion introducing catalog identifiers into the table metadata >>> (requirement #4) is the least significant drawback as it is not a technical >>> reason but more about semantics. Especially as the identifiers are not >>> introduced into the table spec but are rather stored in the snapshot >>> summary. That's why I'm in favor of using the catalog identifiers as the >>> refresh-state keys. >>> >>> Regarding your last point Walaa: >>> >>> The option of using catalog identifiers in the state map still requires >>> keeping lineage information in the view because REFRESH MV needs the latest >>> fully expanded children (which could have changed from the set of children >>> currently in the state map), without reparsing the view tree. >>> >>> For a refresh operation the query engine has to parse the SQL and fully >>> expand the lineage with it's children anyway. So the lineage is not >>> strictly required. >>> >>> If I understand correctly, most of you are also in favor of using >>> catalog identifiers + ref as the refresh-state keys and postponing the >>> lineage proposal. >>> >>> I hope that we can move the discussion forward. >>> >>> Jan >>> On 16.08.24 08:07, Walaa Eldin Moustafa wrote: >>> >>> The option of using catalog identifiers in the state map still requires >>> keeping lineage information in the view because REFRESH MV needs the latest >>> fully expanded children (which could have changed from the set of children >>> currently in the state map), without reparsing the view tree. Therefore, >>> catalog identifiers in the state map, does not eliminate the need for >>> tracking children in the form of catalog identifiers in the lineage side >>> (but in this case lineage will be a set instead of just a map). >>> >>> Hence, my concerns with using catalog identifiers (as opposed to UUIDs) >>> are: >>> * The fundamental issue where the table spec depends on/refers to the >>> view spec (because such catalog identifiers are not defined in the table >>> spec and the only place they have a meaning is in the view spec lineage >>> information). >>> * (less fundamental) The denormalization introduced by this arrangement, >>> where each identifier is 3-parts and all of them repeat in both lineage >>> info and state map. >>> >>> I am not very concerned with recursive expansion (through multiple >>> calls), as it is always the case with views. >>> >>> On a positive note, looks like we agree to move past sequence numbers :) >>> >>> Thanks, >>> Walaa. >>> >>> >>> >>> >>> On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield <emkornfi...@gmail.com> >>> wrote: >>> >>> I think given the constraint that catalog lookup has to be by identifier >>> and not UUID, I'd prefer using identifier in the refresh state. If we use >>> identifiers, we can directly parallelize the catalog calls to fetch the >>> latest state. If we use UUID, the engine has to go back to the MV and >>> possibly additional views to reconstruct the lineage map. It's just a lot >>> slower and more work for the engine when there is a MV that references a >>> lot of views (and those views reference additional views). >>> >>> >>> I'm +1 on using catalog identifiers as the key. As you point out this >>> is inline with #2 (try to minimize serial catalog lookups) in addition to >>> supporting requirement #3. >>> >>> On Thu, Aug 15, 2024 at 3:27 PM Benny Chow <btc...@gmail.com> wrote: >>> >>> I think given the constraint that catalog lookup has to be by identifier >>> and not UUID, I'd prefer using identifier in the refresh state. If we use >>> identifiers, we can directly parallelize the catalog calls to fetch the >>> latest state. If we use UUID, the engine has to go back to the MV and >>> possibly additional views to reconstruct the lineage map. It's just a lot >>> slower and more work for the engine when there is a MV that references a >>> lot of views (and those views reference additional views). >>> >>> Thanks >>> Benny >>> >>> >>> On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa < >>> wa.moust...@gmail.com> wrote: >>> >>> Thanks Jan, Micah, and Karuppayya for chiming in. >>> >>> I do not think 3 and 4 are at odds with each other (for example >>> maintaining both lineage map and state map through UUID can achieve both). >>> Also, I do not think we can drop the lineage map since in many catalogs, >>> the only lookup method is by the catalog identifier, and not the UUID. >>> >>> I think if we go with UUIDs in the state, we should have a lineage map >>> (from identifiers to UUIDs) to go with it. >>> >>> Thanks, >>> Walaa. >>> >>> >>> On Thu, Aug 15, 2024 at 1:45 PM karuppayya <karuppayya1...@gmail.com> >>> wrote: >>> >>> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and >>> deferring the inclusion of lineage to a future iteration.(like Micha >>> mentioned) >>> This would greatly simplify the current design. >>> >>> Also in terms of identifiers to use(UUID or catalog identifier) for the >>> refresh state >>> We will not be able to fetch the table/View using the UUID alone, for >>> example from Hive based catalog. >>> We do not have the direct mapping between UUID and table/view. >>> Which leaves us only with the catalog identifiers? >>> >>> Thanks & Regards >>> Karuppayya >>> >>> >>> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield <emkornfi...@gmail.com> >>> wrote: >>> >>> I think it might be worth restating perceived requirements and making >>> sure there is alignment on them. >>> >>> If I am reading correctly, I think the following are perceived >>> requirements: >>> 1. An engine must be able to unambiguously detect that an underlying >>> queried entity has changed or not via metadata to decide if materialized >>> table data can be used. >>> 2. The number of sequential catalog reads an engine needs to make to >>> make use of a materialized table state at read time is minimized. >>> 3. Engines that don't understand a SQL dialect can still use MV >>> information if it is not stale. >>> 4. Table refs (catalog identifiers) should not appear in the >>> materialized table metadata (i.e. state). >>> 5. The view part of the MV definition should not need a new revision for >>> any changes to objects it queries as long as their schemas stay compatible >>> (only state information on the materialized table need to change). >>> >>> In my mind, requirement 1, is the only true requirement. I think this >>> necessitates having UUID + snapshot ID as part of the state information >>> (not necessarily part of the Lineage). I think it also necessitates having >>> a denormalized view of all entities that are inputs into the MV in the >>> state information (a view object might not change but its underlying tables >>> or views could change and that must be detected). >>> >>> Requirements 2 and 5 are somewhat at odds with each other. If >>> information is denormalized (fully expanded) in Lineage, it means if table >>> information is somehow dropped from an intermediate view, one would need to >>> update the view (or make excess calls to the catalog). In my mind, this >>> argues for normalization of the lineage stored on the view (with the cost >>> of potentially 1 additional serial catalog lookup once the state >>> information is retrieved). >>> >>> I think #3 is at odds with #4. I think #3 is more worthwhile, then >>> keeping #4 (and as Jan noted #4 adds complexity). >>> >>> I think the last remaining question is if lineage serves any purpose. I >>> think it is useful for the following reasons: >>> a) When there are no intermediate views queried, it allows for fully >>> parallelized lookup calls to the catalog without having to parse the SQL >>> statement first >>> b) Allows tools that don't need to lookup state information or parse >>> SQL but still navigate MV/view trees. >>> >>> Both of these seem relatively minor, so lineage could perhaps be left >>> out in the first iteration. >>> >>> As it applies to Jan's questions: >>> >>> 1. Should we move the identifiers out of the refresh-state into a new >>> lineage record that is stored as part of the view metadata? >>> >>> No, I don't think so, I think #5 is a reasonable requirement and I think >>> this violates it. >>> >>> >>> 2. If yes, should the lineage in the view be fully expanded? >>> >>> No, I think only the state should be fully expanded (for reasons >>> mentioned above, it potentially requires more updates to the view then >>> necessary). >>> >>> >>> 3. What should be used as an identifier in the lineage to reference >>> entries in the refresh-state? >>> >>> >>> Catalog identifiers make sense to me. If we agree requirement #3 is not >>> a requirement then it seems like this could also be UUIDs. >>> >>> Thanks, >>> Micah >>> >>> On Thu, Aug 15, 2024 at 7:57 AM Benny Chow <btc...@gmail.com> wrote: >>> >>> If we go with either UUID or Table Identifier + VersionID/SnapshotId in >>> the refresh state, then this list is fully expanded already. So, to >>> validate the freshness of a materialization, the engine doesn't even need >>> to look at the view lineage. IMO, the view lineage is nice to have but not >>> a necessary requirement for MVs. The view lineage makes sharing of views >>> between engines without common SQL dialects possible. >>> >>> Benny >>> >>> On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul <jank...@mailbox.org.invalid> >>> <jank...@mailbox.org.invalid> wrote: >>> >>> Hi all, >>> >>> I would like to reemphasize the purpose of the refresh-state for >>> materialized views. The purpose is to determine if the precomputed data is >>> fresh, stale or invalid. For that the current snapshot-id of every table in >>> the query tree has to be fetched from the catalog by using its full >>> identifier and ref. Additionally the refresh state stores the snapshot-id >>> of the last refresh. >>> >>> To summarize: *To determine the freshness of the precomputed data we >>> require the full identifier + ref and snapshot-id of the last refresh for >>> every table in the fully expanded query tree* >>> >>> This is a requirement from how the catalog works and independent from >>> how we design the lineage/refresh state. Additionally we previously agreed >>> that we should be able to obtain the full list of identifiers without >>> needing to parse the SQL definition. >>> >>> Now we are having a discussion in how to store and obtain the fully >>> expanded list of table identifiers and snapshot-ids. To move the discussion >>> forward I think it would be valuable to answer the following 3 questions: >>> >>> 1. Should we move the identifiers out of the refresh-state into a new >>> lineage record that is stored as part of the view metadata? >>> >>> 2. If yes, should the lineage in the view be fully expanded? >>> >>> 3. What should be used as an identifier in the lineage to reference >>> entries in the refresh-state? >>> >>> 1. Question: >>> >>> We already agreed that this would be a good idea because we wouldn't >>> introduce the identifier concept to the table metadata. However, looking at >>> the complexity that comes with the alternatives, I would like to keep this >>> question open. >>> >>> 2. Question: >>> >>> I'm against using a not fully expanded lineage in the view struct. To >>> recall we require every identifier in the fully expanded query tree to >>> determine the freshness. Not storing all identifiers in the lineage would >>> mean to recursively call the catalog and expand the query tree at read >>> time. This can lead to a large overhead for determining the refresh state >>> compared to expanding the query tree once at creation time and then storing >>> the fully expanded lineage. >>> >>> 3. Question: >>> >>> This depends on Question 2. >>> >>> For a not fully expanded lineage, the only options would be uuids or >>> catalog identifiers. >>> >>> For a fully expanded lineage the question isn't all that relevant. The >>> current design specifies that the lineage is a map from an identifier to an >>> id and the refresh-state is a map from such id to a snapshot-id. For this >>> to work we don't have to specify which kind of identifier has to be used. >>> One query engine could use uuids, the other engine sequence-ids. The >>> important assumption we are making is that every id that is used in the >>> refresh-state has to be defined in the lineage. >>> So the question about using uuids is rather, can the query engine trust >>> that the id defined in the lineage is the uuid of the table. >>> >>> >>> Regarding the complexity that comes from introducing the lineage in the >>> view I would like to revisit question 1. Introducing the lineage in the >>> view metadata opens up the question of when should the lineage be fully >>> expanded. We see that we have 3 options: >>> >>> 1. Not fully expanded lineage -> Expansion at read time >>> >>> 2. Fully expanded lineage -> Expansion at creation time >>> >>> 3. No lineage (use identifiers in refresh-state) -> Expansion at refresh >>> time >>> >>> As reading is expected to be the most frequent operation I see option 1 >>> as not favorable. As the query engine has to fully expand the query tree >>> for a refresh anyway, I see option 3 as the most natural. For a refresh >>> operation the query engine must understand the SQL dialects of all views in >>> the query tree and therefore is guaranteed to successfully expand the >>> lineage. This might not be the case at creation time, which makes option 2 >>> less favorable. >>> >>> As can be seen, I'm in favor of just storing the refresh-state as a map >>> from identifier to snapshot-id and not using the lineage. I know that this >>> introduces the concept of a catalog identifiers to the table metadata spec, >>> but in my opinion it is by far the simplest option. >>> >>> I'm interested in your opinions. >>> >>> Best wishes, >>> >>> Jan >>> On 14.08.24 22:24, Walaa Eldin Moustafa wrote: >>> >>> Thanks Benny. For refs, I am +1 to represent them as UUID + optional >>> ref, although we can iterate ohe exact JSON structure (e.g., another option >>> is splitting for (UUID) state from (UUID + ref) state into two separate >>> higher-level fields). >>> >>> Generally agree on REFRESH VIEW strategy could be up to the engine, but >>> it seems like an area where Iceberg could have an opinion/spec on. I will >>> start a separate thread for that. >>> >>> Thanks, >>> Walaa. >>> >>> >>>