Re: Materialized Views: Next Steps

Daniel Weeks Mon, 20 May 2024 13:58:37 -0700

Walaa, I think Ryan's comment was more in relation to the combined metadata
approach (which is not the current proposal).  I don't think anything
that's being discussed at this point helps or hurts the view integration
story with catalog apis.


-Dan



On Mon, May 20, 2024 at 1:21 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Thanks Dan! Do you share the concern of having to update engine APIs as
> well before adopting this? Also Ryan had a concern [1] on the number of
> backends that need to be updated.
>
> [1]
> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1#heading=h.rbxigxsh4rfw
>
> Thanks,
> Walaa.
>
>
> On Mon, May 20, 2024 at 11:14 AM Daniel Weeks <dwe...@apache.org> wrote:
>
>> I know I'm coming in late here, but I'm still working through all the
>> prior discussion.  Here are my thoughts so far:
>>
>> I agree that Jan's doc has some really good context and we should
>> continue from there, but we should remove discussion of options as it just
>> creates confusion.  We can reference other material from previous
>> discussions (mailing lists, other docs/prior version, etc), but the focus
>> should be on what we feel the final product will look like as opposed to
>> the path it took to get there.  Let's make sure that the proposal issue
>> <https://github.com/apache/iceberg/issues/10043> is kept up to date and
>> is the primary reference point going forward.
>>
>> As to the question of using properties or updating the view metadata, I
>> agree with Szehon and others who expressed concern about using properties.
>> There are a number of issues I see with standardizing around properties in
>> the table or view definition.  The first is that some of the structures may
>> get complicated.  I'm specifically concerned about lineage information as
>> there can be many table references and possibly even view references that
>> will need to contain additional information.  The second issue is that
>> table and view properties persist across versions, so rollback and even
>> handling of those properties would require careful implementation.  The
>> third issue is that once you've standardized on properties, it is difficult
>> to change direction and any future changes will likely need to be handled
>> through properties as well.  This creates an awkward scenario where you're
>> building a specification around something that isn't particularly well
>> defined or structured.
>>
>> I also don't feel there's a significant difference in the effort involved
>> in updating the metadata representations because we need to define the same
>> properties regardless (most of the work is around the handling, not how the
>> values are defined).  While using properties may seem expedient, I don't
>> think it's really going to move things along much faster and will be more
>> difficult in the long-term.  The one caveat is that I believe we may be
>> able to limit the changes to the view spec and avoid needing a table spec
>> update.  I've added my comments to the doc to that effect.
>>
>> -Dan
>>
>>
>>
>> On Fri, May 17, 2024 at 3:16 PM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> I am not in favor of expanding the spec for use cases that do not
>>> directly serve materialized views. Identifying general lineage is a
>>> separate problem that is also applicable to non-materialized views so maybe
>>> that’s worth discussing in a separate spec. If there is a use case for
>>> timestamp or snapshot level properties for materialize views, we can
>>> discuss them but so far I feel they are redundant. What do you think?
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Fri, May 17, 2024 at 3:05 PM Benny Chow <btc...@gmail.com> wrote:
>>>
>>>> I think it’s still worthwhile to include the snapshot and timestamp
>>>> refs for completeness sake.
>>>>
>>>> Also, Jan brought up interesting use case with BI tool using the MV
>>>> without SQL representation.  The BI tool can get all table and view
>>>> dependencies if the lineage is complete.
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On May 17, 2024, at 1:35 PM, Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>> 
>>>>
>>>> Sounds good. I am assuming we agree it is not required for either
>>>> snapshot or timestamp?
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Fri, May 17, 2024 at 1:17 PM Benny Chow <btc...@gmail.com> wrote:
>>>>
>>>>> I like Jack's suggestions to capture the ref type and value!  When the
>>>>> ref type is branch, the snapshot id is dynamic and so the engine using the
>>>>> MV can validate that the latest snapshot on a branch matches the branch
>>>>> snapshot at the time of materialization.
>>>>>
>>>>> I think if we do this then we don't need to precisely identify the
>>>>> same table (at different snapshots) in the MV's query tree.  So, we don't
>>>>> need to capture any additional properties like alias, parent view, path to
>>>>> root, sequence number, etc.
>>>>>
>>>>> Thanks
>>>>> Benny
>>>>>
>>>>> On Fri, May 17, 2024 at 11:20 AM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Jack, and welcome back!
>>>>>>
>>>>>> Taking a step back, I understand the initial concern was that a table
>>>>>> name (e.g., t1 in your example) would be referenced multiple times in the
>>>>>> view definition and each reference is associated with a different 
>>>>>> snapshot
>>>>>> ID, hence UUID is not sufficient to capture each occurrence/reference. I
>>>>>> proposed:
>>>>>>
>>>>>> * The solution to track unique occurrences is to use something along
>>>>>> the lines of the SQL alias (e.g., "t1" for the first occurrence and "t2"
>>>>>> from "as t2" in your example) to uniquely identify each occurrence -- we
>>>>>> can tweak the representation and explore how to handle this in case of
>>>>>> nested queries, etc, but alias is the main concept to track uniqueness.
>>>>>> * However, since this leads to a series of open ended problems, I
>>>>>> have also suggested avoiding this complexity altogether and not 
>>>>>> supporting
>>>>>> time travel in MVs for now.
>>>>>>
>>>>>> However, thinking again, are not time travel queries in MVs
>>>>>> self-containing the exact snapshot ID that we are trying to track in the
>>>>>> properties? Looks like this information is already encoded in the query 
>>>>>> and
>>>>>> there is no need to capture it externally.
>>>>>>
>>>>>> For example, if the MV definition consists of table references where
>>>>>> all of the references are bound to specific snapshot IDs or timestamps,
>>>>>> then the storage table is always fresh no matter if the underlying tables
>>>>>> change. Tracking snapshot IDs in the storage table is only required for
>>>>>> table references that are not pinned to a specific snapshot ID/timestamp 
>>>>>> in
>>>>>> the view definition, for which UUID is sufficient.
>>>>>>
>>>>>> Thanks,
>>>>>> Walaa.
>>>>>>
>>>>>>
>>>>>> On Fri, May 17, 2024 at 9:51 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone, just want to say I am back from the leave, and
>>>>>>> currently catching up with the threads, I will make more comments later
>>>>>>> after knowing more details of what has been going on. Looks like we've 
>>>>>>> made
>>>>>>> great progress!
>>>>>>>
>>>>>>> Just my 2 cents on the current properties vs metadata field
>>>>>>> discussion. The proposed properties are:
>>>>>>> - in view:
>>>>>>>   1. a boolean flag to indicate a view is a MV
>>>>>>>   2. a pointer to the storage table
>>>>>>> - in storage table:
>>>>>>>   3. view version that is materialized
>>>>>>>   4. a prefix-based map to describe the snapshot version of the base
>>>>>>> tables that are materialized
>>>>>>>   5. a prefix-based map to describe the version of child views that
>>>>>>> are materialized
>>>>>>>
>>>>>>> For 1, 2, and 3, these are all pretty simple and can be just
>>>>>>> properties. I guess 4 and 5 are the main ones that seem complex and can 
>>>>>>> be
>>>>>>> more formalized as metadata fields. I think the time travel cases Bunny
>>>>>>> brought up might be good ones to look into more details:
>>>>>>>
>>>>>>> For direct version travel, I think the base table version serves as
>>>>>>> the default. If you have a MV query like
>>>>>>>
>>>>>>> SELECT * FROM
>>>>>>>   t1,
>>>>>>>   t1 FOR SYSTEM_VERSION AS OF 987654 as t2
>>>>>>>   WHERE t1.c1 = t2.c1
>>>>>>>
>>>>>>> and in the storage table it says t1 maps to snapshot id 123456, then
>>>>>>> the query is still not ambiguous, it should be interpreted as
>>>>>>>
>>>>>>> SELECT * FROM
>>>>>>>   t1 FOR SYSTEM_VERSION AS OF 123456,
>>>>>>>   t1 FOR SYSTEM_VERSION AS OF 987654 as t2
>>>>>>>   WHERE t1.c1 = t2.c1
>>>>>>>
>>>>>>> For ref travel, the specific ref version needs to be fixed at MV
>>>>>>> creation time:
>>>>>>>
>>>>>>> SELECT * FROM
>>>>>>>   t1,
>>>>>>>   t1 FOR SYSTEM_VERSION AS OF '2024-Q1' as t2
>>>>>>>   WHERE t1.c1 = t2.c1
>>>>>>>
>>>>>>> Just storing table UUID is not sufficient. In a property-based
>>>>>>> approach, we need something like
>>>>>>> base.table.<table>.ref.<ref-name>=<snapshot-id>.
>>>>>>>
>>>>>>> Time travel is similar to ref travel:
>>>>>>>
>>>>>>> SELECT * FROM
>>>>>>>   t1,
>>>>>>>   t1 FOR SYSTEM_TIME AS OF timestamp '2024-01-01' as t2
>>>>>>>   WHERE t1.c1 = t2.c1
>>>>>>>
>>>>>>> In a property-based approach, we need something like
>>>>>>> base.table.<table>.time.<timestamp>=<snapshot-id>.
>>>>>>>
>>>>>>> Technically this is indeed getting increasingly complex, so I can
>>>>>>> get why many of us say this property-based approach is quite brittle.
>>>>>>> However, it seems like it can still work as we extend the property
>>>>>>> structure. Personally speaking I am leaning more towards the 
>>>>>>> property-based
>>>>>>> approach just for its simplicity, but I need to think more about other 
>>>>>>> use
>>>>>>> cases as well.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 16, 2024 at 10:21 PM Walaa Eldin Moustafa <
>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think this is orthogonal to the property vs metadata field since
>>>>>>>> instead of representing the property as `base.table.[UUID]` it would be
>>>>>>>> something like `base.table.[alias]` where `alias` is the specific
>>>>>>>> occurrence of the table in the query according to its alias (and SELECT
>>>>>>>> scope possibly, which kind of opens the door to further complexities, 
>>>>>>>> but
>>>>>>>> for the sake of argument -- there is a mapping to properties too).
>>>>>>>>
>>>>>>>> Another question: assuming we go with the top level metadata model,
>>>>>>>> will we still expose this metadata on the engine side as properties? 
>>>>>>>> What
>>>>>>>> would the property names be?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Walaa.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 16, 2024 at 9:55 PM Benny Chow <btc...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Sounds good.
>>>>>>>>>
>>>>>>>>> Another benefit of the struct model is that it's more extensible
>>>>>>>>> in the future when we need to disambiguate the same table that appears
>>>>>>>>> multiple times in the MV query tree.
>>>>>>>>> This could happen with time travel queries or branching.  We may
>>>>>>>>> end up adding additional properties like a sequence number, parent 
>>>>>>>>> view or
>>>>>>>>> path to root.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Thu, May 16, 2024 at 3:57 PM Walaa Eldin Moustafa <
>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Benny, I have responded to the comment.
>>>>>>>>>>
>>>>>>>>>> I would suggest that we use this thread to evaluate properties
>>>>>>>>>> model vs top level metadata model (to avoid discussion drift).
>>>>>>>>>>
>>>>>>>>>> If we have feedback on the actual properties used in the
>>>>>>>>>> properties model as defined in the PR, we can have the discussion 
>>>>>>>>>> there.
>>>>>>>>>>
>>>>>>>>>> THanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, May 16, 2024 at 3:22 PM Benny Chow <btc...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Walaa
>>>>>>>>>>>
>>>>>>>>>>> I left comments in your spec PR:
>>>>>>>>>>> https://github.com/apache/iceberg/pull/10280#pullrequestreview-2061922169
>>>>>>>>>>>  My last question about use cases was really about incremental 
>>>>>>>>>>> refresh with
>>>>>>>>>>> aggregates.  But I think this might be too complicated to try to
>>>>>>>>>>> model/discuss now and so I agree with Micah's comment about doing 
>>>>>>>>>>> it in a
>>>>>>>>>>> future iteration.
>>>>>>>>>>>
>>>>>>>>>>> Hi Jan,
>>>>>>>>>>>
>>>>>>>>>>> Regarding storing the identifiers, I like the idea too.
>>>>>>>>>>> Dremio's query engine supports MVs on sources besides Iceberg 
>>>>>>>>>>> tables.
>>>>>>>>>>> Here's everything that's in a single lineage entry:
>>>>>>>>>>> https://github.com/dremio/dremio-oss/blob/master/services/accelerator/src/main/protobuf/reflection.proto#L80
>>>>>>>>>>> The lineage is stored as a graph and not a list of entries.  I 
>>>>>>>>>>> think for
>>>>>>>>>>> what we are trying to achieve, it's more practical to limit the 
>>>>>>>>>>> lineage to
>>>>>>>>>>> Iceberg sources.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Benny
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 15, 2024 at 12:06 AM Jan Kaul
>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree with Szehon and Benny that storing the lineage
>>>>>>>>>>>> information as multiple table properties is too brittle, 
>>>>>>>>>>>> especially for
>>>>>>>>>>>> many source tables (base tables). I would prefer to have the whole 
>>>>>>>>>>>> lineage
>>>>>>>>>>>> information in one entry as it is more concise. This is also how 
>>>>>>>>>>>> Trino has
>>>>>>>>>>>> been doing it, as you can see here
>>>>>>>>>>>> <https://github.com/trinodb/trino/blob/212455d3e1d393f58cbc395d2b9da47ed8f23dd8/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L2915>
>>>>>>>>>>>> .
>>>>>>>>>>>>
>>>>>>>>>>>> As we've discussed in the google doc
>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.60qmzug7bzxc>:
>>>>>>>>>>>> it is helpful to also store the table identifiers of the source 
>>>>>>>>>>>> tables to
>>>>>>>>>>>> enable clients to determine the freshness of the MV that don't 
>>>>>>>>>>>> understand
>>>>>>>>>>>> the SQL dialect of the MV definition, like other query engines, BI 
>>>>>>>>>>>> tools
>>>>>>>>>>>> and Dataframe libraries. This is also how Trino is doing it. 
>>>>>>>>>>>> That's why we
>>>>>>>>>>>> chose the design in the google doc.
>>>>>>>>>>>>
>>>>>>>>>>>> Storing the storage table identifier as a property might work.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Jan
>>>>>>>>>>>> On 15.05.24 02:38, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks Benny. My specific thoughts about the spec and the
>>>>>>>>>>>> properties are captured in the spec PR
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/10280. The spec is also
>>>>>>>>>>>> implemented in the Spark implementation PR
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9830, and I believe
>>>>>>>>>>>> this follows the same nature of how the information was captured in
>>>>>>>>>>>> Netflix's implementation with Spark, and Trino implementation 
>>>>>>>>>>>> (prior to
>>>>>>>>>>>> formalizing through that spec), both of which have been used 
>>>>>>>>>>>> reliably for
>>>>>>>>>>>> years. I think it also aligns with Ryan's feedback here
>>>>>>>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546
>>>>>>>>>>>>  which
>>>>>>>>>>>> indicated the usage of properties.
>>>>>>>>>>>>
>>>>>>>>>>>> The reasons for choosing properties:
>>>>>>>>>>>> * Not every table is a storage table and not every view is a
>>>>>>>>>>>> materialized view. I feel exposing the info as top level metadata 
>>>>>>>>>>>> is an
>>>>>>>>>>>> overkill for the original object type.
>>>>>>>>>>>> * The properties are simple. They contain either single
>>>>>>>>>>>> snapshot ID each, or single view version each, or lastly, the 
>>>>>>>>>>>> storage table
>>>>>>>>>>>> identifier. Engines can use them without issues (also as shown in 
>>>>>>>>>>>> the
>>>>>>>>>>>> implementation).
>>>>>>>>>>>> * To be meaningful, the metadata fields should be captured in
>>>>>>>>>>>> the engine API as well, which is an effort that has to be pursued 
>>>>>>>>>>>> outside
>>>>>>>>>>>> the Iceberg community. Until engine APIs evolve, we would have to 
>>>>>>>>>>>> define a
>>>>>>>>>>>> mapping between Iceberg metadata fields and engine properties 
>>>>>>>>>>>> (only current
>>>>>>>>>>>> place in engine side to capture this info). This requires an 
>>>>>>>>>>>> additional
>>>>>>>>>>>> spec on its own, and it will introduce complexities. Hence it is 
>>>>>>>>>>>> always
>>>>>>>>>>>> cleaner to map Iceberg properties to engine properties and Iceberg 
>>>>>>>>>>>> metadata
>>>>>>>>>>>> to designated engine APIs. Mixing and matching (e.g., Iceberg 
>>>>>>>>>>>> metadata
>>>>>>>>>>>> fields as engine properties) just for the lack of other cleaner 
>>>>>>>>>>>> options
>>>>>>>>>>>> does not sound like a good idea in both short and long term.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me know your thoughts.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 14, 2024 at 5:12 PM Benny Chow <btc...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with Szheon here.  I think storing the materialization
>>>>>>>>>>>>> lineage as a bunch of properties is brittle.  This lineage 
>>>>>>>>>>>>> information is
>>>>>>>>>>>>> needed by engines to validate the staleness of a materialization 
>>>>>>>>>>>>> and also
>>>>>>>>>>>>> to perform full or incremental refreshes.  There’s a lot to 
>>>>>>>>>>>>> capture here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe we should drill down into the use cases first - such as
>>>>>>>>>>>>> incremental refresh with aggregates?  (Pick a harder one first 😀)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I left a few comments about this in the doc.  I wonder what
>>>>>>>>>>>>> are your thoughts here Walaa?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>
>>>>>>>>>>>>> On May 14, 2024, at 4:20 PM, Walaa Eldin Moustafa <
>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks John. The current metadata does not sound complex. We
>>>>>>>>>>>>> need to track the underlying table snapshot IDs as well as the 
>>>>>>>>>>>>> view version
>>>>>>>>>>>>> ID. I agree as long as it is simple and before this feature fully 
>>>>>>>>>>>>> matures,
>>>>>>>>>>>>> we should track it in properties.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One important factor for me (apart from the API effort,
>>>>>>>>>>>>> especially on the engine side), is that not each table is an MV 
>>>>>>>>>>>>> storage
>>>>>>>>>>>>> table. Surfacing MV-specific storage table properties as first 
>>>>>>>>>>>>> class table
>>>>>>>>>>>>> metadata sounds to impose this metadata on every table, when it 
>>>>>>>>>>>>> is not
>>>>>>>>>>>>> required for normal table operation (yes, they can be optional, 
>>>>>>>>>>>>> but it does
>>>>>>>>>>>>> not sound like the use case warrants exposing them as metadata 
>>>>>>>>>>>>> fields yet).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Similarly, since not every view is a materialized view, it
>>>>>>>>>>>>> sounds reasonable to keep MV-specific data in properties.
>>>>>>>>>>>>>
>>>>>>>>>>>>> UUID (for views), on the other hand, is common to all views,
>>>>>>>>>>>>> hence it made sense to add it as a top level field.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 14, 2024 at 1:01 PM John Zhuge <jzh...@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Szheon,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> While I fully share your concern of abusing table properties,
>>>>>>>>>>>>>> we took the approach of option 1 and run it in production for 
>>>>>>>>>>>>>> several years:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - the feature was still evolving
>>>>>>>>>>>>>>    - quick and simple implementation
>>>>>>>>>>>>>>    - table properties are simple enough and not confusing
>>>>>>>>>>>>>>    - haven't seen an urgent need to convert the properties
>>>>>>>>>>>>>>    to metadata fields and add API
>>>>>>>>>>>>>>    - do not wish on-disk changes (requiring lengthy
>>>>>>>>>>>>>>    tedious migration)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That said, I am open to codifying the mv metadata into api
>>>>>>>>>>>>>> and spec, with the following considerations
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - mv metadata has reached maturity and consensus (could
>>>>>>>>>>>>>>    be just a core portion)
>>>>>>>>>>>>>>    - when mv metadata becomes too complex
>>>>>>>>>>>>>>    - wish to support use cases that are quicker to adopt API
>>>>>>>>>>>>>>    changes (than engines), e.g., using Iceberg library to 
>>>>>>>>>>>>>> manipulate MVs, or
>>>>>>>>>>>>>>    parsing metadata files directly
>>>>>>>>>>>>>>    - Spark view catalog API can evolve separately from
>>>>>>>>>>>>>>    Iceberg API and spec changes
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks all for the great discussion!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, May 10, 2024 at 10:48 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Szheon,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the follow-up. It is possible some of the
>>>>>>>>>>>>>>> concerns were referring to the backend catalogs, but it is all 
>>>>>>>>>>>>>>> connected.
>>>>>>>>>>>>>>> My main personal concern is from the engine connector APIs 
>>>>>>>>>>>>>>> point of
>>>>>>>>>>>>>>> view, but I share the concern about the catalogs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think everyone's concern is not about the complexity* per*
>>>>>>>>>>>>>>> backend catalog/engine catalog API (in which case adding new 
>>>>>>>>>>>>>>> metadata to
>>>>>>>>>>>>>>> existing objects could require less code), but rather about the
>>>>>>>>>>>>>>> *number* of APIs and implementations that need to change
>>>>>>>>>>>>>>> (in which case both new metadata to existing objects and new 
>>>>>>>>>>>>>>> objects
>>>>>>>>>>>>>>> altogether introduce equal complexity).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, May 10, 2024 at 10:31 AM Szehon Ho <
>>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Walaa
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> OK thanks for confirming.  I am still not 100% in
>>>>>>>>>>>>>>>> agreement, my understanding of the rationale for separate 
>>>>>>>>>>>>>>>> Table/View
>>>>>>>>>>>>>>>> objects in the comment that you linked:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think the biggest problem with this is that we would need
>>>>>>>>>>>>>>>>> to modify every catalog to support this combination and that 
>>>>>>>>>>>>>>>>> would be
>>>>>>>>>>>>>>>>> really difficult.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> is about JavaCatalogs /REST Catalog needing to to support
>>>>>>>>>>>>>>>> creating , persisting, and loading a MaterializedView object, 
>>>>>>>>>>>>>>>> which is much
>>>>>>>>>>>>>>>> more complex.  See HiveView PR for example :
>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/9852   We would
>>>>>>>>>>>>>>>> have to do the same exercise for persisting MV.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In our case though, there's not much complexity regardless
>>>>>>>>>>>>>>>> of approach ('properties' or new metadata fields), in terms of 
>>>>>>>>>>>>>>>> Java
>>>>>>>>>>>>>>>> Catalog/REST Catalog.  It's mostly pass-through to storage.  
>>>>>>>>>>>>>>>> Looks like you
>>>>>>>>>>>>>>>> are referring to Spark's View model in terms of complexity, 
>>>>>>>>>>>>>>>> which may be a
>>>>>>>>>>>>>>>> different story, but not sure if it is a good rationale to 
>>>>>>>>>>>>>>>> make Iceberg to
>>>>>>>>>>>>>>>> use 'properties' .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 'properties'  is for read/write configurations, not to save
>>>>>>>>>>>>>>>> metadatas.  To me, its also brittle to save important 
>>>>>>>>>>>>>>>> metadata, as it's not
>>>>>>>>>>>>>>>> in the defined schema.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A string to string map of table properties. This is used to
>>>>>>>>>>>>>>>>> control settings that affect reading and writing and is not 
>>>>>>>>>>>>>>>>> intended to be
>>>>>>>>>>>>>>>>> used for arbitrary metadata.  For example,
>>>>>>>>>>>>>>>>> commit.retry.num-retries is used to control the number of
>>>>>>>>>>>>>>>>> commit retries.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On the other hand, the Draft Spec suggests to save
>>>>>>>>>>>>>>>> `lineage` as a modeled field on the Storage Table's snapshot 
>>>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>>> This allows you to 'time travel', 'branch', and have this 
>>>>>>>>>>>>>>>> metadata life
>>>>>>>>>>>>>>>> cycle integrated via normal snapshots lifecycle operations.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So that's my rationale.  Not sure if we can come to an
>>>>>>>>>>>>>>>> agreement over email though, and may need others to chime in 
>>>>>>>>>>>>>>>> as well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 11:58 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Szehon,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, you are reading the PR correctly, and interpreting
>>>>>>>>>>>>>>>>> the meaning of properties correctly. I think the reply you 
>>>>>>>>>>>>>>>>> pasted from Ryan
>>>>>>>>>>>>>>>>> refers to the same concept as well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the initial Google doc and the issue (by the way it is
>>>>>>>>>>>>>>>>> an issue, not a PR), yes both are proposing new metadata 
>>>>>>>>>>>>>>>>> fields.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The references I made to the modeling doc [1, 2] are
>>>>>>>>>>>>>>>>> reasons why new APIs are not desired. The cons/concerns 
>>>>>>>>>>>>>>>>> applicable to new
>>>>>>>>>>>>>>>>> MV metadata apply by extension to new table and view metadata 
>>>>>>>>>>>>>>>>> fields.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The reason why new metadata adds complexity is that this
>>>>>>>>>>>>>>>>> new metadata needs to be propagated to the engine API. For 
>>>>>>>>>>>>>>>>> example, here is
>>>>>>>>>>>>>>>>> the ViewInfo [3] class in the Spark catalog, which is used in 
>>>>>>>>>>>>>>>>> view methods
>>>>>>>>>>>>>>>>> like createView. Its fields correspond with the Iceberg 
>>>>>>>>>>>>>>>>> metadata. Adding
>>>>>>>>>>>>>>>>> new Iceberg fields should be accompanied with new fields in 
>>>>>>>>>>>>>>>>> the engine
>>>>>>>>>>>>>>>>> catalog/connector APIs, which was a major reason for 
>>>>>>>>>>>>>>>>> rejecting the combined
>>>>>>>>>>>>>>>>> MV object model as well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>>>>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 11:30 PM Szehon Ho <
>>>>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Walaa
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As there may be confusion in the word 'properties', I
>>>>>>>>>>>>>>>>>> want to double check if we are talking about the same thing 
>>>>>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am reading your PR as adding lineage metadata as new
>>>>>>>>>>>>>>>>>> key/value pair under the storage Table's 'properties' field:
>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/main/format/spec.md?plain=1#L677
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *optional* *optional* *properties* A string to string
>>>>>>>>>>>>>>>>>> map of table properties. This is used to control settings 
>>>>>>>>>>>>>>>>>> that affect
>>>>>>>>>>>>>>>>>> reading and writing and is not intended to be used for 
>>>>>>>>>>>>>>>>>> arbitrary metadata.
>>>>>>>>>>>>>>>>>> For example, commit.retry.num-retries is used to control
>>>>>>>>>>>>>>>>>> the number of commit retries.
>>>>>>>>>>>>>>>>>> and adding Storage Table pointer as key/value pair in the
>>>>>>>>>>>>>>>>>> View's 'properties' field:
>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/main/format/view-spec.md?plain=1#L65
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *optional* properties A string to string map of view
>>>>>>>>>>>>>>>>>> properties [2]
>>>>>>>>>>>>>>>>>> Is that correct?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On the other hand, I was talking about adding this
>>>>>>>>>>>>>>>>>> metadata as actual fields, as is described in the Draft Spec 
>>>>>>>>>>>>>>>>>> of the Design
>>>>>>>>>>>>>>>>>> Doc
>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A
>>>>>>>>>>>>>>>>>>  and
>>>>>>>>>>>>>>>>>> first PR https://github.com/apache/iceberg/issues/6420 .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Do you mean, the vote means we cannot model new fields
>>>>>>>>>>>>>>>>>> like 'materialization' and 'lineage' as was proposed there ? 
>>>>>>>>>>>>>>>>>>    If that is
>>>>>>>>>>>>>>>>>> the interpretation, I am not sure I agree.  I dont fully see 
>>>>>>>>>>>>>>>>>> how new fields
>>>>>>>>>>>>>>>>>> adds more catalog implementation complexity over new 
>>>>>>>>>>>>>>>>>> key/value properties?
>>>>>>>>>>>>>>>>>> To me, the vote seemed to just rule out using a combined 
>>>>>>>>>>>>>>>>>> catalog object
>>>>>>>>>>>>>>>>>> (MaterializedView) in favor of re-using the Table and View 
>>>>>>>>>>>>>>>>>> metadata models,
>>>>>>>>>>>>>>>>>> not to prevent change to the Table and View model.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 10:05 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Szehon,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think choosing separate view + table objects precludes
>>>>>>>>>>>>>>>>>>> us from adding new metadata to table and view metadata. 
>>>>>>>>>>>>>>>>>>> Here is one
>>>>>>>>>>>>>>>>>>> relevant comment [1] from Ryan on the modeling doc, where 
>>>>>>>>>>>>>>>>>>> his point is that
>>>>>>>>>>>>>>>>>>> we want to avoid introducing new APIs since it requires 
>>>>>>>>>>>>>>>>>>> updating every
>>>>>>>>>>>>>>>>>>> catalog, and (quoting) even now, we have few 
>>>>>>>>>>>>>>>>>>> implementations that support
>>>>>>>>>>>>>>>>>>> views because of the problems updating back ends. 
>>>>>>>>>>>>>>>>>>> Therefore, one of the
>>>>>>>>>>>>>>>>>>> major reasons to avoid a new model with new metadata is to 
>>>>>>>>>>>>>>>>>>> avoid adding new
>>>>>>>>>>>>>>>>>>> metadata, which introduces this complexity. Here is another 
>>>>>>>>>>>>>>>>>>> similar comment
>>>>>>>>>>>>>>>>>>> from Renjie [2] on the cons listed for the combined object 
>>>>>>>>>>>>>>>>>>> approach.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Even Ryan's point on the MV issue that you referenced
>>>>>>>>>>>>>>>>>>> reads to me as he is supportive of the property model. Here 
>>>>>>>>>>>>>>>>>>> are some quotes:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > We would still want some MV metadata in table
>>>>>>>>>>>>>>>>>>> *properties*.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > I recommend instead reusing the existing snapshot
>>>>>>>>>>>>>>>>>>> metadata structure to store what you need as snapshot 
>>>>>>>>>>>>>>>>>>> *properties*.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > First, I think we want to avoid keeping much state
>>>>>>>>>>>>>>>>>>> information in complex table *properties*.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Again, here, he is supportive of table properties, but
>>>>>>>>>>>>>>>>>>> wants to make sure that the information is simple.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > We may want additional metadata as well, like a UUID
>>>>>>>>>>>>>>>>>>> to ensure we have the right view. I don't think we have a 
>>>>>>>>>>>>>>>>>>> UUID in the view
>>>>>>>>>>>>>>>>>>> spec yet, but we could add one.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here, he is very specific when it comes to new metadata
>>>>>>>>>>>>>>>>>>> fields, and explicitly calls it out. That is the only new 
>>>>>>>>>>>>>>>>>>> metadata field in
>>>>>>>>>>>>>>>>>>> that reply and by now it is already supported. It is also 
>>>>>>>>>>>>>>>>>>> not MV-specific.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hope this addresses your question on the property vs new
>>>>>>>>>>>>>>>>>>> metadata model.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4
>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 5:49 PM Szehon Ho <
>>>>>>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Walaa,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I agree, I definitely do not want yet another pr/doc
>>>>>>>>>>>>>>>>>>>> where discussion happens. as its already quite spread out 
>>>>>>>>>>>>>>>>>>>> :)  But did not
>>>>>>>>>>>>>>>>>>>> want to clarify some points before we get started on the 
>>>>>>>>>>>>>>>>>>>> discussion on your
>>>>>>>>>>>>>>>>>>>> PR.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> With reusing the table and view objects, we are not
>>>>>>>>>>>>>>>>>>>>> changing the existing metadata of either table or view 
>>>>>>>>>>>>>>>>>>>>> spec but rather
>>>>>>>>>>>>>>>>>>>>> introduce new properties and formalize them to express 
>>>>>>>>>>>>>>>>>>>>> materialized views
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On this point, I am not 100% sure that choosing to
>>>>>>>>>>>>>>>>>>>> represent a MaterializedView as a separate View + Table 
>>>>>>>>>>>>>>>>>>>> object precludes us
>>>>>>>>>>>>>>>>>>>> from adding to metadata of Table or View as the Draft Spec 
>>>>>>>>>>>>>>>>>>>> suggested,
>>>>>>>>>>>>>>>>>>>> though.  I think this point was discussed in Jan's initial 
>>>>>>>>>>>>>>>>>>>> PR with a good
>>>>>>>>>>>>>>>>>>>> point from Ryan:
>>>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/6420#issuecomment-1369280546
>>>>>>>>>>>>>>>>>>>>  that
>>>>>>>>>>>>>>>>>>>> using Table Properties to track lineage is fairly brittle, 
>>>>>>>>>>>>>>>>>>>> and having it
>>>>>>>>>>>>>>>>>>>> formalized in the Iceberg metadata is cleaner, and that 
>>>>>>>>>>>>>>>>>>>> was thus the
>>>>>>>>>>>>>>>>>>>> direction of the Draft Spec in the design doc.  What do 
>>>>>>>>>>>>>>>>>>>> people think?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 5:35 PM Walaa Eldin Moustafa <
>>>>>>>>>>>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks Szehon.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The reason for the difference is that the proposal in
>>>>>>>>>>>>>>>>>>>>> the Google doc is based on a new MV model, hence, new 
>>>>>>>>>>>>>>>>>>>>> metadata fields and a
>>>>>>>>>>>>>>>>>>>>> new metadata model were being introduced (with types, 
>>>>>>>>>>>>>>>>>>>>> optionality, etc).
>>>>>>>>>>>>>>>>>>>>> With reusing the table and view objects, we are not 
>>>>>>>>>>>>>>>>>>>>> changing the existing
>>>>>>>>>>>>>>>>>>>>> metadata of either table or view spec but rather 
>>>>>>>>>>>>>>>>>>>>> introduce new properties
>>>>>>>>>>>>>>>>>>>>> and formalize them to express materialized views. This 
>>>>>>>>>>>>>>>>>>>>> would be the answer
>>>>>>>>>>>>>>>>>>>>> to most of the questions you posted on the PR (besides 
>>>>>>>>>>>>>>>>>>>>> some naming
>>>>>>>>>>>>>>>>>>>>> questions, which I think should be straightforward).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> With that fundamental difference, we cannot lift and
>>>>>>>>>>>>>>>>>>>>> shift what is in the doc to any PR. Further, having 
>>>>>>>>>>>>>>>>>>>>> consensus on separate
>>>>>>>>>>>>>>>>>>>>> table and view objects contradicts with the point being 
>>>>>>>>>>>>>>>>>>>>> made on having
>>>>>>>>>>>>>>>>>>>>> consensus on the doc. We might have had agreements on 
>>>>>>>>>>>>>>>>>>>>> some elements, but
>>>>>>>>>>>>>>>>>>>>> definitely not on the whole doc, proven by the follow ups 
>>>>>>>>>>>>>>>>>>>>> (also as a
>>>>>>>>>>>>>>>>>>>>> community, not individuals).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Therefore: we need a new space to discuss the separate
>>>>>>>>>>>>>>>>>>>>> table and view properties.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Is the question whether to:
>>>>>>>>>>>>>>>>>>>>> 1- Create a new doc
>>>>>>>>>>>>>>>>>>>>> 2- Create a new PR?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I feel a PR is the most effective way, especially
>>>>>>>>>>>>>>>>>>>>> given the fact that we discussed the topic a lot by now. 
>>>>>>>>>>>>>>>>>>>>> If we agree, we
>>>>>>>>>>>>>>>>>>>>> can continue the discussion on the PR, else, we can 
>>>>>>>>>>>>>>>>>>>>> create a doc.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, May 9, 2024 at 4:39 PM Szehon Ho <
>>>>>>>>>>>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks Walaa for driving it forward, looking forward
>>>>>>>>>>>>>>>>>>>>>> to thinking about implementation of Materialized Views.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I see Jan's point, the PR spec change is similar but
>>>>>>>>>>>>>>>>>>>>>> does not seem to be completely aligned with the Draft 
>>>>>>>>>>>>>>>>>>>>>> Spec in the design
>>>>>>>>>>>>>>>>>>>>>> doc:
>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/
>>>>>>>>>>>>>>>>>>>>>> .  I left my comments on PR of those sections with the 
>>>>>>>>>>>>>>>>>>>>>> links to the
>>>>>>>>>>>>>>>>>>>>>> difference.  I think most of those Draft Spec proposal 
>>>>>>>>>>>>>>>>>>>>>> is still applicable
>>>>>>>>>>>>>>>>>>>>>> after the decision to have separate Table and View 
>>>>>>>>>>>>>>>>>>>>>> objects  It will be
>>>>>>>>>>>>>>>>>>>>>> interesting to at least see drill a bit further why we 
>>>>>>>>>>>>>>>>>>>>>> did not choose the
>>>>>>>>>>>>>>>>>>>>>> approach in the Draft Spec and chose another way.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Szehon
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, May 8, 2024 at 4:48 AM Jan Kaul
>>>>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Well, everybody that actively contributed to the
>>>>>>>>>>>>>>>>>>>>>>> discussion on the original google doc was in consensus. 
>>>>>>>>>>>>>>>>>>>>>>> That's why I
>>>>>>>>>>>>>>>>>>>>>>> brought up the topic at the Community Sync on the 
>>>>>>>>>>>>>>>>>>>>>>> 2024-02-14 (
>>>>>>>>>>>>>>>>>>>>>>> https://youtu.be/uAQVGd5zV4I?t=890) to raise the
>>>>>>>>>>>>>>>>>>>>>>> awareness of the broader community. After which the 
>>>>>>>>>>>>>>>>>>>>>>> discussion about the
>>>>>>>>>>>>>>>>>>>>>>> storage model started. I don't think that the 
>>>>>>>>>>>>>>>>>>>>>>> discussion about a single
>>>>>>>>>>>>>>>>>>>>>>> aspect of a proposal should invalidate all other 
>>>>>>>>>>>>>>>>>>>>>>> aspects of the proposal.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regardless, the state of the proposal from the
>>>>>>>>>>>>>>>>>>>>>>> original google doc contains a lot of valuable 
>>>>>>>>>>>>>>>>>>>>>>> contributions from Micah,
>>>>>>>>>>>>>>>>>>>>>>> Szehon, Jack, Dan, yourself and others and it should at 
>>>>>>>>>>>>>>>>>>>>>>> least provide the
>>>>>>>>>>>>>>>>>>>>>>> basis for any further discussion. I don't think it's 
>>>>>>>>>>>>>>>>>>>>>>> effective to start
>>>>>>>>>>>>>>>>>>>>>>> with a completely different design because we are bound 
>>>>>>>>>>>>>>>>>>>>>>> to have the same
>>>>>>>>>>>>>>>>>>>>>>> discussions all over again.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks, Jan
>>>>>>>>>>>>>>>>>>>>>>> On 08.05.24 12:11, Walaa Eldin Moustafa wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The only consensus the community had was on the
>>>>>>>>>>>>>>>>>>>>>>> object model through the most recent voting thread [1]. 
>>>>>>>>>>>>>>>>>>>>>>> This kind of
>>>>>>>>>>>>>>>>>>>>>>> consensus was not present during the doc discussions, 
>>>>>>>>>>>>>>>>>>>>>>> and this should be
>>>>>>>>>>>>>>>>>>>>>>> evident from the fact the last doc state listed 5 
>>>>>>>>>>>>>>>>>>>>>>> alternatives with no
>>>>>>>>>>>>>>>>>>>>>>> particular conclusion. I am not quite sure what type of 
>>>>>>>>>>>>>>>>>>>>>>> consensus we are
>>>>>>>>>>>>>>>>>>>>>>> referring to here given all the follow up discussions, 
>>>>>>>>>>>>>>>>>>>>>>> alternatives, etc.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Due to the separate object model, the PR is
>>>>>>>>>>>>>>>>>>>>>>> fundamentally different from the doc in the sense it 
>>>>>>>>>>>>>>>>>>>>>>> does not propose a new
>>>>>>>>>>>>>>>>>>>>>>> metadata model but rather formalizes some new table and 
>>>>>>>>>>>>>>>>>>>>>>> view properties
>>>>>>>>>>>>>>>>>>>>>>> related to MVs. That is also one reason there are no 
>>>>>>>>>>>>>>>>>>>>>>> repeated discussions.
>>>>>>>>>>>>>>>>>>>>>>> That said, if you feel there is a repeated discussion 
>>>>>>>>>>>>>>>>>>>>>>> (which I do not see
>>>>>>>>>>>>>>>>>>>>>>> so far), it would be best to link the relevant 
>>>>>>>>>>>>>>>>>>>>>>> discussion from the doc in a
>>>>>>>>>>>>>>>>>>>>>>> comment.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Happy to move the discussion elsewhere if there is
>>>>>>>>>>>>>>>>>>>>>>> sufficient support for this idea, but as things stand, 
>>>>>>>>>>>>>>>>>>>>>>> I do not see this as
>>>>>>>>>>>>>>>>>>>>>>> an efficient way to make progress. It sounds we have 
>>>>>>>>>>>>>>>>>>>>>>> been re-emphasizing
>>>>>>>>>>>>>>>>>>>>>>> the same points in the last two replies, so I will let 
>>>>>>>>>>>>>>>>>>>>>>> others chime in at
>>>>>>>>>>>>>>>>>>>>>>> this point.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>> https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, May 8, 2024 at 2:31 AM Jan Kaul
>>>>>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid>
>>>>>>>>>>>>>>>>>>>>>>> <jank...@mailbox.org.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The original google doc
>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing>
>>>>>>>>>>>>>>>>>>>>>>>> discussed multiple
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>

Re: Materialized Views: Next Steps

Reply via email to