Re: Materialized view integration with REST spec

Ryan Blue Thu, 29 Feb 2024 16:38:27 -0800

> Ryan, in the option "Separate table and view", will there be a reference
> (or pointer) to the table from the view metadata?


Yes. And this is a problem we need to solve generally because a
materialized table needs to be able to track the upstream state of tables
that were used. I think it would be one or more identifiers stored in a
view metadata field, one for each materialization. But there are a lot of
assumptions about how we come out on these questions before we get to how
to store metadata.
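
A minimal sketch of that idea, assuming a hypothetical view metadata field that holds one storage table identifier per materialization. The class and method names (`ViewMaterializations`, `addMaterialization`) are illustrative only and are not part of the Iceberg view spec or API; identifiers are plain strings to keep the sketch independent of any real Iceberg class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for the materialization references in view metadata.
// Each entry is a table identifier such as "db1.mv1_storage".
final class ViewMaterializations {
  private final List<String> storageTables = new ArrayList<>();

  // Register one more materialization of the view; duplicates are ignored.
  public void addMaterialization(String tableIdentifier) {
    if (tableIdentifier == null || tableIdentifier.isEmpty()) {
      throw new IllegalArgumentException("table identifier is required");
    }
    if (!storageTables.contains(tableIdentifier)) {
      storageTables.add(tableIdentifier);
    }
  }

  // A view with no identifiers behaves like a regular view.
  public boolean isMaterialized() {
    return !storageTables.isEmpty();
  }

  // Read-only view of the identifiers, in insertion order.
  public List<String> storageTables() {
    return Collections.unmodifiableList(storageTables);
  }
}
```

Storing identifiers rather than metadata file locations keeps the view and table commits independent, which is the point of the "separate table and view" option.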

On Thu, Feb 29, 2024 at 4:35 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Ryan, in the option "Separate table and view", will there be a reference
> (or pointer) to the table from the view metadata? Since the option of
> "embedding a table metadata location in view metadata" is not preferred, it
> is not clear how to associate the table with the view in the "Separate
> table and view" option without such a pointer.
>
> Thanks,
> Walaa.
>
>
> On Thu, Feb 29, 2024 at 3:04 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Looks like it wasn’t clear what I meant for the 3 categories, so I’ll be
>> more specific:
>>
>>    - *Separate table and view*: this option is to have the objects that
>>    we have today, with extra metadata. Commit processes are separate:
>>    committing to the table doesn’t alter the view and committing to the view
>>    doesn’t change the table. However, changing the view can make it so the
>>    table is no longer useful as a materialization.
>>    - *A combination of a view and a table*: in this option, the table
>>    metadata and view metadata are the same as the first option. The difference
>>    is that the commit process combines them, either by embedding a table
>>    metadata location in view metadata or by tracking both in the same catalog
>>    reference.
>>    - *A new metadata type*: this option is where we define a new
>>    metadata object that has view attributes, like SQL representations, along
>>    with table attributes, like partition specs and snapshots.
>>
>> Hopefully this is clear because I think much of the confusion is caused
>> by different definitions.
>>
>> The LoadTableResponse having optional metadata-location field implies
>> that the object in the catalog no longer needs to hold a metadata file
>> pointer
>>
>> The REST protocol has not removed the requirement for a metadata file, so
>> I’m going to keep focused on the MV design options.
>>
>> When we say a MV can be a “new metadata type”, it does not mean it needs
>> to define a completely brand new structure of the metadata content
>>
>> I’m making a distinction between separate metadata files for the table
>> and the view and a combined metadata object, as above.
>>
>> We can define an “Iceberg MV” to be an object in a catalog, which has 1
>> table metadata file pointer, and 1 view metadata file pointer
>>
>> This is the option I am referring to as a “combination of a view and a
>> table”.
>>
>> So to review my initial email, I don’t see a reason why a combined view
>> and table is advantageous, either implemented by having a catalog reference
>> with two metadata locations or embedding a table metadata location in view
>> metadata. This would cause unnecessary dependence between the view and
>> table in catalogs. I guess there’s an argument that you could load both
>> table and view metadata locations at the same time. That hardly seems worth
>> the trouble given the recent issues with adding views to the JDBC catalog.
>>
>> I also think that once we decide on structure, we can make it possible
>> for REST catalog implementations to do smart things, in a way that doesn’t
>> put additional requirements on the underlying catalog store. For instance,
>> we could specify how to send additional objects in a LoadViewResult, in
>> case the catalog wants to pre-fetch table metadata. I think these
>> optimizations are a later addition, after we define the relationship
>> between views and tables.
>>
>> Jack, it sounds like you’re the proponent of a combined table and view
>> (rather than a new metadata spec for a materialized view). What is the main
>> motivation? It seems like you’re convinced of that approach, but I don’t
>> understand the advantage it brings.
>>
>> Ryan
>>
>> On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> Yes I mostly agree with the assessment.  To clarify a few minor points.
>>>
>>> is a materialized view a view and a separate table, a combination of the
>>>> two (i.e. commits are combined), or a new metadata type?
>>>
>>>
>>> For 'new metadata type', I consider mostly Jack's initial proposal of a
>>> new Catalog MV object that has two references (ViewMetadata +
>>> TableMetadata).
>>>
>>> The arguments that I see for a combined materialized view object are:
>>>>
>>>>    - Regular views are separate, rather than being tables with SQL and
>>>>    no data so it would be inconsistent (“Iceberg view is just a table with no
>>>>    data but with representations defined. But we did not do that.”)
>>>>
>>>>
>>>>    - Materialized views are different objects in DDL
>>>>
>>>>
>>>>    - Tables may be a superset of functionality needed for materialized
>>>>    views
>>>>
>>>>
>>>>    - Tables are not typically exposed to end users — but this isn’t
>>>>    required by the separate view and table option
>>>>
>>> For completeness, there seem to be a few additional ones (mentioned in
>>> the Slack and above messages).
>>>
>>>    - Lack of spec change (to ViewMetadata).  But as Jack says it is a
>>>    spec change (ie, to catalogs)
>>>    - A single call to get the View's StorageTable (versus two calls)
>>>    - A more natural API, no opportunity for user to call
>>>    Catalog.dropTable() and renameTable() on storage table
>>>
>>>
>>> *Thoughts:* I think the long discussion sessions we had on Slack
>>> were fruitful for me, as seeing the API clarified some things.
>>>
>>> I was initially more in favor of MV being a new metadata type
>>> (TableMetadata + ViewMetadata).  But seeing most of the MV operations end
>>> up being ViewCatalog or Catalog operations, I am starting to think API-wise
>>> that it may not align with the new metadata type (unless we define
>>> MVCatalog and /MV REST endpoints, which then are boilerplate wrappers).
>>>
>>> Initially one question I had for option 'a view and a separate table',
>>> was how to make this table reference (metadata.json or catalog reference).
>>> In the previous option, we had a precedent of Catalog references to
>>> Metadata, but not pointers between Metadatas.  I initially saw the proposed
>>> Catalog's TableIdentifier pointer as 'polluting' catalog concerns in
>>> ViewMetadata.  (I saw Catalog and ViewCatalog as a layer above
>>> TableMetadata and ViewMetadata).  But I think Dan in the Slack made a fair
>>> point that ViewMetadata already is tightly bound with a Catalog.  In this
>>> case, I think this approach does have its merits as well in aligning
>>> Catalog APIs with the metadata.
>>>
>>> Thanks
>>> Szehon
>>>
>>>
>>>
>>> On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul <jank...@mailbox.org.invalid>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to provide my perspective on the question of what a
>>>> materialized view is and elaborate on Jack's recent proposal to view a
>>>> materialized view as a catalog concept.
>>>>
>>>> Firstly, let's look at the role of the catalog. Every entity in the
>>>> catalog has a *unique identifier*, and the catalog provides methods to
>>>> create, load, and update these entities. An important thing to note is that
>>>> the catalog methods exhibit two different behaviors: the *create and
>>>> load methods deal with the entire entity*, while the *update(commit)
>>>> method only deals with partial changes* to the entities.
>>>>
>>>> In the context of our current discussion, materialized view (MV)
>>>> metadata is a union of view and table metadata. The fact that the update
>>>> method deals only with partial changes enables us to *reuse the
>>>> existing methods for updating tables and views*. For updates we don't
>>>> have to define what constitutes an entire materialized view. Changes to a
>>>> materialized view targeting the properties related to the view metadata
>>>> could use the update(commit) view method. Similarly, changes targeting the
>>>> properties related to the table metadata could use the update(commit) table
>>>> method. This is great news because we don't have to redefine view and table
>>>> commits (requirements, updates).
>>>> This is shown in the fact that Jack uses the same operation to update
>>>> the storage table for Option 1 and 3:
>>>>
>>>> // REST: POST /namespaces/db1/tables/mv1?materializedView=true
>>>> // non-REST: update JSON files at table_metadata_location
>>>> storageTable.newAppend().appendFile(...).commit();
>>>>
>>>> The open question is *whether the create and load methods should treat
>>>> the properties that constitute the MV metadata as two entities (View +
>>>> Table) or one entity (new MV object)*. This is all part of Jack's
>>>> proposal, where Option 1 proposes a new MV object, and Option 3 proposes
>>>> two separate entities. The advantage of Option 1 is that it doesn't require
>>>> two operations to load the metadata. On the other hand, the advantage of
>>>> Option 3 is that no new operations or catalogs have to be defined.
>>>>
>>>> In my opinion, defining a new representation for materialized views
>>>> (Option 1) is generally the cleaner solution. However, I see a path where
>>>> we could first introduce Option 3 and still have the possibility to
>>>> transition to Option 1 if needed. The great thing about Option 3 is that it
>>>> only requires minor changes to the current spec and is mostly
>>>> implementation detail.
>>>>
>>>> Therefore I would propose small additions to Jack's Option 3 that only
>>>> introduce changes to the spec that are not specific to materialized views.
>>>> The idea is to introduce boolean properties to be set on the creation of
>>>> the view and the storage table that indicate that they belong to a
>>>> materialized view. The view property "materialized" is set to "true" for a
>>>> MV and "false" for a regular view. And the table property "storage_table"
>>>> is set to "true" for a storage table and "false" for a regular table. The
>>>> absence of these properties indicates a regular view or table.
>>>>
>>>> ViewCatalog viewCatalog = (ViewCatalog) catalog;
>>>>
>>>> // REST: GET /namespaces/db1/views/mv1
>>>> // non-REST: load JSON file at metadata_location
>>>> View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
>>>>
>>>> // REST: GET /namespaces/db1/tables/mv1
>>>> // non-REST: load JSON file at table_metadata_location if present
>>>> Table storageTable = mv.storageTable();
>>>>
>>>> // REST: POST /namespaces/db1/tables/mv1
>>>> // non-REST: update JSON file at table_metadata_location
>>>> storageTable.newAppend().appendFile(...).commit();
>>>>
>>>> We could then introduce a new requirement for views and tables called
>>>> "AssertProperty" which could make sure to only perform updates that are
>>>> in line with materialized views. The additional requirement can be seen as a
>>>> general extension which does not need to be changed if we decide to go
>>>> with Option 1 in the future.
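
As a rough sketch of what such a requirement could look like: the name `AssertProperty` comes from the proposal above, but its shape here is an assumption and not an existing Iceberg class. A missing property is treated as "false", following the rule that absence indicates a regular view or table.

```java
import java.util.Map;

// Hypothetical commit requirement: the update is accepted only if the
// entity's property currently has the expected value.
final class AssertProperty {
  private final String key;
  private final String expected;

  public AssertProperty(String key, String expected) {
    this.key = key;
    this.expected = expected;
  }

  // Absent properties default to "false" (regular view or table).
  public boolean isSatisfiedBy(Map<String, String> properties) {
    String actual = properties.getOrDefault(key, "false");
    return expected.equals(actual);
  }

  // Reject the commit when the requirement does not hold.
  public void validate(Map<String, String> properties) {
    if (!isSatisfiedBy(properties)) {
      throw new IllegalStateException(
          "Requirement failed: expected " + key + "=" + expected);
    }
  }
}
```

For example, a refresh commit against the storage table could carry `new AssertProperty("storage_table", "true")`, so the same commit sent to a regular table would be rejected.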
>>>>
>>>> Let me know what you think.
>>>>
>>>> Best wishes,
>>>>
>>>> Jan
>>>>
>>>> On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
>>>>
>>>> Thanks Ryan for the insights. I agree that reusing existing metadata
>>>> definitions and minimizing spec changes are very important. This also
>>>> minimizes spec drift (between materialized views and views spec, and
>>>> between materialized views and tables spec), and simplifies the
>>>> implementation.
>>>>
>>>> In an effort to take the discussion forward with concrete design
>>>> options based on an end-to-end implementation, I have prototyped the
>>>> implementation (and added Spark support) in this PR
>>>> https://github.com/apache/iceberg/pull/9830. I hope it helps us reach
>>>> convergence faster. More details about some of the design options are
>>>> discussed in the description of the PR.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>>
>>>> On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I mean separate table and view metadata that is somehow combined
>>>>> through a commit process. For instance, keeping a pointer to a table
>>>>> metadata file in a view metadata file or combining commits to reference
>>>>> both. I don't see the value in either option.
>>>>>
>>>>> On Wed, Feb 28, 2024 at 5:05 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Ryan for the help to trace back to the root question! Just a
>>>>>> clarification question regarding your reply before I reply further: what
>>>>>> exactly does the option "a combination of the two (i.e. commits are
>>>>>> combined)" mean? How is that different from "a new metadata type"?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> I’m catching up on this conversation, so hopefully I can bring a
>>>>>>> fresh perspective.
>>>>>>>
>>>>>>> Jack already pointed out that we need to start from the basics and I
>>>>>>> agree with that. Let’s remove voting at this point. Right now is the time
>>>>>>> for discussing trade-offs, not lining up and taking sides. I realize that
>>>>>>> wasn’t the intent with adding a vote, but that’s almost always the result.
>>>>>>> It’s too easy to use it as a stand-in for consensus and move on
>>>>>>> prematurely. I get the impression from the swirl in Slack that discussion
>>>>>>> has moved ahead of agreement.
>>>>>>>
>>>>>>> We’re still at the most basic question: is a materialized view a
>>>>>>> view and a separate table, a combination of the two (i.e. commits are
>>>>>>> combined), or a new metadata type?
>>>>>>>
>>>>>>> For now, I’m ignoring whether the “separate table” is some kind of
>>>>>>> “system table” (meaning hidden?) or if it is exposed in the catalog. That’s
>>>>>>> a later choice (already pointed out) and, I suspect, it should be delegated
>>>>>>> to catalog implementations.
>>>>>>>
>>>>>>> To simplify this a little, I think that we can eliminate the option
>>>>>>> to combine table and view commits. I don’t think there is a reason to
>>>>>>> combine the two. If separate, a table would track the view version used
>>>>>>> along with freshness information for referenced tables. If the table is
>>>>>>> automatically skipped when the version no longer matches the view, then no
>>>>>>> action needs to happen when a view definition changes. Similarly, the table
>>>>>>> can be updated independently without needing to also swap view metadata.
>>>>>>> This also aligns with the idea from the original doc that there can be
>>>>>>> multiple materialization tables for a view. Each should operate
>>>>>>> independently unless I’m missing something.
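
A minimal sketch of the separate-table behavior described above, assuming the storage table records the view version it was computed from plus the snapshot read from each source table. `MaterializationState` and its fields are hypothetical names for illustration, not part of any Iceberg spec.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical state recorded on a storage table at refresh time.
final class MaterializationState {
  private final int viewVersionId;                 // view version the data was computed from
  private final Map<String, Long> sourceSnapshots; // source table identifier -> snapshot id read

  public MaterializationState(int viewVersionId, Map<String, Long> sourceSnapshots) {
    this.viewVersionId = viewVersionId;
    Map<String, Long> copy = new HashMap<>(sourceSnapshots);
    this.sourceSnapshots = Collections.unmodifiableMap(copy);
  }

  // When the view has moved to another version the table is simply skipped,
  // so changing the view definition requires no action on the table.
  public boolean matchesView(int currentViewVersionId) {
    return viewVersionId == currentViewVersionId;
  }

  // Fresh only if every recorded source table is still at the snapshot read.
  public boolean isFresh(Map<String, Long> currentSnapshots) {
    for (Map.Entry<String, Long> e : sourceSnapshots.entrySet()) {
      if (!Objects.equals(e.getValue(), currentSnapshots.get(e.getKey()))) {
        return false;
      }
    }
    return true;
  }
}
```

An engine would use the stored data only when both checks pass and otherwise fall back to running the view SQL; each materialization table carries its own state, so several can exist for one view.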
>>>>>>>
>>>>>>> I don’t think the last paragraph’s conclusion is contentious so I’ll
>>>>>>> move on, but please stop here and reply if you disagree!
>>>>>>>
>>>>>>> That leaves the main two options, a view and a separate table linked
>>>>>>> by metadata, or, combined materialized view metadata.
>>>>>>>
>>>>>>> As the doc notes, the separate view and table option is simpler
>>>>>>> because it reuses existing metadata definitions and falls back to simple
>>>>>>> views. That is a significantly smaller spec and small is very, very
>>>>>>> important when it comes to specs. I think that the argument for a new
>>>>>>> definition of a materialized view needs to overcome this disadvantage.
>>>>>>>
>>>>>>> The arguments that I see for a combined materialized view object are:
>>>>>>>
>>>>>>>    - Regular views are separate, rather than being tables with SQL
>>>>>>>    and no data so it would be inconsistent (“Iceberg view is just a table with
>>>>>>>    no data but with representations defined. But we did not do that.”)
>>>>>>>    - Materialized views are different objects in DDL
>>>>>>>    - Tables may be a superset of functionality needed for
>>>>>>>    materialized views
>>>>>>>    - Tables are not typically exposed to end users — but this isn’t
>>>>>>>    required by the separate view and table option
>>>>>>>
>>>>>>> Am I missing any arguments for combined metadata?
>>>>>>>
>>>>>>> Ryan
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular
