I'd like to clarify my concerns here because I think there are more aspects
to this than we've captured.

*Partial metadata loads add significant complexity to the protocol*
Iceberg metadata is a complicated structure, and finding a way to represent
how we want to piece it apart, and which pieces, is non-trivial.  There are
nested structures and references between fields that would each need a
custom representation in the response.  This also makes it difficult for
clients to process and for services to implement.  Adding this (even with an
option to return full metadata with requirements that reflect the table
spec) necessitates a v2 endpoint.  If catalogs are required to support all
partial load semantics, then the catalog becomes complicated.  If the
catalog can opt to always return the full metadata, the client becomes more
complicated, since it may have to handle two very different-looking
response objects for any load request.
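
To sketch what that client-side burden could look like (a purely
hypothetical example: the types and helpers below are illustrative, not the
actual REST spec or client API), the client ends up branching on which shape
of response it got back:

    // Hypothetical client handling when a catalog may answer a partial
    // request with either a full or a partial payload.
    LoadTableResponse response = catalog.loadTable(ident, requestedFields);
    TableMetadata metadata;
    if (response.hasFullMetadata()) {
      // The catalog ignored the projection and returned everything.
      metadata = response.fullMetadata();
    } else {
      // The catalog honored the projection; the client has to stitch the
      // subset together and fetch anything else it needs later.
      metadata = reconstructFromPartial(response.partialMetadata());
    }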

*Partial metadata doesn't address the underlying issue, but pushes it
somewhere else*
From a client perspective, I can see that this feels like an optimization
because I can just grab what I want from the metadata (e.g. schema, or
properties).  However, all we've done is push that complexity to the server,
which either has to parse the metadata and return a subset of it, or needs
a more complicated way of representing and storing independent pieces of
metadata (all while still being required to produce new JSON metadata).  The
net effect is a more complicated service, and the underlying issue of
maintaining the metadata still needs to be addressed.

*Partial metadata doesn't align with primary use cases*
The vast majority of use cases require a significant amount of the metadata
returned in the load table response.  While some pieces may be discarded,
much of the information is necessary to read or update a table.  The refs
loading mode was an effort to limit the overall size of the response while
still including the vast majority of relevant information for read-only use
cases, but even our most complete implementations still need the full
metadata to properly construct a new commit and resolve conflicts.

Even the example of Impala loading the location to determine whether the
table has changed is less than ideal, because to accurately answer that
question you need to load the metadata.  For example, if there was a
background compaction that resulted in a rewrite operation, or a property
change that doesn't affect the underlying data, it may not be necessary to
invalidate the cache.  The problem is further exacerbated if the community
decides to remove the location requirement, because the location would then
no longer be available to signify the state of the table.
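
To make that concrete, here is a rough sketch (loosely modeled on the Java
TableMetadata/Snapshot classes; shouldInvalidate is a made-up helper, not an
Impala or Iceberg API): deciding whether a cached table is stale depends on
the snapshot contents, not on the metadata location alone.

    // Rough sketch: a new metadata location by itself doesn't tell you
    // whether the cached data is stale.
    boolean shouldInvalidate(TableMetadata cached, TableMetadata current) {
      Snapshot cachedSnap = cached.currentSnapshot();
      Snapshot currentSnap = current.currentSnapshot();
      if (cachedSnap == null || currentSnap == null) {
        return cachedSnap != currentSnap;
      }
      if (cachedSnap.snapshotId() == currentSnap.snapshotId()) {
        // Same snapshot: e.g. a property-only commit produced a new metadata
        // file (and location) without changing the data.
        return false;
      }
      // Different snapshot: it could still be a compaction/rewrite that kept
      // the logical data the same; answering that requires the snapshot
      // details, not just the location.
      return true;
    }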

*Partial metadata impedes adoption*
My biggest concern is that the added complexity here impedes adoption of
the REST specification.  There are a large number of engines and catalog
implementations that are still in the early stages of the adoption curve.
Partial metadata loads would split these groups into the catalogs willing to
implement it and those that aren't, and the engines that start requiring it
in order to function.  While I think partial metadata loads are an
interesting technical challenge, I don't believe they are necessary, and our
effort should go into producing good solutions for metadata management and
catalog implementations that can return table metadata quickly to clients.

I feel that focusing on table metadata maintenance addresses all of the
issues except the most extreme edge cases, and good catalog implementations
can return a metadata payload faster than most object stores can even load
the metadata JSON file (in practice, single-digit-millisecond responses are
achievable here), so performance is not the tradeoff.

- Dan


On Tue, Oct 29, 2024 at 1:31 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

> Hi Iceberg Community,
>
> I just wanted to mention that I was also going to start a discussion about
> getting partial information from LoadTableResponse through the REST API.
> My motivation is a bit different here, though:
> Impala currently has strong integration with HMS and in turn with the
> HiveCatalog. There is now an effort in the project to make it work with the
> REST catalog for Iceberg tables, and there is one piece we are missing with
> the REST API. Impala caches table metadata and we need a way
> to decide whether we have to reload the metadata for a particular table or
> not. Currently, with HMS we have a push-based solution where every change
> of the table is pushed to Impala from HMS as notifications/events, and with
> REST catalog we were thinking of a pull-based approach where Impala
> occasionally asks the REST catalog whether a particular table is up-to-date
> or not.
>
> *Use-case*: So in Impala's case what would be important is to have a REST
> Catalog API to answer a question like:
> "I cached this version of this particular table, is it up-to-date or do I
> have to reload it?"
>
> *Possible solutions*:
> 1) This could either be achieved by an API like this:
>     boolean isLatest(TableIdentifier ident, String metadataLocation);
> 2) Another approach could be to get the latest metadata location and let
> the engine compare it to the one it holds:
>     String metadataLocation(TableIdentifier ident);
> 3) Similarly to 2) querying metadata location could also be achieved by
> the current proposal of partial metadata like: (I just made up some types
> here)
>     Table loadTable(TableIdentifier ident, SomeFilterClass.MetadataLocation);
>
> Either way is fine for Impala, I think; I just wanted to share our use case
> that could also leverage getting partial metadata.
> Now that I have written this mail it seems to hijack the original
> conversation a bit. Let me know if I should raise this in a separate
> [discuss] thread.
>
> Regards,
> Gabor
>
> On Tue, Oct 29, 2024 at 2:16 AM Haizhou Zhao <zhaohaizhou940...@gmail.com>
> wrote:
>
>> Hello Dev list,
>>
>> I want to update the community on the current thread for the proposal
>> "Partially Loading Metadata - LoadTable V2" after hearing more perspectives
>> from the community. In general, there is still some distance to go toward
>> a general consensus, so I hope to foster more conversations and hear new
>> input.
>>
>> *Previous Discussions* (
>> https://docs.google.com/document/d/1Nv7_9XqS8EyR30_mrrqkwbZx9pw34i3HYIwuDDXnOY4/edit?tab=t.0
>> )
>>
>>
>> *10/28/2024, quick google meet discussion*
>>
>> Thanks, Christian, Dmitri, Eric, JB, Szehon, Yufei for your time and
>> voicing your opinions this morning. Here's a quick summary of what we
>> discussed (detailed meeting notes are also included in the link above):
>>
>> Folks agreed that having a REST endpoint allowing clients to filter for
>> what they need from LoadTableResult is a useful feature. The preliminary
>> use cases that were brought up:
>> 1. Load only current snapshot and current schema
>> 2. Load only metadata file location
>> 3. Load only credentials to access table
>> 4. Query historical status of the table when time traveling
>> Meanwhile, it is also important for this endpoint to be extensible enough
>> so that it could take care of similar use cases that only require a
>> portion of LoadTableResult (metadata included) in the future.
>>
>> Points where the group has no strong preference, or needs further input:
>> 1. Whether to modify the existing loadTable endpoint for partial loading
>> or to create a new endpoint. The possible concern here is backward
>> compatibility.
>> 2. Whether to add bulk support for cases like loading the current
>> schema of all tables belonging to the same namespace.
>>
>>
>> *10/23/2024, Iceberg community sync*
>>
>> Thanks, Ryan, Dan, Yufei, JB, Russel and Szehon for your inputs here.
>>
>> Folks are divided on two aspects:
>> 1. Can we use table maintenance work to keep metadata size in check, thus
>> removing the need to slice metadata at all?
>> 2. Is it the same use case to bulk load part of the information for many
>> tables and to load part of the information for one table?
>>
>>
>> *10/09/2024, Dev list*
>>
>> Thanks, Dan, Eduard for your inputs here.
>>
>> Folks are aligned here on extending the existing "refs" mode to other
>> fields (e.g. metadata-log, snapshot-log, schemas), so that those fields
>> can be loaded lazily, only when needed.
>>
>>
>> There are other members of the community I discussed this topic with. I
>> appreciate your input; if I failed to mention your discussion here, it is
>> because I did not keep a written record of its context, and I apologize if
>> you fall into this category.
>>
>>
>> *Summary of perspectives*
>>
>> The original proposal aimed to tackle the growing metadata problem,
>> and proposed a loadTable V2 endpoint. As the last thread mentioned, the
>> conclusion at the time was that *extending the existing "refs" loading
>> mode to more fields is preferable as it introduces less complexity and is
>> more feasible to implement*.
>>
>> The later threads were where the community diverged. On one side, *there's
>> a general scepticism about the concept of partial metadata* (i.e. unioning
>> results from different requests has been a problem, even for "refs" lazy
>> loading in the past); on the other side, *there's a push to generalize the
>> partial metadata concept to "LoadTableResult" as a whole* (e.g. to only
>> return metadata file location, or only return table access creds based on
>> client filter).
>>
>> Related is the concept of a bulk API, a use case the community has raised
>> more than once, typically tied to data warehouse management features, such
>> as: 1) querying the current schemas of all the tables belonging to a
>> namespace; 2) querying certain table properties of many tables to see if
>> any maintenance (downstream) jobs should be triggered; 3) querying
>> ownership information of all tables to check security compliance across
>> the data warehouse, etc.
>>
>> I want to lay everything out and foster more discussion toward a good
>> direction:
>> 1. extend the current "refs" lazy loading mechanism into a more generic
>> solution
>> 2. avoid partial metadata entirely, and try to contain metadata size so
>> that it can always (or most of the time) be loaded in full
>> 3. generalize the partial loading concept to the entire "LoadTableResult"
>> (e.g. a generic loadTable V2 endpoint), so that users can use the same
>> endpoint whether they want part of the metadata, or another part of the
>> "LoadTableResult" (e.g. metadata file location; table creds)
>> 4. repurpose the previous direction into a bulk API for the REST spec,
>> where loading pieces of information from many tables is permitted
>> Let me know if there are other directions I failed to account for here.
>>
>> Looking forward to feedback/discussion from the community, thanks!
>> Haizhou
>>
>
