Re: Table and Snapshot Level Configs and Metadata

Jack Ye Mon, 17 May 2021 20:56:27 -0700

Yeah, I agree that this use case is quite common. I added some thoughts to
the issue; we can continue the discussion there.
-Jack


On Mon, May 17, 2021 at 3:22 AM Peter Vary <[email protected]>
wrote:

> Hi Qinhua, Jack,
>
> We are also trying to explore the possibilities for users to share a
> specific version of a table easily.
>
> The use case is that we have a frequently updated working table, and along
> the way we identify specific snapshots that are good working copies to
> share. Other users do not want to know about the lifecycle of the original
> table, but want to query the specific snapshot in their daily routines. We
> were considering enhancing the HiveCatalog to allow creating tables on top
> of a specific snapshot.
>
> Szehon Ho also created an issue about "Allow snapshotting iceberg table"
> which would have a similar goal. See:
> https://github.com/apache/iceberg/issues/2481
>
> The main problem with storing these snapshots outside Iceberg is that we
> need to prevent expiration and removal of the files for these snapshots.
>
> Thanks,
> Peter
>
> On May 14, 2021, at 22:08, QH Yan <[email protected]> wrote:
>
> 1. Yes, our use case is as you suggested -- the primary key will be mapped
> to the partition spec, sort order, etc.
> However, our intention is to hide the details and provide defaults for
> users, because most of them do not have the expertise to set an optimal
> partition spec and sort order and are often OK with a default. That means
> that if we convert the PrimaryKey to a partition spec + sort order, it would
> be nice to be able to restore it back to the user-facing concept. So there's
> no need to "add" anything; we just need a place to persist this info.
> Actually, a set might be good enough because the order is captured in the
> partition/sort specs.
>
> 2. Putting all of them in the same place sounds good for now. There's a
> risk that a user breaks system properties accidentally, but it isn't worth
> changing the design.
>
> 3. You are right -- this is a tagging feature that we'd like to support,
> and thanks for confirming my understanding of Iceberg -- I understand that
> the concept *Snapshot* is analogous to a git commit, and it is indeed the
> fundamental blocker to building tagging on top of the Iceberg table spec
> alone. In the worst case we will probably use a customized Catalog to
> achieve it. IMHO this is still a common feature, and I'd love to hear what
> people think of it.
>
> Thank you,
> Qinhua
>
> On Fri, May 14, 2021 at 1:54 PM Jack Ye <[email protected]> wrote:
>
>> 1. What is your use case for an ordered primary key? We explored this
>> option but discarded it because ordering is mostly used for secondary
>> indexing, and that should instead leverage the sort order and partition key
>> information in Iceberg. For the upsert use case, ordering is not needed.
>>
>> 2. Yes, the read and write properties are defined in
>> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableProperties.java,
>> but I think there is nothing preventing you from adding custom properties
>> to that map, because there is a public API for that operation, and this
>> should not be a bottleneck for scalability as long as you have a bounded
>> number of custom properties. But let's see how other people react to my
>> suggestion here; I was not aware that table properties are expected to be
>> used only for read and write configs.
>>
>> 3. This sounds to me like a tagging feature, which could be natively
>> supported by Iceberg or be hooked to an external system to achieve the
>> goal. Currently there is no public API for updating the snapshot summary,
>> so I would not suggest leveraging that field. I am currently leaning
>> towards not supporting this natively in Iceberg, because (1) snapshot
>> information should be immutable once written, and (2) Iceberg handles the
>> evolution of a table as a timeline of commits, but this feature is trying
>> to annotate the historical commits, which feels out of the Iceberg scope
>> to me. It should be easy enough for you to hook it to an external
>> key-value storage where the key is tableId + snapshotId and the value is
>> the property map that you want to manage, given the immutable nature of
>> the table id and historical snapshot ids. Let's see what other people
>> think about this.
>>
>> -Jack
>>
>> On Fri, May 14, 2021 at 10:31 AM QH Yan <[email protected]> wrote:
>>
>>> Thanks for the reply Jack!
>>>
>>> 1. That is great to have!
>>> However, we might also like to preserve order among the key columns.
>>> What's the reason that it is a set?
>>>
>>> 2. Sorry, maybe I wasn't clear about the use case.
>>> We are working to extend Iceberg for our users. Here I meant metadata
>>> that is of interest to our users but not related to Iceberg's
>>> functionality. For example, the table owner may want to note that "Table A
>>> is about my research on topic B, and it should be published to group C
>>> after 2022Q1". This is not a P0 requirement for us; I just want to learn
>>> whether it is available.
>>> It seems to me that *table.properties* isn't intended for this, according
>>> to "This is used to control settings that affect reading and writing
>>> and is not intended to be used for arbitrary metadata.
>>> <https://iceberg.apache.org/spec/#table-metadata-fields>"
>>>
>>> 3. Naming snapshots is a feature that we want to support.
>>> For example, it is convenient for a user who often time-travels to a
>>> weekly checkpoint if that checkpoint is named by a formatted date string
>>> instead of a random int. This feature is also related to compaction, which
>>> was discussed in a previous thread and meeting: we hope to compact and
>>> replace a named Snapshot so that the time-travel reader gets better
>>> performance, which means the SnapshotSummary could contain information
>>> about compaction.
>>> Leaving compaction aside (I know the current compaction proposal doesn't
>>> work that way), does it sound reasonable to add an optional "name" field
>>> in the SnapshotSummary for this?
>>>
>>>
>>> On Fri, May 14, 2021 at 12:16 PM Jack Ye <[email protected]> wrote:
>>>
>>>> 1. The primary key concept is now added to Iceberg as the identifier
>>>> concept in the schema. I am in the process of adding the documentation.
>>>> You can read the javadoc for more details for now:
>>>> https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Schema.java#L189-L210
>>>>
>>>> 2. Yes, properties is the one for user metadata.
>>>>
>>>> 3. Why do you want to store additional configs and names for snapshots?
>>>> The snapshot.summary field is written by the engine and has these defined
>>>> fields:
>>>> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotSummary.java
>>>>
>>>> -Jack
>>>>
>>>> On Fri, May 14, 2021 at 8:49 AM QH Yan <[email protected]> wrote:
>>>>
>>>>> Hi there,
>>>>> We want to store 3 different kinds of metadata/config in Iceberg
>>>>> tables.
>>>>>
>>>>> 1. Additional settings/admin properties for a table (e.g. PrimaryKey
>>>>> info)
>>>>> I think it is *table.properties* according to here
>>>>> <https://iceberg.apache.org/spec/#table-metadata-fields> and would
>>>>> like to confirm.
>>>>>
>>>>> 2. User metadata at table level.
>>>>> Sometimes a user wants to take notes about a table. Is there a field
>>>>> for this? (A map of strings is good enough, and I don't want to abuse
>>>>> table.properties, as the doc points out.)
>>>>>
>>>>> 3. Snapshot name and additional configs
>>>>> It seems that *snapshot.summary* is for both of these according to here
>>>>> <https://iceberg.apache.org/spec/#snapshots>, am I right?
>>>>>
>>>>> Thank you!
>>>>>
>>>>> --
>>>>> *Qinhua*
>>>>>
>>>>>
>>>
>>> --
>>> *Qinhua*
>>>
>>>
>
> --
> *Qinhua*
>
>
>
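
For reference, the external key-value approach suggested in this thread (key is tableId + snapshotId, value is a property map) could be sketched roughly as below. This is only an illustration under stated assumptions: SnapshotTagStore and all of its methods are hypothetical names, the store is an in-memory stand-in for whatever external system would be used, and nothing here calls Iceberg itself.

```python
from typing import Dict, Optional, Tuple


class SnapshotTagStore:
    """Hypothetical in-memory stand-in for an external key-value store."""

    def __init__(self) -> None:
        # (table_id, snapshot_id) -> user-managed property map
        self._props: Dict[Tuple[str, int], Dict[str, str]] = {}
        # (table_id, tag name) -> snapshot_id
        self._names: Dict[Tuple[str, str], int] = {}

    def put_properties(self, table_id: str, snapshot_id: int,
                       props: Dict[str, str]) -> None:
        """Merge props into the map stored for this snapshot."""
        self._props.setdefault((table_id, snapshot_id), {}).update(props)

    def get_properties(self, table_id: str, snapshot_id: int) -> Dict[str, str]:
        """Return a copy of the stored map (empty if nothing was stored)."""
        return dict(self._props.get((table_id, snapshot_id), {}))

    def tag(self, table_id: str, name: str, snapshot_id: int) -> None:
        """Point a human-readable name at a snapshot id."""
        self._names[(table_id, name)] = snapshot_id

    def resolve(self, table_id: str, name: str) -> Optional[int]:
        """Look up the snapshot id for a name, or None if unknown."""
        return self._names.get((table_id, name))


# Example: name a weekly checkpoint by date and attach a note to it.
store = SnapshotTagStore()
store.tag("db.events", "2021-05-14", 1234567890)
store.put_properties("db.events", 1234567890, {"note": "weekly checkpoint"})
snapshot_id = store.resolve("db.events", "2021-05-14")
```

A store like this only records names and properties. As noted earlier in the thread, it does not by itself prevent the referenced snapshots from being expired or their files from being removed.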

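Similarly, persisting the user-facing primary key alongside the other table properties, as discussed in point 1 and point 2 of the thread, might look like the sketch below. The property key "custom.primary-key-columns" and both helper functions are hypothetical and not part of Iceberg; the properties map is modeled as a plain dict of strings rather than through any Iceberg API.

```python
import json
from typing import Dict, List

# Hypothetical custom property key; not defined by Iceberg.
PRIMARY_KEY_PROPERTY = "custom.primary-key-columns"


def store_primary_key(properties: Dict[str, str],
                      columns: List[str]) -> Dict[str, str]:
    """Return a copy of the properties map with the key columns recorded."""
    updated = dict(properties)
    # Property values are strings, so the column list is JSON-encoded.
    updated[PRIMARY_KEY_PROPERTY] = json.dumps(list(columns))
    return updated


def load_primary_key(properties: Dict[str, str]) -> List[str]:
    """Restore the user-facing key columns, or [] if none were recorded."""
    raw = properties.get(PRIMARY_KEY_PROPERTY)
    return json.loads(raw) if raw is not None else []


# Example: existing properties are preserved, the key columns are added.
props = store_primary_key({"write.format.default": "parquet"},
                          ["user_id", "event_ts"])
columns = load_primary_key(props)
```

Keeping the columns under a single, clearly prefixed key is one way to lower the risk raised in the thread of a user overwriting system properties by accident.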