Re: Table and Snapshot Level Configs and Metadata

Peter Vary Mon, 17 May 2021 03:22:34 -0700

Hi Qinhua, Jack,

We are also trying to explore the possibilities for users to share a specific 
version of a table easily.


The use case is that we have a frequently updated working table, and along 
the way we identify specific snapshots that are good working copies to 
share. Other users do not want to know about the lifecycle of the original 
table, but want to query a specific snapshot in their daily routines. We were 
considering enhancing the HiveCatalog to allow creating tables on top of a 
specific snapshot.

Szehon Ho also created an issue, "Allow snapshotting iceberg table", which 
has a similar goal. See: https://github.com/apache/iceberg/issues/2481

The main problem with storing these snapshots outside Iceberg is that we need 
to prevent expiration and removal of the files for these snapshots.
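
To illustrate the expiration concern, here is a minimal sketch in plain 
Python. It does not use any Iceberg API: the Snapshot class, the pinned_ids 
set, and select_expirable are hypothetical names made up for illustration. 
The idea is only that whatever expires old snapshots has to consult the 
externally stored set of shared snapshot ids and skip them:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not an Iceberg class)."""
    snapshot_id: int
    timestamp_ms: int


def select_expirable(snapshots, older_than_ms, pinned_ids):
    """Return snapshots older than the cutoff that nobody has pinned.

    pinned_ids is the set of snapshot ids tracked outside the table
    (for example, ids that shared tables were created on top of).
    """
    return [
        s for s in snapshots
        if s.timestamp_ms < older_than_ms and s.snapshot_id not in pinned_ids
    ]


snapshots = [Snapshot(1, 1_000), Snapshot(2, 2_000), Snapshot(3, 3_000)]
expirable = select_expirable(snapshots, older_than_ms=2_500, pinned_ids={2})
print([s.snapshot_id for s in expirable])
```

If the expiration job does not know about the pinned set, snapshot 2 would 
be selected as well and its files removed, which is exactly the problem.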

Thanks,
Peter

> On May 14, 2021, at 22:08, QH Yan wrote:
> 
> 1. Yes, our use case is as you suggested -- the primary key will be mapped to 
> a partition spec, sort order, etc.
> However, our intention is to hide the details and provide defaults for users, 
> because most of them do not have the expertise to set an optimal partition 
> spec and sort order and are often OK with a default. That means that if we 
> convert the PrimaryKey to a partition spec + sort order, it would be nice to 
> be able to restore it to the user-facing concept. So there's no need to "add" 
> anything; we just need a place to persist this info. 
> Actually a set might be good enough because the order is captured in 
> partition/sort specs.
> 
> 2. Putting all of them in the same place sounds good for now. There's a risk 
> that a user breaks system properties accidentally, but it isn't worth 
> changing the design.
> 
> 3. You are right -- this is a tagging feature that we'd like to support, and 
> thanks for confirming my understanding of Iceberg. I understand that the 
> Snapshot concept is analogous to a git commit, and that is indeed the 
> fundamental blocker to building tagging on top of the Iceberg table spec 
> alone. In the worst case we will probably use a customized Catalog to achieve 
> it. IMHO this is still a common feature, and I'd love to hear what people 
> think of it. 
> 
> Thank you,
> Qinhua
> 
> On Fri, May 14, 2021 at 1:54 PM Jack Ye wrote:
> 1. What is your use case for an ordered primary key? We explored this option but 
> discarded it because ordering is mostly used for secondary indexing, but that 
> should instead leverage information about sort order and partition key in 
> Iceberg. For the upsert use case, ordering is not needed.
> 
> 2. Yes, the read and write properties are defined in 
> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableProperties.java,
> but I think there is nothing preventing you from adding custom properties to 
> that map, because there is a public API for that operation, and this should 
> not be a bottleneck for scalability as long as you have a bounded number of 
> custom properties. But let's see how other people react to my suggestion 
> here; I was not aware that table properties are expected to be used only for 
> read and write configs.
> 
> 3. This sounds to me like a tagging feature, which could be natively 
> supported by Iceberg or be hooked to an external system to achieve the goal. 
> Currently there is no public API for updating that, so I would not suggest 
> leveraging the field. I am currently leaning towards not supporting this 
> natively in Iceberg, because (1) snapshot information should be immutable 
> once written, and (2) Iceberg handles the evolution of a table as a timeline 
> of commits, but this feature is trying to annotate historical commits, which 
> feels out of Iceberg's scope to me. It should be easy enough for you to 
> hook it to an external key-value store where the key is tableId + 
> snapshotId and the value is the property map that you want to manage, given 
> the immutable nature of the table id and historical snapshot ids. Let's see 
> what other people think about this.
> 
> -Jack
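
For reference, a minimal sketch of the external store described above, in 
plain Python with an in-memory dict standing in for the key-value system. 
SnapshotPropertyStore and its methods are hypothetical names, not part of 
Iceberg; the only idea taken from the thread is that the key is tableId + 
snapshotId and the value is a property map:

```python
class SnapshotPropertyStore:
    """Hypothetical in-memory key-value store keyed on (table_id, snapshot_id)."""

    def __init__(self):
        self._props = {}

    def set_property(self, table_id, snapshot_id, key, value):
        # Table ids and historical snapshot ids never change, so the
        # pair is a stable key.
        self._props.setdefault((table_id, snapshot_id), {})[key] = value

    def get_properties(self, table_id, snapshot_id):
        # Return a copy so callers cannot mutate stored state.
        return dict(self._props.get((table_id, snapshot_id), {}))

    def find_by_name(self, table_id, name):
        """Return the snapshot ids of table_id whose "name" property equals name."""
        return sorted(
            sid for (tid, sid), props in self._props.items()
            if tid == table_id and props.get("name") == name
        )


store = SnapshotPropertyStore()
store.set_property("db.events", 1001, "name", "2021-05-10")
store.set_property("db.events", 1002, "name", "2021-05-17")
print(store.find_by_name("db.events", "2021-05-17"))
```

A store like this can name snapshots, but it cannot by itself stop the table 
from expiring a named snapshot.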
> 
> On Fri, May 14, 2021 at 10:31 AM QH Yan wrote:
> Thanks for the reply Jack!
> 
> 1. That is great to have!
> However, we might also like to preserve order among the key columns. What's 
> the reason that it is a set? 
> 
> 2. Sorry, maybe I wasn't clear about the use case. 
> We are working to extend Iceberg for our users. Here I meant metadata that is 
> of interest to our users but not related to Iceberg's functionality. For 
> example, the table owner may want to note that "Table A is about my research 
> on topic B, and it should be published to group C after 2022Q1".  This is not 
> a P0 requirement for us, and I just want to learn whether it is available.
> It seems to me that table.properties aren't intended for it according to 
> "This is used to control settings that affect reading and writing and is not 
> intended to be used for arbitrary metadata. 
> <https://iceberg.apache.org/spec/#table-metadata-fields>" 
> 
> 3. Naming snapshots is a feature that we want to support. 
> For example, it is convenient for a user who often time-travels to a weekly 
> checkpoint if it is named by a formatted date string instead of a random int. 
> This feature is also related to compaction, which was discussed in a previous 
> thread and meeting: we hope to compact and replace a named Snapshot so 
> that the time-travel reader gets better performance, which means the 
> SnapshotSummary could contain information about compaction. 
> Compaction aside (I know the current compaction proposal doesn't 
> work that way), does it sound reasonable to add an optional "name" field 
> to the SnapshotSummary for this?
> 
> 
> On Fri, May 14, 2021 at 12:16 PM Jack Ye wrote:
> 1. The primary key concept is now added to Iceberg as the identifier concept 
> in the schema. I am in the process of adding the documentation. You can read 
> the javadoc for more details for now: 
> https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Schema.java#L189-L210
> 
> 2. Yes, properties is the one for user metadata.
> 
> 3. Why do you want to store additional configs and names for snapshots? The 
> snapshot.summary field is written by the engine and has these defined fields: 
> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotSummary.java
> 
> -Jack
> 
> On Fri, May 14, 2021 at 8:49 AM QH Yan wrote:
> Hi there,
> We want to store 3 different kinds of metadata/config in Iceberg tables.
> 
> 1. Additional settings/admin properties for a table (e.g. PrimaryKey info)
> I think it is table.properties according to here 
> <https://iceberg.apache.org/spec/#table-metadata-fields> and would like to 
> confirm.
> 
> 2. User metadata at table level. 
> Sometimes a user wants to take notes about a table. Is there a field for this? 
> (A map of strings is good enough, and I don't want to abuse table.properties, 
> as the doc also points out.)
> 
> 3. Snapshot name and additional configs
> Seems that snapshot.summary is for both of these according to here 
> <https://iceberg.apache.org/spec/#snapshots>, am I right?
> 
> Thank you!
> 
> -- 
> Qinhua