Yeah, I agree that this use case is quite common. I added some thoughts to the issue; we can continue the discussion there.

-Jack

On Mon, May 17, 2021 at 3:22 AM Peter Vary wrote:

> Hi Qinhua, Jack,
>
> We are also trying to explore the possibilities for users to share a
> specific version of a table easily.
>
> The use case is that we have a frequently updated working table, and
> along the way we identify specific snapshots that are good working copies
> to share. Other users do not want to know about the lifecycle of the
> original table, but want to query the specific snapshot in their daily
> routines. We were considering enhancing the HiveCatalog to allow creating
> tables on top of a specific snapshot.
>
> Szehon Ho also created an issue about "Allow snapshotting iceberg table"
> which would have a similar goal. See:
> https://github.com/apache/iceberg/issues/2481
>
> The main problem with storing these snapshots outside Iceberg is that we
> need to prevent expiration and removal of the files for these snapshots.
>
> Thanks,
> Peter
>
> On May 14, 2021, at 22:08, QH Yan wrote:
>
> 1. Yes, our use case is as you suggested -- the primary key will be mapped
> to the partition spec, sort scheme, etc.
> However, our intention is to hide the details and provide a default for
> users, because most of them do not have the expertise to set an optimal
> partition spec and sort order and are often OK with a default. That means
> if we convert the PrimaryKey to partition spec + sort order, it would be
> nice to restore it back to the user-facing concept. So there's no need to
> "add" anything, and we just need a place to persist this info.
> Actually, a set might be good enough because the order is captured in the
> partition/sort specs.
>
> 2. Putting all of them in the same place sounds good for now. There's a
> risk that a user breaks system properties accidentally, but it isn't worth
> changing the design for that.
>
> 3. You are right -- this is a tagging feature that we'd like to support,
> and thanks for confirming my understanding of Iceberg -- I understand that
> the concept *Snapshot* is analogous to a git commit, and it is indeed the
> fundamental blocker to building tagging on top of the Iceberg table spec
> alone. In the worst case we will probably use a customized Catalog to
> achieve it. IMHO this is still a common feature, and I'd love to hear what
> people think of it.
>
> Thank you,
> Qinhua
>
> On Fri, May 14, 2021 at 1:54 PM Jack Ye wrote:
>
>> 1. What is your use case for an ordered primary key? We explored this
>> option but discarded it because ordering is mostly used for secondary
>> indexing, and that should instead leverage information about sort order
>> and partition key in Iceberg. For the upsert use case, ordering is not
>> needed.
>>
>> 2. Yes, the read and write properties are defined in
>> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableProperties.java,
>> but I think there is nothing preventing you from adding a custom property
>> to that map, because there is a public API for that operation, and this
>> should not be a bottleneck for scalability as long as you have a bounded
>> number of custom properties. But let's see how other people react to my
>> suggestion here; I was not aware that table properties are expected to be
>> used only for read and write configs.
>>
>> 3. This sounds to me like a tagging feature, which could be natively
>> supported by Iceberg or be hooked to an external system to achieve the
>> goal. Currently there is no public API for updating the snapshot summary,
>> so I would not suggest leveraging that field. I am currently leaning
>> towards not supporting this natively in Iceberg, because (1) snapshot
>> information should be immutable once written, and (2) Iceberg handles the
>> evolution of a table as a timeline of commits, but this feature is trying
>> to annotate historical commits, which feels out of the Iceberg scope to
>> me. It should be easy enough for you to hook it to an external key-value
>> store where the key is tableId + snapshotId and the value is the property
>> map that you want to manage, given the immutable nature of the table id
>> and historical snapshot ids. Let's see what other people think about this.
>>
>> -Jack
>>
>> On Fri, May 14, 2021 at 10:31 AM QH Yan wrote:
>>
>>> Thanks for the reply, Jack!
>>>
>>> 1. That is great to have!
>>> However, we might also like to preserve order among the key columns.
>>> What's the reason that it is a set?
>>>
>>> 2. Sorry, maybe I wasn't clear about the use case.
>>> We are working to extend Iceberg for our users. Here I meant metadata
>>> that is of interest to our users but not related to Iceberg's
>>> functionality. For example, the table owner may want to note that
>>> "Table A is about my research on topic B, and it should be published to
>>> group C after 2022Q1".
>>> This is not a P0 requirement for us and I just want to learn whether it
>>> is available.
>>> It seems to me that *table.properties* isn't intended for it according
>>> to "This is used to control settings that affect reading and writing
>>> and is not intended to be used for arbitrary metadata.
>>> <https://iceberg.apache.org/spec/#table-metadata-fields>"
>>>
>>> 3. Naming snapshots is a feature that we want to support.
>>> For example, it is convenient for a user who often time-travels to a
>>> weekly checkpoint if that checkpoint is named by a formatted date string
>>> instead of a random int. This feature is also related to compaction,
>>> which was discussed in a previous thread and meeting: we hope to compact
>>> and replace a named Snapshot so that the time-travel reader gets better
>>> performance, which means the SnapshotSummary could contain information
>>> about compaction.
>>> Setting compaction aside (I know the current compaction proposal doesn't
>>> work that way), does it sound reasonable to add an optional "name" field
>>> to the SnapshotSummary for this?
>>>
>>>
>>> On Fri, May 14, 2021 at 12:16 PM Jack Ye wrote:
>>>
>>>> 1. The primary key concept is now added to Iceberg as the identifier
>>>> concept in the schema. I am in the process of adding the documentation.
>>>> You can read the javadoc for more details for now:
>>>> https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Schema.java#L189-L210
>>>>
>>>> 2. Yes, properties is the one for user metadata.
>>>>
>>>> 3. Why do you want to store additional configs and names for a snapshot?
>>>> The snapshot.summary field is written by the engine and has these
>>>> defined fields:
>>>> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotSummary.java
>>>>
>>>> -Jack
>>>>
>>>> On Fri, May 14, 2021 at 8:49 AM QH Yan wrote:
>>>>
>>>>> Hi there,
>>>>> We want to store 3 different kinds of metadata/config in Iceberg
>>>>> tables.
>>>>>
>>>>> 1. Additional settings/admin properties for a table (e.g. PrimaryKey
>>>>> info)
>>>>> I think it is *table.properties* according to here
>>>>> <https://iceberg.apache.org/spec/#table-metadata-fields> and would
>>>>> like to confirm.
>>>>>
>>>>> 2. User metadata at table level.
>>>>> Sometimes a user wants to take notes about a table. Is there a field
>>>>> for this? (A map of String is good enough, and I don't want to abuse
>>>>> table.properties, as the doc also points out.)
>>>>>
>>>>> 3. Snapshot name and additional configs
>>>>> It seems that *snapshot.summary* is for both of these according to here
>>>>> <https://iceberg.apache.org/spec/#snapshots>, am I right?
>>>>>
>>>>> Thank you!
>>>>>
>>>>> --
>>>>> *Qinhua*
>>>
>>> --
>>> *Qinhua*
>
> --
> *Qinhua*
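
Jack's suggestion in point 3 of his 1:54 PM reply (an external key-value store keyed by tableId + snapshotId, with a property map as the value) could be sketched roughly as follows. This is only an illustration in plain Java under stated assumptions: the class SnapshotAnnotationStore, its methods, and the "name" property key are hypothetical and not part of Iceberg, and an in-memory map stands in for whatever external store would actually be used.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for an external store of snapshot annotations. */
final class SnapshotAnnotationStore {

  /** Composite key: table id plus snapshot id, both stable once a snapshot is written. */
  static final class Key {
    final String tableId;
    final long snapshotId;

    Key(String tableId, long snapshotId) {
      this.tableId = Objects.requireNonNull(tableId, "tableId");
      this.snapshotId = snapshotId;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return snapshotId == other.snapshotId && tableId.equals(other.tableId);
    }

    @Override
    public int hashCode() {
      return Objects.hash(tableId, snapshotId);
    }
  }

  private final Map<Key, Map<String, String>> annotations = new ConcurrentHashMap<>();

  /** Sets one property on a snapshot's annotation map, creating the map if needed. */
  void put(String tableId, long snapshotId, String property, String value) {
    annotations
        .computeIfAbsent(new Key(tableId, snapshotId), k -> new ConcurrentHashMap<>())
        .put(property, value);
  }

  /** Returns an unmodifiable copy of a snapshot's annotations (empty if there are none). */
  Map<String, String> get(String tableId, long snapshotId) {
    Map<String, String> props = annotations.get(new Key(tableId, snapshotId));
    return props == null
        ? Collections.emptyMap()
        : Collections.unmodifiableMap(new HashMap<>(props));
  }

  /** Resolves a user-facing name (e.g. a date string) to a snapshot id within one table. */
  Optional<Long> findSnapshotByName(String tableId, String name) {
    return annotations.entrySet().stream()
        .filter(e -> e.getKey().tableId.equals(tableId))
        .filter(e -> name.equals(e.getValue().get("name")))
        .map(e -> e.getKey().snapshotId)
        .findFirst();
  }
}
```

With something like this, a weekly checkpoint could be registered under a formatted date string and resolved back to a snapshot id before time-traveling. As Peter points out, a store outside Iceberg does nothing by itself to keep the referenced snapshots from being expired, so that would still have to be handled separately.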
