rdblue commented on a change in pull request #3425:
URL: https://github.com/apache/iceberg/pull/3425#discussion_r756345310

##########
File path: site/docs/snapshot-tag-branch.md
##########
@@ -0,0 +1,168 @@
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# Snapshot Tagging and Branching
+
+Iceberg snapshot tagging and branching feature provides users more functionalities in managing Iceberg snapshot lifecycle.
+Users can assign tags to snapshots, create new branches, set the current branch for read and write, and configure customized retention policy for them.
+
+## Example use cases
+
+### Time-based Snapshot tagging
+
+Users can leverage Iceberg snapshot tagging to keep multiple versions of the table across different points in time.
+For example, a table can be configured to keep all snapshots within 24 hours, then 1 tagged snapshot per day, per week, per month, etc.
+The daily snapshots are retained for 1 week, weekly snapshots are retained for 1 month, monthly snapshots are retained for 1 year, etc.
+
+### Critical snapshot maintenance branch
+
+There are snapshots that are critical for legal or business reasons, such as the yearly snapshots used for financial auditing.
+Because they are kept for an extended period of time (maybe even forever), data files in the table are commonly compacted and encrypted with periodic key rotation.
+Occasionally, rows in the snapshot also have to be deleted or updated to satisfy GDPR requirements.
+Users can create an Iceberg branch for such snapshots to maintain its independent lifecycle.
+
+### Experimental Branch
+
+An experimental branch is useful for many user groups, including:
+
+1. Data scientists and ML researchers can easily create an Iceberg branch to experiment with table data without worrying about polluting the main table snapshot.
+2. Data engineers can perform production AB testing against the experimental branch to ensure the correctness of certain table updates.
+3. Data producers can perform test load in a table in an experimental branch, and then append all the loaded files back to the main branch (similar to Git cherry-pick).
+
+!!!Note
+    Iceberg does not plan to offer a Git-like merge operation through branching.
+    Merging arbitrary changes requires a lot of work to keep track of the intent of the commit and the context.
+    Merging in a table is actually committing a transaction. The expectation is different from a merge in Git, where the lack of a conflict is the definition of "correct".
+    In a table, the lack of a file conflict does not mean that the transaction can be committed.
+    In addition, longer transaction lengths from branch-like behavior dramatically increases the likelihood that the transaction could fail.
+    The merge feature would likely be supported through multi-table transaction in the future.
+
+## Snapshot Reference
+
+In version control systems like git, branch and tag are both references of commits.
+In Iceberg, we use a similar concept of **Snapshot Reference** to implement branching and tagging.
+
+Each Iceberg table metadata contains a list of `refs` (references), and a `current-branch` indicating the current branch to use.
+When user creates an Iceberg table, the first commit belongs to the default `main` branch.
+Each snapshot reference has a uniquely identifiable name across all references of a table.
+A snapshot can have multiple references. The exact snapshot reference spec is documented at the [Spec](../spec/#snapshot-reference) page.
+Here we will provide some more explanations to the concepts in snapshot reference.
+
+### Reference Type
+
+There are clearly 2 types of snapshot reference, which are `branch` and `tag`. Their key differences are:
+
+- **New commit**: when a new snapshot is added as a child of a referenced snapshot, tag remains on the old snapshot, but branch reference moves to the child.
+
+- **Retention policy**: retention policy affects all the snapshots in a branch, but only a single tagged snapshot. (More details in the next section)
+
+### Retention Policy
+
+Iceberg offers a [snapshot expiration procedure](../spark-procedures/#expire_snapshots) to clean up snapshots that are not needed to free up storage space.
+Retention policy can be configured both globally and on snapshot reference to provide highly flexible customization to the expiration behavior.
+
+#### Global snapshot retention policy
+
+Global snapshot retention policy can be set through the following table properties:
+
+| Property | Default | Description |
+| ------------------------------------ | ------------------ | ------------------------------------------------------------- |
+| history.expire.max-snapshot-age-ms | 432000000 (5 days) | Default max age of snapshots to keep while expiring snapshots |
+| history.expire.min-snapshots-to-keep | 1 | Default min number of snapshots to keep while expiring snapshots |
+
+#### Snapshot reference retention policy
+
+Similarly, snapshot reference has the properties below to provide finer grain control:
+
+| Property | Type | Description |
+|------------------------------|-----------|-------------|
+| **`min-snapshots-to-keep`** | `int` | For `branch` type only, the minimum number of snapshots to keep in a branch, default to the current value of table property `history.expire.min-snapshots-to-keep` when this value is evaluated |
+| **`max-snapshot-age-ms`** | `long` | The duration before a snapshot tagged or in a branch could be expired by any automatic snapshot expiration process, default to the current value of table property `history.expire.max-snapshot-age-ms` when this value is evaluated |
+
+#### Policy evaluation mechanism
+
+When a snapshot expiration process starts, it follows the steps described below:
+
+1. form an expiration candidate pool containing all snapshots
+2. for each snapshot reference, evaluate the associated policy and move snapshots out of the candidate pool
+3. apply global retention policy to and move snapshots out of the candidate pool
+4. when multiple snapshots can be chosen to be moved out, newer snapshots win
+4. after evaluation, expire all snapshots that are still in the candidate pool
+
+#### Policy evaluation example
+
+Here is an example for how an Iceberg snapshot expiration procedure evaluates what snapshots to expire.
+
+Suppose we have the following snapshot graph and retention policies configured:
+
+```
+A -> B -> C (main)
+ \        (dev)
+  D -> E (b1)
+   \
+    F -> G (b2)
+     \
+      H (b3)
+```
+
+| Policy Type | Max Age | Min to Keep | Snapshots Affected |
+|------------------|---------------|-------------|--------------------------|
+| global | 5 hours | 4 | A, B, C, D, E, F, G, H |
+| branch/main | 3 days | 1 | A, B, C |
+| branch/b1 | 2 days | 2 | A, D, E |
+| branch/b2 | 1 day | 0 | A, D, F, G |
+| tag/dev | forever | N/A | C |
+
+Assume that we have a process continuously running the snapshot expiration procedure, we would have the results below as time progresses:
+
+##### Day 1: F, G, H are expired
+
+On day 1, the global and branch b2 max age has passed, affecting A, D, F, G, H.
+There is no policy configured for branch b3, so H is expired.
+On branch b2, A and D cannot be expired due to branch b1 policy, so only F and G are expired.

Review comment:
Like the process above, I think it will be more clear when this focuses on what to retain, rather than what to expire. The problem is that focusing on what to expire begs the question about which setting takes priority for removal. Focusing on what to retain avoids that problem because it is clear that a snapshot may be retained for multiple reasons.

Here, I'd say that:
* `main` is set to keep 3 days of snapshots, so all ancestors of C newer than 3 days should be kept: (C, B, A)
* `b1` is set to keep 2 days of snapshots, so all ancestors of E newer than 2 days are kept: (E, D, A)
* `dev` retains snapshot E (I think)
* `b2` aged off, so it doesn't affect retention

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
