[GitHub] [iceberg] harshm-dev commented on a change in pull request #3425: Doc: add snapshot tagging and branching to spec

GitBox Mon, 01 Nov 2021 03:55:36 -0700


harshm-dev commented on a change in pull request #3425:
URL: https://github.com/apache/iceberg/pull/3425#discussion_r740108173




##########
File path: site/docs/snapshot-tag-branch.md
##########
@@ -0,0 +1,148 @@
+# Snapshot Tagging and Branching
+
+Iceberg snapshot tagging and branching feature offers user a Git-like 
experience in manging table snapshots.
+Users can assign tags to snapshots, create branches and configure customized 
retention policy for them.
+
+## Example use cases
+
+### Time-based Snapshot tagging
+
+Users can leverage Iceberg snapshot tagging to keep multiple versions of the 
table across different points in time.
+For example, a table can be configured to keep all snapshots within 24 hours, 
then 1 tagged snapshot per day, per week, per month, etc.
+The daily snapshots are retained for 1 week, weekly snapshots are retained for 
1 month, monthly snapshots are retained for 1 year, etc.
+
+### Critical snapshot maintenance branch
+
+There are snapshots that are critical for legal or business reasons, such as 
the yearly snapshots used for financial auditing.
+Because they are kept for an extended period of time (maybe even forever), 
data files in the table are commonly compacted and encrypted with periodic key 
rotation.
+Occasionally, rows in the snapshot also have to be deleted or updated to 
satisfy GDPR requirements.
+Users can create an Iceberg branch for such snapshots to maintain its 
independent lifecycle.
+
+### Experimental Branch
+
+An experimental branch is useful for many user groups, including:
+
+1. Data scientists and ML researchers can easily create an Iceberg branch to 
experiment with table data without worrying about polluting the main table 
snapshot.
+2. Data engineers can perform production AB testing against the experimental 
branch to ensure the correctness of certain table updates.
+3. Data producers can perform test load in a table in an experimental branch, 
and then append all the loaded files back to the main branch (similar to Git 
cherry-pick).

Review comment:
       Since Iceberg does not plan to offer a Git-like merge, can we assume that 
cherry-pick can happen only in a scenario similar to a fast-forward merge (where 
there is no additional commit on the base)?
   If so, do you think it is worth mentioning this prerequisite? 
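
To make the prerequisite concrete, here is a minimal sketch of the check I have in mind, assuming each snapshot records a single parent ID. The `parents` map and the names `is_ancestor` and `can_fast_forward` are hypothetical and only for illustration; they are not Iceberg APIs.

```python
from typing import Dict, Optional


def is_ancestor(parents: Dict[int, Optional[int]], ancestor_id: int, snapshot_id: int) -> bool:
    """Return True if ancestor_id is reachable from snapshot_id via parent links."""
    current: Optional[int] = snapshot_id
    while current is not None:
        if current == ancestor_id:
            return True
        # Walk one step up the lineage; unknown IDs end the walk.
        current = parents.get(current)
    return False


def can_fast_forward(parents: Dict[int, Optional[int]], base_head: int, branch_head: int) -> bool:
    """A fast-forward is possible only if the base head is an ancestor of the branch head."""
    return is_ancestor(parents, base_head, branch_head)


# Hypothetical lineage: 1 <- 2 <- 3, plus 4 committed on top of 2.
parents = {1: None, 2: 1, 3: 2, 4: 2}

# Base still at 2, experimental branch at 3: no extra commit on the base.
print(can_fast_forward(parents, base_head=2, branch_head=3))
# Base advanced to 4 in the meantime: the branch at 3 no longer descends from it.
print(can_fast_forward(parents, base_head=4, branch_head=3))
```

If the check fails, the base has moved on since the branch was created, which is the case where I think the doc should say what happens.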

##########
File path: site/docs/snapshot-tag-branch.md
##########
@@ -0,0 +1,148 @@
+# Snapshot Tagging and Branching
+
+Iceberg snapshot tagging and branching feature offers user a Git-like 
experience in manging table snapshots.
+Users can assign tags to snapshots, create branches and configure customized 
retention policy for them.
+
+## Example use cases
+
+### Time-based Snapshot tagging
+
+Users can leverage Iceberg snapshot tagging to keep multiple versions of the 
table across different points in time.
+For example, a table can be configured to keep all snapshots within 24 hours, 
then 1 tagged snapshot per day, per week, per month, etc.
+The daily snapshots are retained for 1 week, weekly snapshots are retained for 
1 month, monthly snapshots are retained for 1 year, etc.
+
+### Critical snapshot maintenance branch
+
+There are snapshots that are critical for legal or business reasons, such as 
the yearly snapshots used for financial auditing.
+Because they are kept for an extended period of time (maybe even forever), 
data files in the table are commonly compacted and encrypted with periodic key 
rotation.
+Occasionally, rows in the snapshot also have to be deleted or updated to 
satisfy GDPR requirements.
+Users can create an Iceberg branch for such snapshots to maintain its 
independent lifecycle.
+
+### Experimental Branch
+
+An experimental branch is useful for many user groups, including:
+
+1. Data scientists and ML researchers can easily create an Iceberg branch to 
experiment with table data without worrying about polluting the main table 
snapshot.
+2. Data engineers can perform production AB testing against the experimental 
branch to ensure the correctness of certain table updates.
+3. Data producers can perform test load in a table in an experimental branch, 
and then append all the loaded files back to the main branch (similar to Git 
cherry-pick).
+
+!!!Note
+    Iceberg does not plan to offer a Git-like merge operation through 
branching.
+    Merging arbitrary changes requires a lot of work to keep track of the 
intent of the commit and the context. 
+    Merging in a table is actually committing a transaction. The expectation 
is different from a merge in Git, where the lack of a conflict is the 
definition of "correct". 
+    In a table, the lack of a file conflict does not mean that the transaction 
can be committed.
+    In addition, longer transaction lengths from branch-like behavior 
dramatically increases the likelihood that the transaction could fail.
+    The merge feature would likely be supported through multi-table 
transaction in the future.
+
+## Snapshot Reference
+
+In version control systems like git, branch and tag are both references of 
commits.
+In Iceberg, we use a similar concept of **Snapshot Reference** to implement 
branching and tagging.
+
+Each Iceberg table metadata contains a list of `refs` (references), and a 
`current-branch` indicating the current branch to use.
+When user creates an Iceberg table, the first commit belongs to the default 
`main` branch.
+Each snapshot reference has a uniquely identifiable name across all references 
of a table.
+A snapshot can have multiple references. The exact snapshot reference spec is 
documented at the [Spec](../spec/#snapshot-reference) page.
+Here we will provide some more explanations to the concepts in snapshot 
reference.
+
+### Reference Type
+
+There are clearly 2 types of snapshot reference, which are `branch` and `tag`. 
Their key differences are:
+
+- **New commit**: when a new snapshot is added as a child of a referenced 
snapshot, tag remains on the old snapshot, but branch reference moves to the 
child.
+
+- **Retention policy**: retention policy affects all the snapshots in a 
branch, but only a single tagged snapshot. (More details in the next section)
+
+### Retention Policy
+
+Iceberg offers a [snapshot expiration 
procedure](../spark-procedures/#expire_snapshots) to clean up snapshots that 
are not needed to free up storage space.
+Retention policy can be configured both globally and on snapshot reference to 
provide highly flexible customization to the expiration behavior.
+
+#### Global snapshot retention policy
+
+Global snapshot retention policy can be set through the following table 
properties:
+
+| Property                             | Default            | Description                                                      |
+| ------------------------------------ | ------------------ | ---------------------------------------------------------------- |
+| history.expire.max-snapshot-age-ms   | 432000000 (5 days) | Default max age of snapshots to keep while expiring snapshots    |
+| history.expire.min-snapshots-to-keep | 1                  | Default min number of snapshots to keep while expiring snapshots |
+
+#### Snapshot reference retention policy
+
+Similarly, snapshot reference has the properties below to provider finer grain 
control:

Review comment:
       nit: typo
   `provider finer grain` -> `provide finer-grained`

##########
File path: site/docs/spec.md
##########
@@ -548,6 +548,21 @@ Notes:
 1. An alternative, *strict projection*, creates a partition predicate that 
will match a file if all of the rows in the file must match the scan predicate. 
These projections are used to calculate the residual predicates for each file 
in a scan.
 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete 
file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore 
the delete file because none of the deletes can match a row that will be 
selected.
 
+#### Snapshot Reference
+
+Snapshot reference allows users to perform tagging and branching within an 
Iceberg table. The detailed user experience is described in [Iceberg Tagging 
and Branching](../snapshot-tag-branch).
+
+The snapshot reference object records all the user-defined information of a 
snapshot including name, reference type and retention policy configurations.
+
+| v2         | Field name                   | Type      | Description |
+| ---------- |------------------------------|-----------|-------------|
+| _required_ | **`snapshot-id`**            | `long`    | The ID of the snapshot referenced |
+| _required_ | **`name`**                   | `string`  | The name of the reference |
+| _required_ | **`type`**                   | `string`  | Type of the reference, `tag` or `branch` |

Review comment:
       Would it be possible to define this as an enum rather than a free-form 
string?
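
A rough sketch of what I mean, assuming the value is still serialized as the strings `tag` and `branch` from the table above. `SnapshotRefType` and `parse_ref_type` are hypothetical names for illustration, not something this PR or the spec defines.

```python
from enum import Enum


class SnapshotRefType(Enum):
    """Hypothetical enum covering the two reference types listed in the spec table."""

    TAG = "tag"
    BRANCH = "branch"


def parse_ref_type(value: str) -> SnapshotRefType:
    """Map the serialized string to the enum, rejecting anything else."""
    try:
        return SnapshotRefType(value)
    except ValueError:
        raise ValueError(f"Unknown snapshot reference type: {value!r}") from None


print(parse_ref_type("branch"))
```

The on-disk type can stay `string`; the point is only that readers reject values other than the two listed.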

##########
File path: site/docs/spec.md
##########
@@ -548,6 +548,21 @@ Notes:
 1. An alternative, *strict projection*, creates a partition predicate that 
will match a file if all of the rows in the file must match the scan predicate. 
These projections are used to calculate the residual predicates for each file 
in a scan.
 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete 
file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore 
the delete file because none of the deletes can match a row that will be 
selected.
 
+#### Snapshot Reference
+
+Snapshot reference allows users to perform tagging and branching within an 
Iceberg table. The detailed user experience is described in [Iceberg Tagging 
and Branching](../snapshot-tag-branch).
+
+The snapshot reference object records all the user-defined information of a 
snapshot including name, reference type and retention policy configurations.
+
+| v2         | Field name                   | Type      | Description |
+| ---------- |------------------------------|-----------|-------------|
+| _required_ | **`snapshot-id`**            | `long`    | The ID of the snapshot referenced |
+| _required_ | **`name`**                   | `string`  | The name of the reference |
+| _required_ | **`type`**                   | `string`  | Type of the reference, `tag` or `branch` |
+| _optional_ | **`min-snapshots-to-keep`**  | `int`     | For `branch` type only, the minimum number of snapshots to keep in a branch |

Review comment:
       Would it make sense to mention the default, since this is an optional 
field?
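
For example, one possible behavior (this is purely my assumption, the PR does not state it) would be to fall back to the global `history.expire.min-snapshots-to-keep` table property, whose documented default is 1. A minimal sketch, with the hypothetical helper `effective_min_snapshots_to_keep`:

```python
from typing import Any, Dict

# Property name and default are taken from the global retention table in the doc.
GLOBAL_MIN_SNAPSHOTS_PROPERTY = "history.expire.min-snapshots-to-keep"
GLOBAL_MIN_SNAPSHOTS_DEFAULT = 1


def effective_min_snapshots_to_keep(ref: Dict[str, Any], table_properties: Dict[str, str]) -> int:
    """Resolve the value for a branch ref; the fallback order is an assumption."""
    if ref.get("type") != "branch":
        raise ValueError("min-snapshots-to-keep applies to branch references only")
    value = ref.get("min-snapshots-to-keep")
    if value is not None:
        # The reference sets its own value, so it takes precedence.
        return int(value)
    # Assumed fallback: table property first, then the documented global default.
    return int(table_properties.get(GLOBAL_MIN_SNAPSHOTS_PROPERTY, GLOBAL_MIN_SNAPSHOTS_DEFAULT))


ref = {"snapshot-id": 1, "name": "audit", "type": "branch"}
print(effective_min_snapshots_to_keep(ref, {}))
```

Whatever the intended behavior is, stating it next to the field would remove the guesswork.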

##########
File path: site/docs/spec.md
##########
@@ -548,6 +548,21 @@ Notes:
 1. An alternative, *strict projection*, creates a partition predicate that 
will match a file if all of the rows in the file must match the scan predicate. 
These projections are used to calculate the residual predicates for each file 
in a scan.
 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete 
file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore 
the delete file because none of the deletes can match a row that will be 
selected.
 
+#### Snapshot Reference
+
+Snapshot reference allows users to perform tagging and branching within an 
Iceberg table. The detailed user experience is described in [Iceberg Tagging 
and Branching](../snapshot-tag-branch).
+
+The snapshot reference object records all the user-defined information of a 
snapshot including name, reference type and retention policy configurations.
+
+| v2         | Field name                   | Type      | Description |
+| ---------- |------------------------------|-----------|-------------|
+| _required_ | **`snapshot-id`**            | `long`    | The ID of the snapshot referenced |
+| _required_ | **`name`**                   | `string`  | The name of the reference |

Review comment:
       Is the name required to be unique? If so, would it be worth mentioning 
that here?
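
The user-facing doc does say each reference name is unique across all references of a table, so the spec could state the same constraint. A minimal sketch of the validation a writer might run, assuming `refs` is a list of objects shaped like the table above; `duplicate_ref_names` is a hypothetical helper, not an Iceberg API.

```python
from collections import Counter
from typing import Any, Dict, List


def duplicate_ref_names(refs: List[Dict[str, Any]]) -> List[str]:
    """Return the reference names that occur more than once, sorted."""
    counts = Counter(ref["name"] for ref in refs)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical refs list: two references share the name "audit".
refs = [
    {"snapshot-id": 10, "name": "main", "type": "branch"},
    {"snapshot-id": 11, "name": "audit", "type": "tag"},
    {"snapshot-id": 12, "name": "audit", "type": "branch"},
]
print(duplicate_ref_names(refs))
```

Note that the sketch treats a tag and a branch with the same name as a conflict, which is how I read "unique across all references".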

##########
File path: site/docs/spec.md
##########
@@ -581,10 +596,11 @@ Table metadata consists of the following fields:
 | _optional_ | _optional_ | **`metadata-log`**| A list (optional) of timestamp and metadata file location pairs that encodes changes to the previous metadata files for the table. Each time a new metadata file is created, a new entry of the previous metadata file location should be added to the list. Tables can be configured to remove oldest metadata log entries and keep a fixed-size log of the most recent entries after a commit. |
 | _optional_ | _required_ | **`sort-orders`**| A list of sort orders, stored as full sort order objects. |
 | _optional_ | _required_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |
+|            | _optional_ | **`refs`** | A list of snapshot references, stored as full snapshot reference objects. |
+|            | _optional_ | **`current-branch`** | The name of the current branch. If not specified, it defaults to the `main` branch that starts with the table creation commit. | 

Review comment:
       Should we have `current-branch` in the table metadata at all? Would it 
make more sense to position this as a client-session-specific config?

##########
File path: site/docs/snapshot-tag-branch.md
##########
@@ -0,0 +1,148 @@
+# Snapshot Tagging and Branching
+
+Iceberg snapshot tagging and branching feature offers user a Git-like 
experience in manging table snapshots.
+Users can assign tags to snapshots, create branches and configure customized 
retention policy for them.
+
+## Example use cases
+
+### Time-based Snapshot tagging
+
+Users can leverage Iceberg snapshot tagging to keep multiple versions of the 
table across different points in time.
+For example, a table can be configured to keep all snapshots within 24 hours, 
then 1 tagged snapshot per day, per week, per month, etc.
+The daily snapshots are retained for 1 week, weekly snapshots are retained for 
1 month, monthly snapshots are retained for 1 year, etc.
+
+### Critical snapshot maintenance branch
+
+There are snapshots that are critical for legal or business reasons, such as 
the yearly snapshots used for financial auditing.
+Because they are kept for an extended period of time (maybe even forever), 
data files in the table are commonly compacted and encrypted with periodic key 
rotation.
+Occasionally, rows in the snapshot also have to be deleted or updated to 
satisfy GDPR requirements.
+Users can create an Iceberg branch for such snapshots to maintain its 
independent lifecycle.
+
+### Experimental Branch
+
+An experimental branch is useful for many user groups, including:
+
+1. Data scientists and ML researchers can easily create an Iceberg branch to 
experiment with table data without worrying about polluting the main table 
snapshot.
+2. Data engineers can perform production AB testing against the experimental 
branch to ensure the correctness of certain table updates.
+3. Data producers can perform test load in a table in an experimental branch, 
and then append all the loaded files back to the main branch (similar to Git 
cherry-pick).
+
+!!!Note
+    Iceberg does not plan to offer a Git-like merge operation through 
branching.
+    Merging arbitrary changes requires a lot of work to keep track of the 
intent of the commit and the context. 
+    Merging in a table is actually committing a transaction. The expectation 
is different from a merge in Git, where the lack of a conflict is the 
definition of "correct". 
+    In a table, the lack of a file conflict does not mean that the transaction 
can be committed.
+    In addition, longer transaction lengths from branch-like behavior 
dramatically increases the likelihood that the transaction could fail.
+    The merge feature would likely be supported through multi-table 
transaction in the future.
+
+## Snapshot Reference
+
+In version control systems like git, branch and tag are both references of 
commits.
+In Iceberg, we use a similar concept of **Snapshot Reference** to implement 
branching and tagging.
+
+Each Iceberg table metadata contains a list of `refs` (references), and a 
`current-branch` indicating the current branch to use.
+When user creates an Iceberg table, the first commit belongs to the default 
`main` branch.
+Each snapshot reference has a uniquely identifiable name across all references 
of a table.
+A snapshot can have multiple references. The exact snapshot reference spec is 
documented at the [Spec](../spec/#snapshot-reference) page.
+Here we will provide some more explanations to the concepts in snapshot 
reference.
+
+### Reference Type
+
+There are clearly 2 types of snapshot reference, which are `branch` and `tag`. 
Their key differences are:
+
+- **New commit**: when a new snapshot is added as a child of a referenced 
snapshot, tag remains on the old snapshot, but branch reference moves to the 
child.
+
+- **Retention policy**: retention policy affects all the snapshots in a 
branch, but only a single tagged snapshot. (More details in the next section)
+
+### Retention Policy
+
+Iceberg offers a [snapshot expiration 
procedure](../spark-procedures/#expire_snapshots) to clean up snapshots that 
are not needed to free up storage space.
+Retention policy can be configured both globally and on snapshot reference to 
provide highly flexible customization to the expiration behavior.
+
+#### Global snapshot retention policy
+
+Global snapshot retention policy can be set through the following table 
properties:
+
+| Property                             | Default            | Description                                                      |
+| ------------------------------------ | ------------------ | ---------------------------------------------------------------- |
+| history.expire.max-snapshot-age-ms   | 432000000 (5 days) | Default max age of snapshots to keep while expiring snapshots    |
+| history.expire.min-snapshots-to-keep | 1                  | Default min number of snapshots to keep while expiring snapshots |
+
+#### Snapshot reference retention policy
+
+Similarly, snapshot reference has the properties below to provider finer grain 
control:
+
+| Property                     | Type      | Description |
+|------------------------------|-----------|-------------|
+| **`min-snapshots-to-keep`**  | `int`     | For `branch` type only, the minimum number of snapshots to keep in a branch |

Review comment:
       How do we specify the branch name in the property?






