(iceberg) branch main updated: Spec: Adds Row Lineage (#11130)

blue Wed, 23 Oct 2024 15:00:20 -0700

This is an automated email from the ASF dual-hosted git repository.

blue pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git



The following commit(s) were added to refs/heads/main by this push:
     new 02a988b09a Spec: Adds Row Lineage (#11130)
02a988b09a is described below

commit 02a988b09ab5b6e9aaa47c79e5e131313cf983cc
Author: Russell Spitzer <[email protected]>
AuthorDate: Thu Oct 24 05:59:18 2024 +0800

    Spec: Adds Row Lineage (#11130)
    
    Co-authored-by: Ryan Blue <[email protected]>
---
 format/spec.md | 289 ++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 203 insertions(+), 86 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index 2974430a37..601cbcc3bc 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -51,7 +51,7 @@ Version 3 of the Iceberg spec extends data types and existing 
metadata structure
 * New data types: nanosecond timestamp(tz), unknown
 * Default value support for columns
 * Multi-argument transforms for partitioning and sorting
-
+* Row Lineage tracking
 
 ## Goals
 
@@ -322,16 +322,101 @@ Iceberg tables must not use field ids greater than 
2147483447 (`Integer.MAX_VALU
 
 The set of metadata columns is:
 
-| Field id, name              | Type          | Description |
-|-----------------------------|---------------|-------------|
-| **`2147483646  _file`**     | `string`      | Path of the file in which a 
row is stored |
-| **`2147483645  _pos`**      | `long`        | Ordinal position of a row in 
the source data file |
-| **`2147483644  _deleted`**  | `boolean`     | Whether the row has been 
deleted |
-| **`2147483643  _spec_id`**  | `int`         | Spec ID used to track the file 
containing a row |
-| **`2147483642  _partition`** | `struct`     | Partition to which a row 
belongs |
-| **`2147483546  file_path`** | `string`      | Path of a file, used in 
position-based delete files |
-| **`2147483545  pos`**       | `long`        | Ordinal position of a row, 
used in position-based delete files |
-| **`2147483544  row`**       | `struct<...>` | Deleted row values, used in 
position-based delete files |
+| Field id, name                   | Type          | Description               
                                                                             |
+|----------------------------------|---------------|--------------------------------------------------------------------------------------------------------|
+| **`2147483646  _file`**          | `string`      | Path of the file in which 
a row is stored                                                              |
+| **`2147483645  _pos`**           | `long`        | Ordinal position of a row 
in the source data file, starting at `0`                                     |
+| **`2147483644  _deleted`**       | `boolean`     | Whether the row has been 
deleted                                                                       |
+| **`2147483643  _spec_id`**       | `int`         | Spec ID used to track the 
file containing a row                                                        |
+| **`2147483642  _partition`**     | `struct`      | Partition to which a row 
belongs                                                                       |
+| **`2147483546  file_path`**      | `string`      | Path of a file, used in 
position-based delete files                                                    |
+| **`2147483545  pos`**            | `long`        | Ordinal position of a 
row, used in position-based delete files                                        
 |
+| **`2147483544  row`**            | `struct<...>` | Deleted row values, used 
in position-based delete files                                                |
+| **`2147483543  _row_id`**        | `long`        | A unique long assigned 
when row-lineage is enabled, see [Row Lineage](#row-lineage)                    
|
+| **`2147483542  _last_updated_sequence_number`**   | `long`        | The 
sequence number which last updated this row when row-lineage is enabled [Row 
Lineage](#row-lineage) |
+
+### Row Lineage
+
+In v3 and later, an Iceberg table can track row lineage fields for all newly 
created rows.  Row lineage is enabled by setting the field `row-lineage` to 
true in the table's metadata. When enabled, engines must maintain the 
`next-row-id` table field and the following row-level fields when writing data 
files:
+
+* `_row_id` a unique long identifier for every row within the table. The value 
is assigned via inheritance when a row is first added to the table and the 
existing value is explicitly written when the row is copied into a new file.
+* `_last_updated_sequence_number` the sequence number of the commit that last 
updated a row. The value is inherited when a row is first added or modified and 
the existing value is explicitly written when the row is written to a different 
data file but not modified.
+
+These fields are assigned and updated by inheritance because the commit 
sequence number and starting row ID are not assigned until the snapshot is 
successfully committed. Inheritance is used to allow writing data and manifest 
files before values are known so that it is not necessary to rewrite data and 
manifest files when an optimistic commit is retried.
+
+When row lineage is enabled, new snapshots cannot include [Equality 
Deletes](#equality-delete-files). Row lineage is incompatible with equality 
deletes because lineage values must be maintained, but equality deletes are 
used to avoid reading existing data before writing changes.
+
+
+#### Row lineage assignment
+
+Row lineage fields are written when row lineage is enabled. When not enabled, 
row lineage fields (`_row_id` and `_last_updated_sequence_number`) must not be 
written to data files. The rest of this section applies when row lineage is 
enabled.
+
+When a row is added or modified, the `_last_updated_sequence_number` field is 
set to `null` so that it is inherited when reading. Similarly, the `_row_id` 
field for an added row is set to `null` and assigned when reading.
+
+A data file with only new rows for the table may omit the 
`_last_updated_sequence_number` and `_row_id`. If the columns are missing, 
readers should treat both columns as if they exist and are set to null for all 
rows.
+
+On read, if `_last_updated_sequence_number` is `null` it is assigned the 
`sequence_number` of the data file's manifest entry. The data sequence number 
of a data file is documented in [Sequence Number 
Inheritance](#sequence-number-inheritance).
+
+When `null`, a row's `_row_id` field is assigned to the `first_row_id` from 
its containing data file plus the row position in that data file (`_pos`). A 
data file's `first_row_id` field is assigned using inheritance and is 
documented in [First Row ID Inheritance](#first-row-id-inheritance). A 
manifest's `first_row_id` is assigned when writing the manifest list for a 
snapshot and is documented in [First Row ID 
Assignment](#first-row-id-assignment). A snapshot's `first-row-id` is set to th 
[...]
+
+Values for `_row_id` and `_last_updated_sequence_number` are either read from 
the data file or assigned at read time. As a result on read, rows in a table 
always have non-null values for these fields when lineage is enabled.
+
+When an existing row is moved to a different data file for any reason, writers 
are required to write `_row_id` and `_last_updated_sequence_number` according 
to the following rules:
+
+1. The row's existing non-null `_row_id` must be copied into the new data file
+2. If the write has modified the row, the `_last_updated_sequence_number` 
field must be set to `null` (so that the modification's sequence number 
replaces the current value)
+3. If the write has not modified the row, the existing non-null 
`_last_updated_sequence_number` value must be copied to the new data file
+
+
+#### Row lineage example
+
+This example demonstrates how `_row_id` and `_last_updated_sequence_number` 
are assigned for a snapshot when row lineage is enabled. This starts with a 
table with row lineage enabled and a `next-row-id` of 1000.
+
+Writing a new append snapshot would create snapshot metadata with 
`first-row-id` assigned to the table's `next-row-id`:
+
+```json
+{
+  "operation": "append",
+  "first-row-id": 1000,
+  ...
+}
+```
+
+The snapshot's manifest list would contain existing manifests, plus new 
manifests with an assigned `first_row_id` based on the `added_rows_count` of 
previously listed added manifests:
+
+| `manifest_path` | `added_rows_count` | `existing_rows_count` | 
`first_row_id`     |
+|-----------------|--------------------|-----------------------|--------------------|
+| ...             | ...                | ...                   | ...           
     |
+| existing        | 75                 | 0                     | 925           
     |
+| added1          | 100                | 25                    | 1000          
     |
+| added2          | 0                  | 100                   | 1100          
     |
+| added3          | 125                | 25                    | 1100          
     |
+
+The first added file, `added1`, is assigned the same `first_row_id` as the 
snapshot and the following manifests are assigned `first_row_id` based on the 
number of rows added by the previously listed manifests. The second file, 
`added2`, does not change the `first_row_id` of the next manifest because it 
contains no added data files.
+
+Within `added1`, the first added manifest, each data file's `first_row_id` 
follows a similar pattern:
+
+| `status` | `file_path` | `record_count` | `first_row_id` |
+|----------|-------------|----------------|----------------|
+| EXISTING | data1       | 25             | 800            |
+| ADDED    | data2       | 50             | null (1000)    |
+| ADDED    | data3       | 50             | null (1050)    |
+
+The `first_row_id` of the EXISTING file `data1` was already assigned, so the 
file metadata was copied into manifest `added1`.
+
+Files `data2` and `data3` are written with `null` for `first_row_id` and are 
assigned `first_row_id` at read time based on the manifest's `first_row_id` and 
the `record_count` of previously listed ADDED files in this manifest: (1,000 + 
0) and (1,000 + 50).
+
+When the new snapshot is committed, the table's `next-row-id` must also be 
updated (even if the new snapshot is not in the main branch). Because 225 rows 
were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 
1,000 + 225 = 1,225:
+
+
+### Enabling Row Lineage for Non-empty Tables
+
+Any snapshot without the field `first-row-id` does not have any lineage 
information and values for `_row_id` and `_last_updated_sequence_number` cannot 
be assigned accurately.  
+
+All files that were added before `row-lineage` was enabled should propagate 
null for all of the `row-lineage` related
+fields. The values for `_row_id` and `_last_updated_sequence_number` should 
always return null and when these rows are copied, 
+null should be explicitly written. After this point, rows are treated as if 
they were just created 
+and assigned `row_id` and `_last_updated_sequence_number` as if they were new 
rows.
 
 
 ## Partitioning
@@ -478,29 +563,29 @@ The schema of a manifest file is a struct called 
`manifest_entry` with the follo
 
 `data_file` is a struct with the following fields:
 
-| v1         | v2         | Field id, name                    | Type           
              | Description |
-| ---------- | ---------- 
|-----------------------------------|------------------------------|-------------|
-|            | _required_ | **`134  content`**                | `int` with 
meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of 
content stored by the data file: data, equality deletes, or position deletes 
(all v1 files are data files) |
-| _required_ | _required_ | **`100  file_path`**              | `string`       
              | Full URI for the file with FS scheme |
-| _required_ | _required_ | **`101  file_format`**            | `string`       
              | String file format name, avro, orc or parquet |
-| _required_ | _required_ | **`102  partition`**              | `struct<...>`  
              | Partition data tuple, schema based on the partition spec output 
using partition field ids for the struct field ids |
-| _required_ | _required_ | **`103  record_count`**           | `long`         
              | Number of records in this file |
-| _required_ | _required_ | **`104  file_size_in_bytes`**     | `long`         
              | Total file size in bytes |
-| _required_ |            | ~~**`105 block_size_in_bytes`**~~ | `long`         
              | **Deprecated. Always write a default in v1. Do not write in 
v2.** |
-| _optional_ |            | ~~**`106  file_ordinal`**~~       | `int`          
              | **Deprecated. Do not write.** |
-| _optional_ |            | ~~**`107  sort_columns`**~~       | `list<112: 
int>`             | **Deprecated. Do not write.** |
-| _optional_ | _optional_ | **`108  column_sizes`**           | `map<117: int, 
118: long>`   | Map from column id to the total size on disk of all regions 
that store the column. Does not include bytes necessary to read other columns, 
like footers. Leave null for row-oriented formats (Avro) |
-| _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 
120: long>`   | Map from column id to number of values in the column (including 
null and NaN values) |
-| _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 
122: long>`   | Map from column id to number of null values in the column |
-| _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 
139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 
124: long>`   | Map from column id to number of distinct values in the column; 
distinct counts must be derived using values in the file by counting or using 
sketches, but not using methods like merging existing distinct counts |
-| _optional_ | _optional_ | **`125  lower_bounds`**           | `map<126: int, 
127: binary>` | Map from column id to lower bound in the column serialized as 
binary [1]. Each value must be less than or equal to all non-null, non-NaN 
values in the column for the file [2] |
-| _optional_ | _optional_ | **`128  upper_bounds`**           | `map<129: int, 
130: binary>` | Map from column id to upper bound in the column serialized as 
binary [1]. Each value must be greater than or equal to all non-null, non-Nan 
values in the column for the file [2] |
-| _optional_ | _optional_ | **`131  key_metadata`**           | `binary`       
              | Implementation-specific key metadata for encryption |
-| _optional_ | _optional_ | **`132  split_offsets`**          | `list<133: 
long>`            | Split offsets for the data file. For example, all row group 
offsets in a Parquet file. Must be sorted ascending |
-|            | _optional_ | **`135  equality_ids`**           | `list<136: 
int>`             | Field ids used to determine row equality in equality delete 
files. Required when `content=2` and should be null otherwise. Fields with ids 
listed in this column must be present in the delete file |
-| _optional_ | _optional_ | **`140  sort_order_id`**          | `int`          
              | ID representing sort order for this file [3]. |
-
+| v1         | v2         | v3         | Field id, name                    | 
Type                                                                        | 
Description                                                                     
                                                                                
                                                   |
+| ---------- 
|------------|------------|-----------------------------------|-----------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+|            | _required_ | _required_ | **`134  content`**                | 
`int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | 
Type of content stored by the data file: data, equality deletes, or position 
deletes (all v1 files are data files)                                           
                                                      |
+| _required_ | _required_ | _required_ | **`100  file_path`**              | 
`string`                                                                    | 
Full URI for the file with FS scheme                                            
                                                                                
                                                   |
+| _required_ | _required_ | _required_ | **`101  file_format`**            | 
`string`                                                                    | 
String file format name, avro, orc or parquet                                   
                                                                                
                                                   |
+| _required_ | _required_ | _required_ | **`102  partition`**              | 
`struct<...>`                                                               | 
Partition data tuple, schema based on the partition spec output using partition 
field ids for the struct field ids                                              
                                                   |
+| _required_ | _required_ | _required_ | **`103  record_count`**           | 
`long`                                                                      | 
Number of records in this file                                                  
                                                                                
                                                   |
+| _required_ | _required_ | _required_ | **`104  file_size_in_bytes`**     | 
`long`                                                                      | 
Total file size in bytes                                                        
                                                                                
                                                   |
+| _required_ |            |            | ~~**`105 block_size_in_bytes`**~~ | 
`long`                                                                      | 
**Deprecated. Always write a default in v1. Do not write in v2 or v3.**         
                                                                                
                                                   |
+| _optional_ |            |            | ~~**`106  file_ordinal`**~~       | 
`int`                                                                       | 
**Deprecated. Do not write.**                                                   
                                                                                
                                                   |
+| _optional_ |            |            | ~~**`107  sort_columns`**~~       | 
`list<112: int>`                                                            | 
**Deprecated. Do not write.**                                                   
                                                                                
                                                   |
+| _optional_ | _optional_ | _optional_ | **`108  column_sizes`**           | 
`map<117: int, 118: long>`                                                  | 
Map from column id to the total size on disk of all regions that store the 
column. Does not include bytes necessary to read other columns, like footers. 
Leave null for row-oriented formats (Avro)                |
+| _optional_ | _optional_ | _optional_ | **`109  value_counts`**           | 
`map<119: int, 120: long>`                                                  | 
Map from column id to number of values in the column (including null and NaN 
values)                                                                         
                                                      |
+| _optional_ | _optional_ | _optional_ | **`110  null_value_counts`**      | 
`map<121: int, 122: long>`                                                  | 
Map from column id to number of null values in the column                       
                                                                                
                                                   |
+| _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`**       | 
`map<138: int, 139: long>`                                                  | 
Map from column id to number of NaN values in the column                        
                                                                                
                                                   |
+| _optional_ | _optional_ | _optional_ | **`111  distinct_counts`**        | 
`map<123: int, 124: long>`                                                  | 
Map from column id to number of distinct values in the column; distinct counts 
must be derived using values in the file by counting or using sketches, but not 
using methods like merging existing distinct counts |
+| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**           | 
`map<126: int, 127: binary>`                                                | 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]                                     |
+| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**           | 
`map<129: int, 130: binary>`                                                | 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]                                  |
+| _optional_ | _optional_ | _optional_ | **`131  key_metadata`**           | 
`binary`                                                                    | 
Implementation-specific key metadata for encryption                             
                                                                                
                                                   |
+| _optional_ | _optional_ | _optional_ | **`132  split_offsets`**          | 
`list<133: long>`                                                           | 
Split offsets for the data file. For example, all row group offsets in a 
Parquet file. Must be sorted ascending                                          
                                                          |
+|            | _optional_ | _optional_ | **`135  equality_ids`**           | 
`list<136: int>`                                                            | 
Field ids used to determine row equality in equality delete files. Required 
when `content=2` and should be null otherwise. Fields with ids listed in this 
column must be present in the delete file                |
+| _optional_ | _optional_ | _optional_ | **`140  sort_order_id`**          | 
`int`                                                                       | 
ID representing sort order for this file [3].                                   
                                                                                
                                                   |
+|            |            | _optional_ | **`142  first_row_id`**           | 
`long`                                                                      | 
The `_row_id` for the first row in the data file. See [First Row ID 
Inheritance](#first-row-id-inheritance)                                         
                                                                        |
 Notes:
 
 1. Single-value serialization for lower and upper bounds is detailed in 
Appendix D.
@@ -544,21 +629,31 @@ Inheriting sequence numbers through the metadata tree 
allows writing a new manif
 
 When reading v1 manifests with no sequence number column, sequence numbers for 
all files must default to 0.
 
+### First Row ID Inheritance
+
+Row ID inheritance is used when row lineage is enabled. When not enabled, a 
data file's `first_row_id` must always be set to `null`. The rest of this 
section applies when row lineage is enabled.
+
+When adding a new data file, its `first_row_id` field is set to `null` because 
it is not assigned until the snapshot is successfully committed.
+
+When reading, the `first_row_id` is assigned by replacing `null` with the 
manifest's `first_row_id` plus the sum of `record_count` for all added data 
files that preceded the file in the manifest.
+
+The `first_row_id` is only inherited for added data files. The inherited value 
must be written into the data file metadata for existing and deleted entries. 
The value of `first_row_id` for delete files is always `null`.
 
 ## Snapshots
 
 A snapshot consists of the following fields:
 
-| v1         | v2         | Field                    | Description |
-| ---------- | ---------- | ------------------------ | ----------- |
-| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
-| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the 
snapshot's parent. Omitted for any snapshot with no parent |
-|            | _required_ | **`sequence-number`**    | A monotonically 
increasing long that tracks the order of changes to a table |
-| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the 
snapshot was created, used for garbage collection and table inspection |
-| _optional_ | _required_ | **`manifest-list`**      | The location of a 
manifest list for this snapshot that tracks manifest files with additional 
metadata |
-| _optional_ |            | **`manifests`**          | A list of manifest file 
locations. Must be omitted if `manifest-list` is present |
-| _optional_ | _required_ | **`summary`**            | A string map that 
summarizes the snapshot changes, including `operation` (see below) |
-| _optional_ | _optional_ | **`schema-id`**          | ID of the table's 
current schema when the snapshot was created |
+| v1         | v2         | v3         | Field                        | 
Description                                                                     
                                                   |
+| ---------- | ---------- 
|------------|------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | _required_ | **`snapshot-id`**            | A 
unique long ID                                                                  
                                                 |
+| _optional_ | _optional_ | _optional_ | **`parent-snapshot-id`**     | The 
snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent   
                                               |
+|            | _required_ | _required_ | **`sequence-number`**        | A 
monotonically increasing long that tracks the order of changes to a table       
                                                 |
+| _required_ | _required_ | _required_ | **`timestamp-ms`**           | A 
timestamp when the snapshot was created, used for garbage collection and table 
inspection                                        |
+| _optional_ | _required_ | _required_ | **`manifest-list`**          | The 
location of a manifest list for this snapshot that tracks manifest files with 
additional metadata                              |
+| _optional_ |            |            | **`manifests`**              | A list 
of manifest file locations. Must be omitted if `manifest-list` is present       
                                            |
+| _optional_ | _required_ | _required_ | **`summary`**                | A 
string map that summarizes the snapshot changes, including `operation` (see 
below)                                               |
+| _optional_ | _optional_ | _optional_ | **`schema-id`**              | ID of 
the table's current schema when the snapshot was created                        
                                             |
+|            |            | _optional_ | **`first-row-id`**           | The 
first `_row_id` assigned to the first row in the first data file in the first 
manifest, see [Row Lineage](#row-lineage) |
 
 The snapshot summary's `operation` field is used by some operations, like 
snapshot expiration, to skip processing certain snapshots. Possible `operation` 
values are:
 
@@ -578,6 +673,15 @@ Manifests for a snapshot are tracked by a manifest list.
 Valid snapshots are stored as a list in table metadata. For serialization, see 
Appendix C.
 
 
+### Snapshot Row IDs
+
+When row lineage is not enabled, `first-row-id` must be omitted. The rest of 
this section applies when row lineage is enabled.
+
+A snapshot's `first-row-id` is assigned to the table's current `next-row-id` 
on each commit attempt. If a commit is retried, the `first-row-id` must be 
reassigned. If a commit contains no new rows, `first-row-id` should be omitted.
+
+The snapshot's `first-row-id` is the starting `first_row_id` assigned to 
manifests in the snapshot's manifest list.
+
+
 ### Manifest Lists
 
 Snapshots are embedded in table metadata, but the list of manifests for a 
snapshot are stored in a separate manifest list file.
@@ -590,23 +694,24 @@ A manifest list is a valid Iceberg data file: files must 
use valid Iceberg forma
 
 Manifest list files store `manifest_file`, a struct with the following fields:
 
-| v1         | v2         | Field id, name                 | Type              
                          | Description |
-| ---------- | ---------- 
|--------------------------------|---------------------------------------------|-------------|
-| _required_ | _required_ | **`500 manifest_path`**        | `string`          
                          | Location of the manifest file |
-| _required_ | _required_ | **`501 manifest_length`**      | `long`            
                          | Length of the manifest file in bytes |
-| _required_ | _required_ | **`502 partition_spec_id`**    | `int`             
                          | ID of a partition spec used to write the manifest; 
must be listed in table metadata `partition-specs` |
-|            | _required_ | **`517 content`**              | `int` with 
meaning: `0: data`, `1: deletes` | The type of files tracked by the manifest, 
either data or delete files; 0 for all v1 manifests |
-|            | _required_ | **`515 sequence_number`**      | `long`            
                          | The sequence number when the manifest was added to 
the table; use 0 when reading v1 manifest lists |
-|            | _required_ | **`516 min_sequence_number`**  | `long`            
                          | The minimum data sequence number of all live data 
or delete files in the manifest; use 0 when reading v1 manifest lists |
-| _required_ | _required_ | **`503 added_snapshot_id`**    | `long`            
                          | ID of the snapshot where the  manifest file was 
added |
-| _optional_ | _required_ | **`504 added_files_count`**    | `int`             
                          | Number of entries in the manifest that have status 
`ADDED` (1), when `null` this is assumed to be non-zero |
-| _optional_ | _required_ | **`505 existing_files_count`** | `int`             
                          | Number of entries in the manifest that have status 
`EXISTING` (0), when `null` this is assumed to be non-zero |
-| _optional_ | _required_ | **`506 deleted_files_count`**  | `int`             
                          | Number of entries in the manifest that have status 
`DELETED` (2), when `null` this is assumed to be non-zero |
-| _optional_ | _required_ | **`512 added_rows_count`**     | `long`            
                          | Number of rows in all of files in the manifest that 
have status `ADDED`, when `null` this is assumed to be non-zero |
-| _optional_ | _required_ | **`513 existing_rows_count`**  | `long`            
                          | Number of rows in all of files in the manifest that 
have status `EXISTING`, when `null` this is assumed to be non-zero |
-| _optional_ | _required_ | **`514 deleted_rows_count`**   | `long`            
                          | Number of rows in all of files in the manifest that 
have status `DELETED`, when `null` this is assumed to be non-zero |
-| _optional_ | _optional_ | **`507 partitions`**           | `list<508: 
field_summary>` (see below)      | A list of field summaries for each partition 
field in the spec. Each field in the list corresponds to a field in the 
manifest file’s partition spec. |
-| _optional_ | _optional_ | **`519 key_metadata`**         | `binary`          
                          | Implementation-specific key metadata for encryption 
|
+| v1         | v2         | v3         | Field id, name                   | 
Type                                        | Description                       
                                                                                
                                   |
+| ---------- | ---------- 
|------------|----------------------------------|---------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | _required_ | **`500 manifest_path`**          | 
`string`                                    | Location of the manifest file     
                                                                                
                                   |
+| _required_ | _required_ | _required_ | **`501 manifest_length`**        | 
`long`                                      | Length of the manifest file in 
bytes                                                                           
                                      |
+| _required_ | _required_ | _required_ | **`502 partition_spec_id`**      | 
`int`                                       | ID of a partition spec used to 
write the manifest; must be listed in table metadata `partition-specs`          
                                      |
+|            | _required_ | _required_ | **`517 content`**                | 
`int` with meaning: `0: data`, `1: deletes` | The type of files tracked by the 
manifest, either data or delete files; 0 for all v1 manifests                   
                                    |
+|            | _required_ | _required_ | **`515 sequence_number`**        | 
`long`                                      | The sequence number when the 
manifest was added to the table; use 0 when reading v1 manifest lists           
                                        |
+|            | _required_ | _required_ | **`516 min_sequence_number`**    | 
`long`                                      | The minimum data sequence number 
of all live data or delete files in the manifest; use 0 when reading v1 
manifest lists                              |
+| _required_ | _required_ | _required_ | **`503 added_snapshot_id`**      | 
`long`                                      | ID of the snapshot where the  
manifest file was added                                                         
                                       |
+| _optional_ | _required_ | _required_ | **`504 added_files_count`**      | 
`int`                                       | Number of entries in the manifest 
that have status `ADDED` (1), when `null` this is assumed to be non-zero        
                                   |
+| _optional_ | _required_ | _required_ | **`505 existing_files_count`**   | 
`int`                                       | Number of entries in the manifest 
that have status `EXISTING` (0), when `null` this is assumed to be non-zero     
                                   |
+| _optional_ | _required_ | _required_ | **`506 deleted_files_count`**    | 
`int`                                       | Number of entries in the manifest 
that have status `DELETED` (2), when `null` this is assumed to be non-zero      
                                   |
+| _optional_ | _required_ | _required_ | **`512 added_rows_count`**       | 
`long`                                      | Number of rows in all of files in 
the manifest that have status `ADDED`, when `null` this is assumed to be 
non-zero                                  |
+| _optional_ | _required_ | _required_ | **`513 existing_rows_count`**    | 
`long`                                      | Number of rows in all of files in 
the manifest that have status `EXISTING`, when `null` this is assumed to be 
non-zero                               |
+| _optional_ | _required_ | _required_ | **`514 deleted_rows_count`**     | 
`long`                                      | Number of rows in all of files in 
the manifest that have status `DELETED`, when `null` this is assumed to be 
non-zero                                |
+| _optional_ | _optional_ | _optional_ | **`507 partitions`**             | 
`list<508: field_summary>` (see below)      | A list of field summaries for 
each partition field in the spec. Each field in the list corresponds to a field 
in the manifest file’s partition spec. |
+| _optional_ | _optional_ | _optional_ | **`519 key_metadata`**           | 
`binary`                                    | Implementation-specific key 
metadata for encryption                                                         
                                         |
+|            |            | _optional_ | **`520 first_row_id`**           | 
`long`                                      | The starting `_row_id` to assign 
to rows added by `ADDED` data files [First Row ID 
Assignment](#first-row-id-assignment)                               |
 
 `field_summary` is a struct with the following fields:
 
@@ -622,6 +727,14 @@ Notes:
 1. Lower and upper bounds are serialized to bytes using the single-object 
serialization in Appendix D. The type of used to encode the value is the type 
of the partition field data.
 2. If -0.0 is a value of the partition field, the `lower_bound` must not be 
+0.0, and if +0.0 is a value of the partition field, the `upper_bound` must not 
be -0.0.
 
+#### First Row ID Assignment
+
+Row ID inheritance is used when row lineage is enabled. When not enabled, a 
manifest's `first_row_id` must always be set to `null`. Once enabled, row 
lineage cannot be disabled. The rest of this section applies when row lineage 
is enabled.
+
+When adding a new data manifest file, its `first_row_id` field is assigned the 
value of the snapshot's `first_row_id` plus the sum of `added_rows_count` for 
all data manifests that preceded the manifest in the manifest list.
+
+The `first_row_id` is only assigned for new data manifests. Values for 
existing manifests must be preserved when writing a new manifest list. The 
value of `first_row_id` for delete manifests is always `null`.
+
 ### Scan Planning
 
 Scans are planned by reading the manifest files for the current snapshot. 
Deleted entries in data and delete manifests (those marked with status 
"DELETED") are not used in a scan.
@@ -708,34 +821,38 @@ The atomic operation used to commit metadata depends on 
how tables are tracked a
 
 Table metadata consists of the following fields:
 
-| v1         | v2         | Field | Description |
-| ---------- | ---------- | ----- | ----------- |
-| _required_ | _required_ | **`format-version`** | An integer version number 
for the format. Currently, this can be 1 or 2 based on the spec. 
Implementations must throw an exception if a table's version is higher than the 
supported version. |
-| _optional_ | _required_ | **`table-uuid`** | A UUID that identifies the 
table, generated when the table is created. Implementations must throw an 
exception if a table's UUID does not match the expected UUID after refreshing 
metadata. |
-| _required_ | _required_ | **`location`**| The table's base location. This is 
used by writers to determine where to store data files, manifest files, and 
table metadata files. |
-|            | _required_ | **`last-sequence-number`**| The table's highest 
assigned sequence number, a monotonically increasing long that tracks the order 
of snapshots in a table. |
-| _required_ | _required_ | **`last-updated-ms`**| Timestamp in milliseconds 
from the unix epoch when the table was last updated. Each table metadata file 
should update this field just before writing. |
-| _required_ | _required_ | **`last-column-id`**| An integer; the highest 
assigned column ID for the table. This is used to ensure columns are always 
assigned an unused ID when evolving schemas. |
-| _required_ |            | **`schema`**| The table’s current schema. 
(**Deprecated**: use `schemas` and `current-schema-id` instead) |
-| _optional_ | _required_ | **`schemas`**| A list of schemas, stored as 
objects with `schema-id`. |
-| _optional_ | _required_ | **`current-schema-id`**| ID of the table's current 
schema. |
-| _required_ |            | **`partition-spec`**| The table’s current 
partition spec, stored as only fields. Note that this is used by writers to 
partition data, but is not used when reading because reads use the specs stored 
in manifest files. (**Deprecated**: use `partition-specs` and `default-spec-id` 
instead) |
-| _optional_ | _required_ | **`partition-specs`**| A list of partition specs, 
stored as full partition spec objects. |
-| _optional_ | _required_ | **`default-spec-id`**| ID of the "current" spec 
that writers should use by default. |
-| _optional_ | _required_ | **`last-partition-id`**| An integer; the highest 
assigned partition field ID across all partition specs for the table. This is 
used to ensure partition fields are always assigned an unused ID when evolving 
specs. |
-| _optional_ | _optional_ | **`properties`**| A string to string map of table 
properties. This is used to control settings that affect reading and writing 
and is not intended to be used for arbitrary metadata. For example, 
`commit.retry.num-retries` is used to control the number of commit retries. |
-| _optional_ | _optional_ | **`current-snapshot-id`**| `long` ID of the 
current table snapshot; must be the same as the current ID of the `main` branch 
in `refs`. |
-| _optional_ | _optional_ | **`snapshots`**| A list of valid snapshots. Valid 
snapshots are snapshots for which all data files exist in the file system. A 
data file must not be deleted from the file system until the last snapshot in 
which it was listed is garbage collected. |
-| _optional_ | _optional_ | **`snapshot-log`**| A list (optional) of timestamp 
and snapshot ID pairs that encodes changes to the current snapshot for the 
table. Each time the current-snapshot-id is changed, a new entry should be 
added with the last-updated-ms and the new current-snapshot-id. When snapshots 
are expired from the list of valid snapshots, all entries before a snapshot 
that has expired should be removed. |
-| _optional_ | _optional_ | **`metadata-log`**| A list (optional) of timestamp 
and metadata file location pairs that encodes changes to the previous metadata 
files for the table. Each time a new metadata file is created, a new entry of 
the previous metadata file location should be added to the list. Tables can be 
configured to remove oldest metadata log entries and keep a fixed-size log of 
the most recent entries after a commit. |
-| _optional_ | _required_ | **`sort-orders`**| A list of sort orders, stored 
as full sort order objects. |
-| _optional_ | _required_ | **`default-sort-order-id`**| Default sort order id 
of the table. Note that this could be used by writers, but is not used when 
reading because reads use the specs stored in manifest files. |
-|            | _optional_ | **`refs`** | A map of snapshot references. The map 
keys are the unique snapshot reference names in the table, and the map values 
are snapshot reference objects. There is always a `main` branch reference 
pointing to the `current-snapshot-id` even if the `refs` map is null. |
-| _optional_ | _optional_ | **`statistics`** | A list (optional) of [table 
statistics](#table-statistics). |
-| _optional_ | _optional_ | **`partition-statistics`**  | A list (optional) of 
[partition statistics](#partition-statistics). |
+| v1         | v2         | v3         | Field                       | 
Description                                                                     
                                                                                
                                                                                
                                                                                
                                                                 |
+| ---------- | ---------- 
|------------|-----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | _required_ | **`format-version`**        | An 
integer version number for the format. Currently, this can be 1 or 2 based on 
the spec. Implementations must throw an exception if a table's version is 
higher than the supported version.                                              
                                                                                
                                                                      |
+| _optional_ | _required_ | _required_ | **`table-uuid`**            | A UUID 
that identifies the table, generated when the table is created. Implementations 
must throw an exception if a table's UUID does not match the expected UUID 
after refreshing metadata.                                                      
                                                                                
                                                               |
+| _required_ | _required_ | _required_ | **`location`**              | The 
table's base location. This is used by writers to determine where to store data 
files, manifest files, and table metadata files.                                
                                                                                
                                                                                
                                                             |
+|            | _required_ | _required_ | **`last-sequence-number`**  | The 
table's highest assigned sequence number, a monotonically increasing long that 
tracks the order of snapshots in a table.                                       
                                                                                
                                                                                
                                                              |
+| _required_ | _required_ | _required_ | **`last-updated-ms`**       | 
Timestamp in milliseconds from the unix epoch when the table was last updated. 
Each table metadata file should update this field just before writing.          
                                                                                
                                                                                
                                                                  |
+| _required_ | _required_ | _required_ | **`last-column-id`**        | An 
integer; the highest assigned column ID for the table. This is used to ensure 
columns are always assigned an unused ID when evolving schemas.                 
                                                                                
                                                                                
                                                                |
+| _required_ |            |            | **`schema`**                | The 
table’s current schema. (**Deprecated**: use `schemas` and `current-schema-id` 
instead)                                                                        
                                                                                
                                                                                
                                                              |
+| _optional_ | _required_ | _required_ | **`schemas`**               | A list 
of schemas, stored as objects with `schema-id`.                                 
                                                                                
                                                                                
                                                                                
                                                          |
+| _optional_ | _required_ | _required_ | **`current-schema-id`**     | ID of 
the table's current schema.                                                     
                                                                                
                                                                                
                                                                                
                                                           |
+| _required_ |            |            | **`partition-spec`**        | The 
table’s current partition spec, stored as only fields. Note that this is used 
by writers to partition data, but is not used when reading because reads use 
the specs stored in manifest files. (**Deprecated**: use `partition-specs` and 
`default-spec-id` instead)                                                      
                                                                   |
+| _optional_ | _required_ | _required_ | **`partition-specs`**       | A list 
of partition specs, stored as full partition spec objects.                      
                                                                                
                                                                                
                                                                                
                                                          |
+| _optional_ | _required_ | _required_ | **`default-spec-id`**       | ID of 
the "current" spec that writers should use by default.                          
                                                                                
                                                                                
                                                                                
                                                           |
+| _optional_ | _required_ | _required_ | **`last-partition-id`**     | An 
integer; the highest assigned partition field ID across all partition specs for 
the table. This is used to ensure partition fields are always assigned an 
unused ID when evolving specs.                                                  
                                                                                
                                                                    |
+| _optional_ | _optional_ | _optional_ | **`properties`**            | A 
string to string map of table properties. This is used to control settings that 
affect reading and writing and is not intended to be used for arbitrary 
metadata. For example, `commit.retry.num-retries` is used to control the number 
of commit retries.                                                              
                                                                       |
+| _optional_ | _optional_ | _optional_ | **`current-snapshot-id`**   | `long` 
ID of the current table snapshot; must be the same as the current ID of the 
`main` branch in `refs`.                                                        
                                                                                
                                                                                
                                                              |
+| _optional_ | _optional_ | _optional_ | **`snapshots`**             | A list 
of valid snapshots. Valid snapshots are snapshots for which all data files 
exist in the file system. A data file must not be deleted from the file system 
until the last snapshot in which it was listed is garbage collected.            
                                                                                
                                                                |
+| _optional_ | _optional_ | _optional_ | **`snapshot-log`**          | A list 
(optional) of timestamp and snapshot ID pairs that encodes changes to the 
current snapshot for the table. Each time the current-snapshot-id is changed, a 
new entry should be added with the last-updated-ms and the new 
current-snapshot-id. When snapshots are expired from the list of valid 
snapshots, all entries before a snapshot that has expired should be removed.    
          |
+| _optional_ | _optional_ | _optional_ | **`metadata-log`**          | A list 
(optional) of timestamp and metadata file location pairs that encodes changes 
to the previous metadata files for the table. Each time a new metadata file is 
created, a new entry of the previous metadata file location should be added to 
the list. Tables can be configured to remove oldest metadata log entries and 
keep a fixed-size log of the most recent entries after a commit. |
+| _optional_ | _required_ | _required_ | **`sort-orders`**           | A list 
of sort orders, stored as full sort order objects.                              
                                                                                
                                                                                
                                                                                
                                                          |
+| _optional_ | _required_ | _required_ | **`default-sort-order-id`** | Default 
sort order id of the table. Note that this could be used by writers, but is not 
used when reading because reads use the specs stored in manifest files.         
                                                                                
                                                                                
                                                         |
+|            | _optional_ | _optional_ | **`refs`**                  | A map 
of snapshot references. The map keys are the unique snapshot reference names in 
the table, and the map values are snapshot reference objects. There is always a 
`main` branch reference pointing to the `current-snapshot-id` even if the 
`refs` map is null.                                                             
                                                                 |
+| _optional_ | _optional_ | _optional_ | **`statistics`**            | A list 
(optional) of [table statistics](#table-statistics).                            
                                                                                
                                                                                
                                                                                
                                                          |
+| _optional_ | _optional_ | _optional_ | **`partition-statistics`**  | A list 
(optional) of [partition statistics](#partition-statistics).                    
                                                                                
                                                                                
                                                                                
                                                          |
+|            |            | _optional_ | **`row-lineage`**           | A 
boolean, defaulting to false, setting whether or not to track the creation and 
updates to rows in the table. See [Row Lineage](#row-lineage).                  
                                                                                
                                                                                
                                                                |
+|            |            | _optional_ | **`next-row-id`**           | A value 
higher than all assigned row IDs; the next snapshot's `first-row-id`. See [Row 
Lineage](#row-lineage).                                                         
                                                                                
                                                                                
                                                          |
 
 For serialization details, see Appendix C.
 
+When a new snapshot is added, the table's `next-row-id` should be updated to 
the previous `next-row-id` plus the sum of `record_count` for all data files 
added in the snapshot (this is also equal to the sum of `added_rows_count` for 
all manifests added in the snapshot). This ensures that `next-row-id` is always 
higher than any assigned row ID in the table.
+
 ### Table Statistics
 
 Table statistics files are valid [Puffin files](puffin-spec.md). Statistics 
are informational. A reader can choose to

(iceberg) branch main updated: Spec: Adds Row Lineage (#11130)

Reply via email to