(iceberg) branch main updated: Docs: improve structure for manifest entry fields (#13333)

etudenhoefner Fri, 27 Jun 2025 09:42:02 -0700

This is an automated email from the ASF dual-hosted git repository.

etudenhoefner pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git



The following commit(s) were added to refs/heads/main by this push:
     new 069a6128fd Docs: improve structure for manifest entry fields (#13333)
069a6128fd is described below

commit 069a6128fd50e22eac404715ce83688b93867026
Author: Elphas Toringepi <etori...@amazon.co.uk>
AuthorDate: Fri Jun 27 17:40:16 2025 +0100

    Docs: improve structure for manifest entry fields (#13333)
---
 docs/docs/spark-queries.md |  4 ++--
 format/spec.md             | 55 ++++++++++++++++++++++++++--------------------
 2 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/docs/docs/spark-queries.md b/docs/docs/spark-queries.md
index eb3ed9708e..c289f5e2cb 100644
--- a/docs/docs/spark-queries.md
+++ b/docs/docs/spark-queries.md
@@ -301,12 +301,12 @@ SELECT * FROM prod.db.table.entries;
 
 Note:
 
-1. The columns of the `entries` table correspond to the fields of the `manifest_entry` struct (see the [manifest file schema](../../spec.md#manifests) for the full definition):
+1. The columns in the `entries` table correspond to the [manifest entry fields](../../spec.md#manifest-entry-fields):
     - `status`: Used to track additions and deletions
     - `snapshot_id`: The ID of the snapshot in which the file was added or removed
     - `sequence_number`: Used for ordering changes across snapshots
     - `file_sequence_number`: Indicates when the file was added
-    - `data_file`: A struct with metadata about the data file. The fields of the struct are defined in the [data_file schema](../../spec.md#manifests)
+    - `data_file`: A struct containing metadata about the data file, see the [data file fields](../../spec.md#data-file-fields)
 2. The `readable_metrics` column provides a human-readable map of extended column-level metrics derived from the `data_file` column, making it easier to inspect and debug file-level statistics.
 
 ### Files
diff --git a/format/spec.md b/format/spec.md
index 586e484439..7558bba8ec 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -617,7 +617,12 @@ A manifest file must store the partition spec and other metadata as properties i
 | _optional_ | _required_ | `format-version`    | Table format version number of the manifest as a string                      |
 |            | _required_ | `content`           | Type of content files tracked by the manifest: "data" or "deletes"           |
 
-The schema of a manifest file is a struct called `manifest_entry` with the following fields:
+The schema of a manifest file is defined by the `manifest_entry` struct, described in the following section.
+
+
+#### Manifest Entry Fields
+
+The `manifest_entry` struct consists of the following fields:
 
 | v1         | v2         | Field id, name                | Type                                                      | Description |
 | ---------- | ---------- |-------------------------------|-----------------------------------------------------------|-------------|
@@ -627,7 +632,25 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 |            | _optional_ | **`4  file_sequence_number`** | `long`                                                    | File sequence number indicating when the file was added. Inherited when null and status is 1 (added). |
 | _required_ | _required_ | **`2  data_file`**            | `data_file` `struct` (see below)                          | File path, partition tuple, metrics, ... |
 
-`data_file` is a struct with the following fields:
+The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct, defined below, is nested inside the manifest entry so that it can be easily passed to job planning without the manifest entry fields.
+
+When a file is added to the dataset, its manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
+
+When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
+
+Iceberg v2 adds data and file sequence numbers to the entry and makes the snapshot ID optional. Values for these fields are inherited from manifest metadata when `null`. That is, if the field is `null` for an entry, then the entry must inherit its value from the manifest file's metadata, stored in the manifest list.
+The `sequence_number` field represents the data sequence number and must never change after a file is added to the dataset. The data sequence number represents a relative age of the file content and should be used for planning which delete files apply to a data file.
+The `file_sequence_number` field represents the sequence number of the snapshot that added the file and must also remain unchanged upon assigning at commit. The file sequence number can't be used for pruning delete files as the data within the file may have an older data sequence number. 
+The data and file sequence numbers are inherited only if the entry status is 1 (added). If the entry status is 0 (existing) or 2 (deleted), the entry must include both sequence numbers explicitly.
+
+Notes:
+
+1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.  It is not recommended to add a deleted file back to a table. Adding a deleted file can lead to edge cases where incremental deletes can break table snapshots.
+2. Manifest list files are required in v2, so that the `sequence_number` and `snapshot_id` to inherit are always available.
+
+##### Data File Fields
+
+The `data_file` struct consists of the following fields:
 
 | v1         | v2         | v3         | Field id, name                    | Type                                                                        | Description |
 | ---------- |------------|------------|-----------------------------------|-----------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -656,6 +679,10 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 |            |            | _optional_ | **`144  content_offset`**         | `long`                                                                      | The offset in the file where the content starts [5] |
 |            |            | _optional_ | **`145  content_size_in_bytes`**  | `long`                                                                      | The length of a referenced content stored in the file; required if `content_offset` is present [5] |
 
+The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.
+
+The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan.
+
 Notes:
 
 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
@@ -665,6 +692,8 @@ Notes:
 5. The `content_offset` and `content_size_in_bytes` fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the `offset` and `length` stored in the Puffin footer for the deletion vector blob.
 6. The following field ids are reserved on `data_file`: 141.
 
+###### Bounds for Variant, Geometry, and Geography
+
 For Variant, values in the `lower_bounds` and `upper_bounds` maps store serialized Variant objects that contain lower or upper bounds respectively. The object keys for the bound-variants are normalized JSON path expressions that uniquely identify a field. The object values are primitive Variant representations of the lower or upper bound for that field. Including bounds for any field is optional and upper and lower bounds must have the same Variant type.
 
 Bounds for a field must be accurate for all non-null values of the field in a data file. Bounds for values within arrays must be accurate all values in the array. Bounds must not be written to describe values with mixed Variant types (other than null). For example, a "measurement" field that contains int64 and null values may have bounds, but if the field also contained a string value such as "n/a" or "0" then the field may not have bounds.
@@ -685,28 +714,6 @@ For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are both
 
 When calculating upper and lower bounds for `geometry` and `geography`, null or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) contributes a value to X but no values to Y, Z, or M dimension bounds. If a dimension has only null or NaN values, that dimension is omitted from the bounding box. If either the X or Y dimension is missing then the bounding box itself is not produced.
 
-The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.
-
-The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan.
-
-
-#### Manifest Entry Fields
-
-The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct is nested inside of the manifest entry so that it can be easily passed to job planning without the manifest entry fields.
-
-When a file is added to the dataset, its manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
-
-When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
-
-Iceberg v2 adds data and file sequence numbers to the entry and makes the snapshot ID optional. Values for these fields are inherited from manifest metadata when `null`. That is, if the field is `null` for an entry, then the entry must inherit its value from the manifest file's metadata, stored in the manifest list.
-The `sequence_number` field represents the data sequence number and must never change after a file is added to the dataset. The data sequence number represents a relative age of the file content and should be used for planning which delete files apply to a data file.
-The `file_sequence_number` field represents the sequence number of the snapshot that added the file and must also remain unchanged upon assigning at commit. The file sequence number can't be used for pruning delete files as the data within the file may have an older data sequence number. 
-The data and file sequence numbers are inherited only if the entry status is 1 (added). If the entry status is 0 (existing) or 2 (deleted), the entry must include both sequence numbers explicitly.
-
-Notes:
-
-1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.  It is not recommended to add a deleted file back to a table. Adding a deleted file can lead to edge cases where incremental deletes can break table snapshots.
-2. Manifest list files are required in v2, so that the `sequence_number` and `snapshot_id` to inherit are always available.
 
 #### Sequence Number Inheritance
