This is an automated email from the ASF dual-hosted git repository.

etudenhoefner pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git

The following commit(s) were added to refs/heads/main by this push:
     new 069a6128fd Docs: improve structure for manifest entry fields (#13333)
069a6128fd is described below

commit 069a6128fd50e22eac404715ce83688b93867026
Author: Elphas Toringepi <etori...@amazon.co.uk>
AuthorDate: Fri Jun 27 17:40:16 2025 +0100

    Docs: improve structure for manifest entry fields (#13333)
---
 docs/docs/spark-queries.md |  4 ++--
 format/spec.md             | 55 ++++++++++++++++++++++++++--------------------
 2 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/docs/docs/spark-queries.md b/docs/docs/spark-queries.md
index eb3ed9708e..c289f5e2cb 100644
--- a/docs/docs/spark-queries.md
+++ b/docs/docs/spark-queries.md
@@ -301,12 +301,12 @@ SELECT * FROM prod.db.table.entries;

 Note:

-1. The columns of the `entries` table correspond to the fields of the `manifest_entry` struct (see the [manifest file schema](../../spec.md#manifests) for the full definition):
+1. The columns in the `entries` table correspond to the [manifest entry fields](../../spec.md#manifest-entry-fields):
     - `status`: Used to track additions and deletions
     - `snapshot_id`: The ID of the snapshot in which the file was added or removed
     - `sequence_number`: Used for ordering changes across snapshots
     - `file_sequence_number`: Indicates when the file was added
-    - `data_file`: A struct with metadata about the data file. The fields of the struct are defined in the [data_file schema](../../spec.md#manifests)
+    - `data_file`: A struct containing metadata about the data file, see the [data file fields](../../spec.md#data-file-fields)
 2. The `readable_metrics` column provides a human-readable map of extended column-level metrics derived from the `data_file` column, making it easier to inspect and debug file-level statistics.

 ### Files
diff --git a/format/spec.md b/format/spec.md
index 586e484439..7558bba8ec 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -617,7 +617,12 @@ A manifest file must store the partition spec and other metadata as properties i
 | _optional_ | _required_ | `format-version` | Table format version number of the manifest as a string |
 |            | _required_ | `content`        | Type of content files tracked by the manifest: "data" or "deletes" |

-The schema of a manifest file is a struct called `manifest_entry` with the following fields:
+The schema of a manifest file is defined by the `manifest_entry` struct, described in the following section.
+
+
+#### Manifest Entry Fields
+
+The `manifest_entry` struct consists of the following fields:

 | v1         | v2         | Field id, name                | Type                                                      | Description |
 | ---------- | ---------- |-------------------------------|-----------------------------------------------------------|-------------|
@@ -627,7 +632,25 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 |            | _optional_ | **`4 file_sequence_number`**  | `long`                                                    | File sequence number indicating when the file was added. Inherited when null and status is 1 (added). |
 | _required_ | _required_ | **`2 data_file`**             | `data_file` `struct` (see below)                          | File path, partition tuple, metrics, ... |

-`data_file` is a struct with the following fields:
+The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct, defined below, is nested inside the manifest entry so that it can be easily passed to job planning without the manifest entry fields.
+
+When a file is added to the dataset, its manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
+
+When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
+
+Iceberg v2 adds data and file sequence numbers to the entry and makes the snapshot ID optional. Values for these fields are inherited from manifest metadata when `null`. That is, if the field is `null` for an entry, then the entry must inherit its value from the manifest file's metadata, stored in the manifest list.
+The `sequence_number` field represents the data sequence number and must never change after a file is added to the dataset. The data sequence number represents a relative age of the file content and should be used for planning which delete files apply to a data file.
+The `file_sequence_number` field represents the sequence number of the snapshot that added the file and must also remain unchanged upon assigning at commit. The file sequence number can't be used for pruning delete files as the data within the file may have an older data sequence number.
+The data and file sequence numbers are inherited only if the entry status is 1 (added). If the entry status is 0 (existing) or 2 (deleted), the entry must include both sequence numbers explicitly.
+
+Notes:
+
+1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires. It is not recommended to add a deleted file back to a table. Adding a deleted file can lead to edge cases where incremental deletes can break table snapshots.
+2. Manifest list files are required in v2, so that the `sequence_number` and `snapshot_id` to inherit are always available.
+
+##### Data File Fields
+
+The `data_file` struct consists of the following fields:

 | v1         | v2         | v3         | Field id, name                    | Type                                                                        | Description                                                                                                                                                                                                          |
 | ---------- |------------|------------|-----------------------------------|-----------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -656,6 +679,10 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 |            |            | _optional_ | **`144 content_offset`**          | `long`                                                                      | The offset in the file where the content starts [5] |
 |            |            | _optional_ | **`145 content_size_in_bytes`**   | `long`                                                                      | The length of a referenced content stored in the file; required if `content_offset` is present [5] |

+The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.
+
+The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan.
+
 Notes:

 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
@@ -665,6 +692,8 @@ Notes:
 5. The `content_offset` and `content_size_in_bytes` fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the `offset` and `length` stored in the Puffin footer for the deletion vector blob.
 6. The following field ids are reserved on `data_file`: 141.

+###### Bounds for Variant, Geometry, and Geography
+
 For Variant, values in the `lower_bounds` and `upper_bounds` maps store serialized Variant objects that contain lower or upper bounds respectively. The object keys for the bound-variants are normalized JSON path expressions that uniquely identify a field. The object values are primitive Variant representations of the lower or upper bound for that field. Including bounds for any field is optional and upper and lower bounds must have the same Variant type.

 Bounds for a field must be accurate for all non-null values of the field in a data file. Bounds for values within arrays must be accurate all values in the array. Bounds must not be written to describe values with mixed Variant types (other than null). For example, a "measurement" field that contains int64 and null values may have bounds, but if the field also contained a string value such as "n/a" or "0" then the field may not have bounds.
@@ -685,28 +714,6 @@ For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are both

 When calculating upper and lower bounds for `geometry` and `geography`, null or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) contributes a value to X but no values to Y, Z, or M dimension bounds. If a dimension has only null or NaN values, that dimension is omitted from the bounding box. If either the X or Y dimension is missing then the bounding box itself is not produced.

-The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.
-
-The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan.
-
-
-#### Manifest Entry Fields
-
-The manifest entry fields are used to keep track of the snapshot in which files were added or logically deleted. The `data_file` struct is nested inside of the manifest entry so that it can be easily passed to job planning without the manifest entry fields.
-
-When a file is added to the dataset, its manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).
-
-When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [1].
-
-Iceberg v2 adds data and file sequence numbers to the entry and makes the snapshot ID optional. Values for these fields are inherited from manifest metadata when `null`. That is, if the field is `null` for an entry, then the entry must inherit its value from the manifest file's metadata, stored in the manifest list.
-The `sequence_number` field represents the data sequence number and must never change after a file is added to the dataset. The data sequence number represents a relative age of the file content and should be used for planning which delete files apply to a data file.
-The `file_sequence_number` field represents the sequence number of the snapshot that added the file and must also remain unchanged upon assigning at commit. The file sequence number can't be used for pruning delete files as the data within the file may have an older data sequence number.
-The data and file sequence numbers are inherited only if the entry status is 1 (added). If the entry status is 0 (existing) or 2 (deleted), the entry must include both sequence numbers explicitly.
-
-Notes:
-
-1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires. It is not recommended to add a deleted file back to a table. Adding a deleted file can lead to edge cases where incremental deletes can break table snapshots.
-2. Manifest list files are required in v2, so that the `sequence_number` and `snapshot_id` to inherit are always available.

 #### Sequence Number Inheritance

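
For readers of this diff, the inheritance rules stated in the relocated manifest entry text (null values are inherited from the manifest's metadata in the manifest list; sequence numbers are inherited only when status is 1) can be sketched roughly as follows. This is an illustrative sketch, not code from the Iceberg project: the `ManifestEntry` and `ManifestListMetadata` classes, the `resolve_inherited` function, and the `added_snapshot_id` attribute name are all assumptions introduced here for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Status codes as described in the spec text quoted in the diff.
EXISTING, ADDED, DELETED = 0, 1, 2


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical in-memory view of the manifest entry fields."""
    status: int
    snapshot_id: Optional[int]
    sequence_number: Optional[int]
    file_sequence_number: Optional[int]


@dataclass(frozen=True)
class ManifestListMetadata:
    """Hypothetical holder for the values a manifest's entries may inherit."""
    added_snapshot_id: int
    sequence_number: int


def resolve_inherited(entry: ManifestEntry, meta: ManifestListMetadata) -> ManifestEntry:
    """Return a copy of `entry` with null fields filled from `meta`."""
    snapshot_id = entry.snapshot_id
    if snapshot_id is None:
        snapshot_id = meta.added_snapshot_id

    data_seq = entry.sequence_number
    file_seq = entry.file_sequence_number
    if data_seq is None or file_seq is None:
        # Sequence numbers are inherited only for status 1 (added);
        # existing and deleted entries must carry them explicitly.
        if entry.status != ADDED:
            raise ValueError("existing/deleted entries must store both sequence numbers")
        data_seq = meta.sequence_number if data_seq is None else data_seq
        file_seq = meta.sequence_number if file_seq is None else file_seq

    return replace(
        entry,
        snapshot_id=snapshot_id,
        sequence_number=data_seq,
        file_sequence_number=file_seq,
    )


if __name__ == "__main__":
    meta = ManifestListMetadata(added_snapshot_id=42, sequence_number=7)
    print(resolve_inherited(ManifestEntry(ADDED, None, None, None), meta))
```

Returning a new object rather than mutating the entry mirrors the requirement in the quoted text that the sequence numbers must not change once assigned.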