This is an automated email from the ASF dual-hosted git repository.
dweeks pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/main by this push:
new e7bd3fa214 [SPEC] Removing trailing whitespace (#14416)
e7bd3fa214 is described below
commit e7bd3fa214db616cab22243f8b8a6ec01ad5afdf
Author: emkornfield <[email protected]>
AuthorDate: Fri Oct 24 12:40:09 2025 -0700
[SPEC] Removing trailing whitespace (#14416)
---
format/spec.md | 44 ++++++++++++++++++++++----------------------
1 file changed, 22 insertions(+), 22 deletions(-)
diff --git a/format/spec.md b/format/spec.md
index 119de1779d..3da5c1cddf 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -102,7 +102,7 @@ Inheriting the sequence number from manifest metadata allows writing a new manif
Row-level deletes are stored in delete files.
-There are two types of row-level deletes:
+There are two types of row-level deletes:
* **Position deletes** -- Mark a row deleted by data file path and the row
position in the data file. Position deletes are encoded in a [_position delete
file_](#position-delete-files) (V2) or [_deletion vector_](#deletion-vectors)
(V3 or above).
@@ -346,26 +346,26 @@ Values for field ids which are not present in a data file must be resolved accor
* Return the value from partition metadata if an [Identity
Transform](#partition-transforms) exists for the field and the partition value
is present in the `partition` struct on `data_file` object in the manifest.
This allows for metadata only migrations of Hive tables.
* Use `schema.name-mapping.default` metadata to map field id to columns
without field id as described below and use the column if it is present.
-* Return the default value if it has a defined `initial-default` (See [Default values](#default-values) section for more details).
+* Return the default value if it has a defined `initial-default` (See [Default values](#default-values) section for more details).
* Return `null` in all other cases.
For example, a file may be written with schema `1: a int, 2: b string, 3: c
double` and read using projection schema `3: measurement, 2: name, 4: a`. This
must select file columns `c` (renamed to `measurement`), `b` (now called
`name`), and a column of `null` values called `a`; in that order.
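As a sketch, the fallback rules above for a field id missing from a data file can be expressed as an ordered lookup (the parameter names and tagged return values here are hypothetical illustration, not part of the spec):

```python
from typing import Any

def resolve_missing_field(field_id: int,
                          partition_values: dict[int, Any],
                          id_to_name: dict[int, str],
                          file_columns: set[str],
                          initial_defaults: dict[int, Any]):
    """Ordered fallback for a projected field id absent from the data file."""
    # 1. Identity-transform partition value from the manifest's `partition` struct.
    if field_id in partition_values:
        return ("partition", partition_values[field_id])
    # 2. schema.name-mapping.default: map the id to a name; use the column if present.
    name = id_to_name.get(field_id)
    if name is not None and name in file_columns:
        return ("column", name)
    # 3. A defined `initial-default` value.
    if field_id in initial_defaults:
        return ("default", initial_defaults[field_id])
    # 4. Null in all other cases.
    return ("null", None)
```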
Tables may also define a property `schema.name-mapping.default` with a JSON
name mapping containing a list of field mapping objects. These mappings provide
fallback field ids to be used when a data file does not contain field id
information. Each object should contain
-* `names`: A required list of 0 or more names for a field.
+* `names`: A required list of 0 or more names for a field.
* `field-id`: An optional Iceberg field ID used when a field's name is present
in `names`
* `fields`: An optional list of field mappings for child field of structs,
maps, and lists.
Field mapping fields are constrained by the following rules:
-* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.
-* Each child field should be defined with their own field mapping under `fields`.
+* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.
+* Each child field should be defined with their own field mapping under `fields`.
* Multiple values for `names` may be mapped to a single field ID to support
cases where a field may have different names in different data files. For
example, all Avro field aliases should be listed in `names`.
* Fields which exist only in the Iceberg schema and not in imported data files
may use an empty `names` list.
* Fields that exist in imported files but not in the Iceberg schema may omit
`field-id`.
-* List types should contain a mapping in `fields` for `element`.
-* Map types should contain mappings in `fields` for `key` and `value`.
+* List types should contain a mapping in `fields` for `element`.
+* Map types should contain mappings in `fields` for `key` and `value`.
* Struct types should contain mappings in `fields` for their child fields.
For details on serialization, see [Appendix C](#name-mapping-serialization).
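A minimal sketch of applying such a mapping — walking the `names`/`field-id`/`fields` objects to find a fallback id for a path of literal names (the sample mapping below is hypothetical):

```python
def lookup_field_id(mapping, path):
    """Return the fallback field id for a path of literal names, or None."""
    for entry in mapping:
        if path[0] in entry.get("names", []):
            if len(path) == 1:
                return entry.get("field-id")
            # Child fields are resolved by their own mappings under `fields`.
            return lookup_field_id(entry.get("fields", []), path[1:])
    return None

mapping = [
    {"names": ["id"], "field-id": 1},
    {"names": ["data", "payload"], "field-id": 2},      # aliases share one id
    {"names": ["loc"], "field-id": 3,
     "fields": [{"names": ["latitude"], "field-id": 4}]},
    {"names": ["a.b"], "field-id": 5},                  # literal name with a dot
]
```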
@@ -477,7 +477,7 @@ The snapshot then populates the total number of `added-rows` based on the sum of
When the new snapshot is committed, the table's `next-row-id` must also be
updated (even if the new snapshot is not in the main branch). Because 375 rows
were in data files in manifests that were assigned a `first_row_id` (`added1`
100+25, `added2` 0+100, `added3` 125+25) the new value is 1,000 + 375 = 1,375.
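The arithmetic can be checked directly (figures taken from the example above):

```python
next_row_id = 1_000
# Rows in data files assigned a first_row_id in the new snapshot's manifests:
# added1: 100 + 25, added2: 0 + 100, added3: 125 + 25.
assigned_rows = (100 + 25) + (0 + 100) + (125 + 25)
assert assigned_rows == 375
next_row_id += assigned_rows  # 1,000 + 375
```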
-##### Row Lineage for Upgraded Tables
+##### Row Lineage for Upgraded Tables
When a table is upgraded to v3, its `next-row-id` is initialized to 0 and
existing snapshots are not modified (that is, `first-row-id` remains unset or
null). For such snapshots without `first-row-id`, `first_row_id` values for
data files and data manifests are null, and values for `_row_id` are read as
null for all rows. When `first_row_id` is null, inherited row ID values are
also null.
@@ -594,9 +594,9 @@ A sort order is defined by a sort order id and a list of sort fields. The order
For details on how to serialize a sort order to JSON, see Appendix C.
-Order id `0` is reserved for the unsorted order.
+Order id `0` is reserved for the unsorted order.
-Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons.
+Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons.
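One way to realize this total ordering is the IEEE-754 bit-pattern trick, in the spirit of Java's `Double.compare` (a sketch, not mandated by the spec):

```python
import math
import struct

def float_sort_key(x: float) -> int:
    """Total-order key giving:
    -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN."""
    # Reinterpret the IEEE-754 double bits as a signed 64-bit integer.
    bits = struct.unpack('<q', struct.pack('<d', x))[0]
    # For negatives (sign bit set), flip the magnitude bits so larger
    # magnitudes sort first and -0.0 sorts just below +0.0.
    return bits if bits >= 0 else bits ^ 0x7FFFFFFFFFFFFFFF

neg_nan = math.copysign(math.nan, -1.0)
vals = [math.nan, 1.5, math.inf, 0.0, -0.0, -1.5, -math.inf, neg_nan]
ordered = sorted(vals, key=float_sort_key)
```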
A data or delete file is associated with a sort order by the sort order's id
within [a manifest](#manifests). Therefore, the table must declare all the sort
orders for lookup. A table could also be configured with a default sort order
id, indicating how the new data should be sorted by default. Writers should use
this default sort order to sort the data on write, but are not required to if
the default order is prohibitively expensive, as it would be for streaming
writes.
@@ -645,7 +645,7 @@ When a file is replaced or deleted from the dataset, its manifest entry fields s
Iceberg v2 adds data and file sequence numbers to the entry and makes the
snapshot ID optional. Values for these fields are inherited from manifest
metadata when `null`. That is, if the field is `null` for an entry, then the
entry must inherit its value from the manifest file's metadata, stored in the
manifest list.
The `sequence_number` field represents the data sequence number and must never
change after a file is added to the dataset. The data sequence number
represents a relative age of the file content and should be used for planning
which delete files apply to a data file.
-The `file_sequence_number` field represents the sequence number of the snapshot that added the file and must also remain unchanged upon assigning at commit. The file sequence number can't be used for pruning delete files as the data within the file may have an older data sequence number.
+The `file_sequence_number` field represents the sequence number of the snapshot that added the file and must also remain unchanged upon assigning at commit. The file sequence number can't be used for pruning delete files as the data within the file may have an older data sequence number.
The data and file sequence numbers are inherited only if the entry status is 1
(added). If the entry status is 0 (existing) or 2 (deleted), the entry must
include both sequence numbers explicitly.
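A sketch of that inheritance rule (field and status names follow the manifest entry schema; the helper itself is hypothetical):

```python
# Manifest entry status codes per the spec.
EXISTING, ADDED, DELETED = 0, 1, 2

def resolve_sequence_numbers(entry: dict, manifest_sequence_number: int):
    """Apply v2 inheritance: null sequence numbers on an ADDED entry are
    inherited from the manifest's metadata in the manifest list."""
    seq = entry.get("sequence_number")
    file_seq = entry.get("file_sequence_number")
    if entry["status"] == ADDED:
        if seq is None:
            seq = manifest_sequence_number
        if file_seq is None:
            file_seq = manifest_sequence_number
    else:
        # EXISTING/DELETED entries must carry both numbers explicitly.
        assert seq is not None and file_seq is not None
    return seq, file_seq
```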
Notes:
@@ -712,7 +712,7 @@ Examples of valid field paths using normalized JSON path format are:
* `$['event_type']` -- the `event_type` field in a Variant object
* `$['user.name']` -- the `"user.name"` field in a Variant object
* `$['location']['latitude']` -- the `latitude` field nested within a
`location` object
-* `$['tags']` -- the `tags` array
+* `$['tags']` -- the `tags` array
* `$['addresses']['zip']` -- the `zip` field in an `addresses` array that
contains objects
For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are
both points of the following coordinates X, Y, Z, and M (see Appendix G) which
are the lower / upper bound of all objects in the file.
@@ -862,7 +862,7 @@ The inclusive projection for an unknown partition transform is _true_ because th
Scan predicates are also used to filter data and delete files using column
bounds and counts that are stored by field id in manifests. The same filter
logic can be used for both data and delete files because both store metrics of
the rows either inserted or deleted. If metrics show that a delete file has no
rows that match a scan predicate, it may be ignored just as a data file would
be ignored [2].
-Data files that match the query filter must be read by the scan.
+Data files that match the query filter must be read by the scan.
Note that for any snapshot, all file paths marked with "ADDED" or "EXISTING"
may appear at most once across all manifest files in the snapshot. If a file
path appears more than once, the results of the scan are undefined. Reader
implementations may raise an error in this case, but are not required to do so.
@@ -897,7 +897,7 @@ Notes:
### Snapshot References
-Iceberg tables keep track of branches and tags using snapshot references.
+Iceberg tables keep track of branches and tags using snapshot references.
Tags are labels for individual snapshots. Branches are mutable named
references that can be updated by committing a new snapshot as the branch's
referenced snapshot using the [Commit Conflict Resolution and
Retry](#commit-conflict-resolution-and-retry) procedures.
The snapshot reference object records all the information of a reference
including snapshot ID, reference type and [Snapshot Retention
Policy](#snapshot-retention-policy).
@@ -1002,7 +1002,7 @@ Blob metadata is a struct with the following fields:
#### Partition Statistics
-Partition statistics files are based on [partition statistics file spec](#partition-statistics-file).
+Partition statistics files are based on [partition statistics file spec](#partition-statistics-file).
Partition statistics are not required for reading or planning and readers may
ignore them.
Each table snapshot may be associated with at most one partition statistics
file.
A writer can optionally write the partition statistics file during each write
operation, or it can also be computed on demand.
@@ -1040,8 +1040,8 @@ The schema of the partition statistics file is as follows:
| _optional_ | _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
Note that partition data tuple's schema is based on the partition spec output
using partition field ids for the struct field ids.
-The unified partition type is a struct containing all fields that have ever been a part of any spec in the table
-and sorted by the field ids in ascending order.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table
+and sorted by the field ids in ascending order.
In other words, the struct fields represent a union of all known partition
fields sorted in ascending order by the field ids.
For example,
@@ -1180,7 +1180,7 @@ When the deleted row column is present, its schema may be any subset of the tabl
To ensure the accuracy of statistics, all delete entries must include row
values, or the column must be omitted (this is why the column type is
`required`).
-The rows in the delete file must be sorted by `file_path` then `pos` to optimize filtering rows while scanning.
+The rows in the delete file must be sorted by `file_path` then `pos` to optimize filtering rows while scanning.
* Sorting by `file_path` allows filter pushdown by file in columnar storage
formats.
* Sorting by `pos` allows filtering rows while scanning, to avoid keeping
deletes in memory.
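For example, sorting hypothetical position-delete tuples by `(file_path, pos)`:

```python
# Position deletes as (file_path, pos) tuples; the paths are hypothetical.
deletes = [
    ("s3://bucket/data/b.parquet", 4),
    ("s3://bucket/data/a.parquet", 9),
    ("s3://bucket/data/a.parquet", 2),
]
# Required order: by file_path first, then by pos within each file.
deletes.sort(key=lambda d: (d[0], d[1]))
```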
@@ -1541,7 +1541,7 @@ Notes:
Older versions of the reference implementation can read tables with transforms
unknown to it, ignoring them. But other implementations may break if they
encounter unknown transforms. All v3 readers are required to read tables with
unknown transforms, ignoring them.
-The following table describes the possible values for the some of the field within sort field:
+The following table describes the possible values for the some of the field within sort field:
|Field|JSON representation|Possible values|
|--- |--- |--- |
@@ -1805,7 +1805,7 @@ This section covers topics not required by the specification but recommendations
### Point in Time Reads (Time Travel)
-Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
+Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
might indicate different snapshot IDs for a specific timestamp. The
discrepancies can be caused by a variety of table operations (e.g. updating the
`current-snapshot-id` can be used to set the snapshot of a table to any
arbitrary snapshot, which might have a lineage derived from a table branch or
no lineage at all).
When processing point in time queries implementations should use
"snapshot-log" metadata to lookup the table state at the given point in time.
This ensures time-travel queries reflect the state of the table at the provided
timestamp. For example a SQL query like `SELECT * FROM prod.db.table TIMESTAMP
AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table
just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the
metadata from that snapshot to perform th [...]
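A sketch of that lookup over the `snapshot-log` entries (`timestamp-ms`, `snapshot-id`), assuming the entries are ordered by timestamp:

```python
def snapshot_as_of(snapshot_log, ts_ms):
    """Return the snapshot-id that was current just prior to ts_ms,
    or None if the log starts after ts_ms."""
    result = None
    for entry in snapshot_log:  # entries assumed sorted by timestamp-ms
        if entry["timestamp-ms"] <= ts_ms:
            result = entry["snapshot-id"]
        else:
            break
    return result

log = [
    {"timestamp-ms": 100, "snapshot-id": 1},
    {"timestamp-ms": 200, "snapshot-id": 2},
    {"timestamp-ms": 300, "snapshot-id": 3},
]
```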
@@ -1847,7 +1847,7 @@ Snapshot summary can include metrics fields to track numeric stats of the snapsh
| **`manifests-created`** | Number of manifest files created in the snapshot |
| **`manifests-kept`** | Number of manifest files kept in the snapshot |
| **`manifests-replaced`** | Number of manifest files replaced in the snapshot |
-| **`entries-processed`** | Number of manifest entries processed in the snapshot |
+| **`entries-processed`** | Number of manifest entries processed in the snapshot |
#### Other Fields
@@ -1869,7 +1869,7 @@ Java writes `-1` for "no current snapshot" with V1 and V2 tables and considers t
### Naming for GZIP compressed Metadata JSON files
-Some implementations require that GZIP compressed files have the suffix `.gz.metadata.json` to be read correctly. The Java reference implementation can additionally read GZIP compressed files with the suffix `metadata.json.gz`.
+Some implementations require that GZIP compressed files have the suffix `.gz.metadata.json` to be read correctly. The Java reference implementation can additionally read GZIP compressed files with the suffix `metadata.json.gz`.
### Position Delete Files with Row Data