This is an automated email from the ASF dual-hosted git repository.
dweeks pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/main by this push:
new e7bd3fa214 [SPEC] Removing trailing whitespace (#14416)
e7bd3fa214 is described below
commit e7bd3fa214db616cab22243f8b8a6ec01ad5afdf
Author: emkornfield <[email protected]>
AuthorDate: Fri Oct 24 12:40:09 2025 -0700
[SPEC] Removing trailing whitespace (#14416)
---
format/spec.md | 44 ++++++++++++++++++++++----------------------
1 file changed, 22 insertions(+), 22 deletions(-)
diff --git a/format/spec.md b/format/spec.md
index 119de1779d..3da5c1cddf 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -102,7 +102,7 @@ Inheriting the sequence number from manifest metadata allows writing a new manif
Row-level deletes are stored in delete files.
-There are two types of row-level deletes:
+There are two types of row-level deletes:
* **Position deletes** -- Mark a row deleted by data file path and the row
position in the data file. Position deletes are encoded in a [_position delete
file_](#position-delete-files) (V2) or [_deletion vector_](#deletion-vectors)
(V3 or above).
@@ -346,26 +346,26 @@ Values for field ids which are not present in a data file must be resolved accor
* Return the value from partition metadata if an [Identity
Transform](#partition-transforms) exists for the field and the partition value
is present in the `partition` struct on `data_file` object in the manifest.
This allows for metadata only migrations of Hive tables.
* Use `schema.name-mapping.default` metadata to map field id to columns
without field id as described below and use the column if it is present.
-* Return the default value if it has a defined `initial-default` (See [Default values](#default-values) section for more details).
+* Return the default value if it has a defined `initial-default` (See [Default values](#default-values) section for more details).
* Return `null` in all other cases.
For example, a file may be written with schema `1: a int, 2: b string, 3: c
double` and read using projection schema `3: measurement, 2: name, 4: a`. This
must select file columns `c` (renamed to `measurement`), `b` (now called
`name`), and a column of `null` values called `a`; in that order.
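As a sketch, the fallback rules above for a field id missing from a data file can be expressed as an ordered lookup (the parameter names and tagged return values here are hypothetical illustration, not part of the spec):

```python
from typing import Any

def resolve_missing_field(field_id: int,
                          partition_values: dict[int, Any],
                          id_to_name: dict[int, str],
                          file_columns: set[str],
                          initial_defaults: dict[int, Any]):
    """Ordered fallback for a projected field id absent from the data file."""
    # 1. Identity-transform partition value from the manifest's `partition` struct.
    if field_id in partition_values:
        return ("partition", partition_values[field_id])
    # 2. schema.name-mapping.default: map the id to a name; use the column if present.
    name = id_to_name.get(field_id)
    if name is not None and name in file_columns:
        return ("column", name)
    # 3. A defined `initial-default` value.
    if field_id in initial_defaults:
        return ("default", initial_defaults[field_id])
    # 4. Null in all other cases.
    return ("null", None)
```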
Tables may also define a property `schema.name-mapping.default` with a JSON
name mapping containing a list of field mapping objects. These mappings provide
fallback field ids to be used when a data file does not contain field id
information. Each object should contain
-* `names`: A required list of 0 or more names for a field.
+* `names`: A required list of 0 or more names for a field.
* `field-id`: An optional Iceberg field ID used when a field's name is present
in `names`
* `fields`: An optional list of field mappings for child field of structs,
maps, and lists.
Field mapping fields are constrained by the following rules:
-* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.
-* Each child field should be defined with their own field mapping under `fields`.
+* A name may contain `.` but this refers to a literal name, not a nested field. For example, `a.b` refers to a field named `a.b`, not child field `b` of field `a`.
+* Each child field should be defined with their own field mapping under `fields`.
* Multiple values for `names` may be mapped to a single field ID to support
cases where a field may have different names in different data files. For
example, all Avro field aliases should be listed in `names`.
* Fields which exist only in the Iceberg schema and not in imported data files
may use an empty `names` list.
* Fields that exist in imported files but not in the Iceberg schema may omit
`field-id`.
-* List types should contain a mapping in `fields` for `element`.
-* Map types should contain mappings in `fields` for `key` and `value`.
+* List types should contain a mapping in `fields` for `element`.
+* Map types should contain mappings in `fields` for `key` and `value`.
* Struct types should contain mappings in `fields` for their child fields.
For details on serialization, see [Appendix C](#name-mapping-serialization).
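A minimal sketch of applying such a mapping — walking the `names`/`field-id`/`fields` objects to find a fallback id for a path of literal names (the sample mapping below is hypothetical):

```python
def lookup_field_id(mapping, path):
    """Return the fallback field id for a path of literal names, or None."""
    for entry in mapping:
        if path[0] in entry.get("names", []):
            if len(path) == 1:
                return entry.get("field-id")
            # Child fields are resolved by their own mappings under `fields`.
            return lookup_field_id(entry.get("fields", []), path[1:])
    return None

mapping = [
    {"names": ["id"], "field-id": 1},
    {"names": ["data", "payload"], "field-id": 2},      # aliases share one id
    {"names": ["loc"], "field-id": 3,
     "fields": [{"names": ["latitude"], "field-id": 4}]},
    {"names": ["a.b"], "field-id": 5},                  # literal name with a dot
]
```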
@@ -477,7 +477,7 @@ The snapshot then populates the total number of `added-rows` based on the sum of
When the new snapshot is committed, the table's `next-row-id` must also be
updated (even if the new snapshot is not in the main branch). Because 375 rows
were in data files in manifests that were assigned a `first_row_id` (`added1`
100+25, `added2` 0+100, `added3` 125+25) the new value is 1,000 + 375 = 1,375.
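The arithmetic can be checked directly (figures taken from the example above):

```python
next_row_id = 1_000
# Rows in data files assigned a first_row_id in the new snapshot's manifests:
# added1: 100 + 25, added2: 0 + 100, added3: 125 + 25.
assigned_rows = (100 + 25) + (0 + 100) + (125 + 25)
assert assigned_rows == 375
next_row_id += assigned_rows  # 1,000 + 375
```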
-##### Row Lineage for Upgraded Tables
+##### Row Lineage for Upgraded Tables
When a table is upgraded to v3, its `next-row-id` is initialized to 0 and
existing snapshots are not modified (that is, `first-row-id` remains unset or
null). For such snapshots without `first-row-id`, `first_row_id` values for
data files and data manifests are null, and values for `_row_id` are read as
null for all rows. When `first_row_id` is null, inherited row ID values are
also null.
@@ -594,9 +594,9 @@ A sort order is defined by a sort order id and a list of sort fields. The order
For details on how to serialize a sort order to JSON, see Appendix C.
-Order id `0` is reserved for the unsorted order.
+Order id `0` is reserved for the unsorted order.
-Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons.
+Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons.
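One way to realize this total ordering is the IEEE-754 bit-pattern trick, in the spirit of Java's `Double.compare` (a sketch, not mandated by the spec):

```python
import math
import struct

def float_sort_key(x: float) -> int:
    """Total-order key giving:
    -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN."""
    # Reinterpret the IEEE-754 double bits as a signed 64-bit integer.
    bits = struct.unpack('<q', struct.pack('<d', x))[0]
    # For negatives (sign bit set), flip the magnitude bits so larger
    # magnitudes sort first and -0.0 sorts just below +0.0.
    return bits if bits >= 0 else bits ^ 0x7FFFFFFFFFFFFFFF

neg_nan = math.copysign(math.nan, -1.0)
vals = [math.nan, 1.5, math.inf, 0.0, -0.0, -1.5, -math.inf, neg_nan]
ordered = sorted(vals, key=float_sort_key)
```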
A data or delete file is associated with a sort order by the sort order's id
within [a manifest](#manifests). Therefore, the table must declare all the sort
orders for lookup. A table could also be configured with a default sort order
id, indicating how the new data should be sorted by default. Writers should use
this default sort order to sort the data on write, but are not required to if
the default order is prohibitively expensive, as it would be for streaming
writes.
@@ -645,7 +645,7 @@ When a file is replaced or deleted from the dataset, its manifest entry fields s
Iceberg v2 adds data and file sequence numbers to the entry and makes the
snapshot ID optional. Values for these fields are inherited from manifest
metadata when `null`. That is, if the field is `null` for an entry, then the
entry must inherit its value from the manifest file's metadata, stored in the
manifest list.
The `sequence_number` field represents the data sequence number and must never
change after a file is added to the dataset. The data sequence number
represents a relative age of the file content and should be used for planning
which delete files apply to a data file.
-The `file_sequence_number` field represents the sequence number of the snapshot that added the file and must also remain unchanged upon assigning at commit. The file sequence number can't be used for pruning delete files as the data within the file may have an older data sequence number.
+The `file_sequence_number` field represents the sequence number of the snapshot that added the file and must also remain unchanged upon assigning at commit. The file sequence number can't be used for pruning delete files as the data within the file may have an older data sequence number.
The data and file sequence numbers are inherited only if the entry status is 1
(added). If the entry status is 0 (existing) or 2 (deleted), the entry must
include both sequence numbers explicitly.
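A sketch of that inheritance rule (field and status names follow the manifest entry schema; the helper itself is hypothetical):

```python
# Manifest entry status codes per the spec.
EXISTING, ADDED, DELETED = 0, 1, 2

def resolve_sequence_numbers(entry: dict, manifest_sequence_number: int):
    """Apply v2 inheritance: null sequence numbers on an ADDED entry are
    inherited from the manifest's metadata in the manifest list."""
    seq = entry.get("sequence_number")
    file_seq = entry.get("file_sequence_number")
    if entry["status"] == ADDED:
        if seq is None:
            seq = manifest_sequence_number
        if file_seq is None:
            file_seq = manifest_sequence_number
    else:
        # EXISTING/DELETED entries must carry both numbers explicitly.
        assert seq is not None and file_seq is not None
    return seq, file_seq
```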
Notes:
@@ -712,7 +712,7 @@ Examples of valid field paths using normalized JSON path format are:
* `$['event_type']` -- the `event_type` field in a Variant object
* `$['user.name']` -- the `"user.name"` field in a Variant object
* `$['location']['latitude']` -- the `latitude` field nested within a
`location` object
-* `$['tags']` -- the `tags` array
+* `$['tags']` -- the `tags` array
* `$['addresses']['zip']` -- the `zip` field in an `addresses` array that
contains objects
For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are
both points of the following coordinates X, Y, Z, and M (see Appendix G) which
are the lower / upper bound of all objects in the file.
@@ -862,7 +862,7 @@ The inclusive projection for an unknown partition transform is _true_ because th
Scan predicates are also used to filter data and delete files using column
bounds and counts that are stored by field id in manifests. The same filter
logic can be used for both data and delete files because both store metrics of
the rows either inserted or deleted. If metrics show that a delete file has no
rows that match a scan predicate, it may be ignored just as a data file would
be ignored [2].
-Data files that match the query filter must be read by the scan.
+Data files that match the query filter must be read by the scan.
Note that for any snapshot, all file paths marked with "ADDED" or "EXISTING"
may appear at most once across all manifest files in the snapshot. If a file
path appears more than once, the results of the scan are undefined. Reader
implementations may raise an error in this case, but are not required to do so.
@@ -897,7 +897,7 @@ Notes:
### Snapshot References
-Iceberg tables keep track of branches and tags using snapshot references.
+Iceberg tables keep track of branches and tags using snapshot references.
Tags are labels for individual snapshots. Branches are mutable named
references that can be updated by committing a new snapshot as the branch's
referenced snapshot using the [Commit Conflict Resolution and
Retry](#commit-conflict-resolution-and-retry) procedures.
The snapshot reference object records all the information of a reference
including snapshot ID, reference type and [Snapshot Retention
Policy](#snapshot-retention-policy).
@@ -1002,7 +1002,7 @@ Blob metadata is a struct with the following fields:
#### Partition Statistics
-Partition statistics files are based on [partition statistics file spec](#partition-statistics-file).
+Partition statistics files are based on [partition statistics file spec](#partition-statistics-file).
Partition statistics are not required for reading or planning and readers may
ignore them.
Each table snapshot may be associated with at most one partition statistics
file.
A writer can optionally write the partition statistics file during each write
operation, or it can also be computed on demand.
@@ -1040,8 +1040,8 @@ The schema of the partition statistics file is as follows:
| _optional_ | _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of snapshot that last updated this partition |
Note that partition data tuple's schema is based on the partition spec output
using partition field ids for the struct field ids.
-The unified partition type is a struct containing all fields that have ever been a part of any spec in the table
-and sorted by the field ids in ascending order.
+The unified partition type is a struct containing all fields that have ever been a part of any spec in the table
+and sorted by the field ids in ascending order.
In other words, the struct fields represent a union of all known partition
fields sorted in ascending order by the field ids.
For example,
@@ -1180,7 +1180,7 @@ When the deleted row column is present, its schema may be any subset of the tabl
To ensure the accuracy of statistics, all delete entries must include row
values, or the column must be omitted (this is why the column type is
`required`).
-The rows in the delete file must be sorted by `file_path` then `pos` to optimize filtering rows while scanning.
+The rows in the delete file must be sorted by `file_path` then `pos` to optimize filtering rows while scanning.
* Sorting by `file_path` allows filter pushdown by file in columnar storage
formats.
* Sorting by `pos` allows filtering rows while scanning, to avoid keeping
deletes in memory.
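For example, sorting hypothetical position-delete tuples by `(file_path, pos)`:

```python
# Position deletes as (file_path, pos) tuples; the paths are hypothetical.
deletes = [
    ("s3://bucket/data/b.parquet", 4),
    ("s3://bucket/data/a.parquet", 9),
    ("s3://bucket/data/a.parquet", 2),
]
# Required order: by file_path first, then by pos within each file.
deletes.sort(key=lambda d: (d[0], d[1]))
```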
@@ -1541,7 +1541,7 @@ Notes:
Older versions of the reference implementation can read tables with transforms
unknown to it, ignoring them. But other implementations may break if they
encounter unknown transforms. All v3 readers are required to read tables with
unknown transforms, ignoring them.
-The following table describes the possible values for the some of the field within sort field:
+The following table describes the possible values for the some of the field within sort field:
|Field|JSON representation|Possible values|
|--- |--- |--- |
@@ -1805,7 +1805,7 @@ This section covers topics not required by the specification but recommendations
### Point in Time Reads (Time Travel)
-Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
+Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
might indicate different snapshot IDs for a specific timestamp. The
discrepancies can be caused by a variety of table operations (e.g. updating the
`current-snapshot-id` can be used to set the snapshot of a table to any
arbitrary snapshot, which might have a lineage derived from a table branch or
no lineage at all).
When processing point in time queries implementations should use
"snapshot-log" metadata to lookup the table state at the given point in time.
This ensures time-travel queries reflect the state of the table at the provided
timestamp. For example a SQL query like `SELECT * FROM prod.db.table TIMESTAMP
AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table
just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the
metadata from that snapshot to perform th [...]
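A sketch of that lookup over the `snapshot-log` entries (`timestamp-ms`, `snapshot-id`), assuming the entries are ordered by timestamp:

```python
def snapshot_as_of(snapshot_log, ts_ms):
    """Return the snapshot-id that was current just prior to ts_ms,
    or None if the log starts after ts_ms."""
    result = None
    for entry in snapshot_log:  # entries assumed sorted by timestamp-ms
        if entry["timestamp-ms"] <= ts_ms:
            result = entry["snapshot-id"]
        else:
            break
    return result

log = [
    {"timestamp-ms": 100, "snapshot-id": 1},
    {"timestamp-ms": 200, "snapshot-id": 2},
    {"timestamp-ms": 300, "snapshot-id": 3},
]
```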
@@ -1847,7 +1847,7 @@ Snapshot summary can include metrics fields to track numeric stats of the snapsh
| **`manifests-created`** | Number of manifest files created in the snapshot |
| **`manifests-kept`** | Number of manifest files kept in the snapshot |
| **`manifests-replaced`** | Number of manifest files replaced in the snapshot |
-| **`entries-processed`** | Number of manifest entries processed in the snapshot |
+| **`entries-processed`** | Number of manifest entries processed in the snapshot |
#### Other Fields
@@ -1869,7 +1869,7 @@ Java writes `-1` for "no current snapshot" with V1 and V2 tables and considers t
### Naming for GZIP compressed Metadata JSON files
-Some implementations require that GZIP compressed files have the suffix `.gz.metadata.json` to be read correctly. The Java reference implementation can additionally read GZIP compressed files with the suffix `metadata.json.gz`.
+Some implementations require that GZIP compressed files have the suffix `.gz.metadata.json` to be read correctly. The Java reference implementation can additionally read GZIP compressed files with the suffix `metadata.json.gz`.
### Position Delete Files with Row Data