This is an automated email from the ASF dual-hosted git repository.
amoghj pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/main by this push:
new 525d887811 Spec: Clarify identity partition edge cases (#10835)
525d887811 is described below
commit 525d887811b2fd2140779e125243cb70742e169c
Author: emkornfield <[email protected]>
AuthorDate: Mon Aug 5 18:06:36 2024 -0700
Spec: Clarify identity partition edge cases (#10835)
---
format/spec.md | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/format/spec.md b/format/spec.md
index daef7538e7..c3321fa699 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -150,6 +150,10 @@ Readers should be more permissive because v1 metadata
files are allowed in v2 ta
Readers may be more strict for metadata JSON files because the JSON files are
not reused and will always match the table version. Required v2 fields that
were not present in v1 or optional in v1 may be handled as required fields. For
example, a v2 table that is missing `last-sequence-number` can throw an
exception.
+##### Writing data files
+
+All columns must be written to data files even if they introduce redundancy
with metadata stored in manifest files (e.g. columns with identity partition
transforms). Writing all columns provides a backup in case of corruption or
bugs in the metadata layer.
+
### Schemas and Data Types
A table's **schema** is a list of named columns. All data types are either
primitives or nested types, which are maps, lists, or structs. A table schema
is also a struct type.
@@ -241,7 +245,14 @@ Struct evolution requires the following rules for default
values:
#### Column Projection
-Columns in Iceberg data files are selected by field id. The table schema's
column names and order may change after a data file is written, and projection
must be done using field ids. If a field id is missing from a data file, its
value for each row should be `null`.
+Columns in Iceberg data files are selected by field id. The table schema's
column names and order may change after a data file is written, and projection
must be done using field ids.
+
+Values for field ids which are not present in a data file must be resolved
according to the following rules:
+
+* Return the value from partition metadata if an [Identity
Transform](#partition-transforms) exists for the field and the partition value
is present in the `partition` struct on the `data_file` object in the manifest.
This allows for metadata-only migrations of Hive tables.
+* Use `schema.name-mapping.default` metadata to map the field id to a column
without a field id, as described below, and use that column if it is present.
+* Return the default value if the field has a defined `initial-default` (see the [Default
values](#default-values) section for more details).
+* Return `null` in all other cases.
For example, a file may be written with schema `1: a int, 2: b string, 3: c
double` and read using projection schema `3: measurement, 2: name, 4: a`. This
must select file columns `c` (renamed to `measurement`), `b` (now called
`name`), and a column of `null` values called `a`; in that order.
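
The resolution order added by this change can be sketched in a few lines of Python. This is only an illustration, not Iceberg's implementation: `ProjectedField`, `DataFile`, `resolve_column`, and `project` are hypothetical names, columns are modeled as plain lists, and the identity partition values are assumed to be keyed by source field id. The example at the end mirrors the `measurement`, `name`, `a` projection described above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

NO_DEFAULT = object()  # sentinel: the field defines no initial-default


@dataclass
class ProjectedField:
    """Hypothetical stand-in for one field of a projection schema."""
    field_id: int
    name: str
    initial_default: Any = NO_DEFAULT


@dataclass
class DataFile:
    """Hypothetical, simplified view of a data file and its manifest entry."""
    num_rows: int
    # Columns written with field ids, keyed by field id.
    columns_by_id: Dict[int, List[Any]] = field(default_factory=dict)
    # Columns written without field ids, keyed by column name.
    columns_by_name: Dict[str, List[Any]] = field(default_factory=dict)
    # Identity-transform partition values, keyed by source field id (assumed layout).
    identity_partition: Dict[int, Any] = field(default_factory=dict)


def resolve_column(f: ProjectedField, data_file: DataFile,
                   name_mapping: Dict[int, List[str]]) -> List[Any]:
    # Field id present in the file: select the column by id.
    if f.field_id in data_file.columns_by_id:
        return data_file.columns_by_id[f.field_id]
    # Rule 1: identity partition value recorded in the manifest.
    if f.field_id in data_file.identity_partition:
        return [data_file.identity_partition[f.field_id]] * data_file.num_rows
    # Rule 2: name mapping onto a column that was written without a field id.
    for column_name in name_mapping.get(f.field_id, []):
        if column_name in data_file.columns_by_name:
            return data_file.columns_by_name[column_name]
    # Rule 3: the field's initial-default, if one is defined.
    if f.initial_default is not NO_DEFAULT:
        return [f.initial_default] * data_file.num_rows
    # Rule 4: null in all other cases.
    return [None] * data_file.num_rows


def project(projection: List[ProjectedField], data_file: DataFile,
            name_mapping: Optional[Dict[int, List[str]]] = None) -> Dict[str, List[Any]]:
    mapping = name_mapping or {}
    # Dict insertion order keeps the columns in projection order.
    return {f.name: resolve_column(f, data_file, mapping) for f in projection}


# File written with schema `1: a int, 2: b string, 3: c double`.
data_file = DataFile(
    num_rows=2,
    columns_by_id={1: [10, 20], 2: ["x", "y"], 3: [1.5, 2.5]},
)
projection = [
    ProjectedField(3, "measurement"),
    ProjectedField(2, "name"),
    ProjectedField(4, "a"),
]
result = project(projection, data_file)
```

Here `result` holds `measurement` and `name` taken from file columns `c` and `b`, followed by a column `a` of nulls, because field id 4 has no partition value, no name mapping, and no `initial-default` in this sketch.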