(iceberg) branch main updated: Spec: Clarify identity partition edge cases (#10835)

amoghj Mon, 05 Aug 2024 18:06:46 -0700

This is an automated email from the ASF dual-hosted git repository.

amoghj pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git



The following commit(s) were added to refs/heads/main by this push:
     new 525d887811 Spec: Clarify identity partition edge cases (#10835)
525d887811 is described below

commit 525d887811b2fd2140779e125243cb70742e169c
Author: emkornfield <[email protected]>
AuthorDate: Mon Aug 5 18:06:36 2024 -0700

    Spec: Clarify identity partition edge cases (#10835)
---
 format/spec.md | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/format/spec.md b/format/spec.md
index daef7538e7..c3321fa699 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -150,6 +150,10 @@ Readers should be more permissive because v1 metadata files are allowed in v2 ta
 
 Readers may be more strict for metadata JSON files because the JSON files are not reused and will always match the table version. Required v2 fields that were not present in v1 or optional in v1 may be handled as required fields. For example, a v2 table that is missing `last-sequence-number` can throw an exception.
 
+##### Writing data files
+
+All columns must be written to data files even if they introduce redundancy with metadata stored in manifest files (e.g. columns with identity partition transforms). Writing all columns provides a backup in case of corruption or bugs in the metadata layer.
+
 ### Schemas and Data Types
 
 A table's **schema** is a list of named columns. All data types are either primitives or nested types, which are maps, lists, or structs. A table schema is also a struct type.
@@ -241,7 +245,14 @@ Struct evolution requires the following rules for default values:
 
 #### Column Projection
 
-Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, and projection must be done using field ids. If a field id is missing from a data file, its value for each row should be `null`.
+Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, and projection must be done using field ids.
+
+Values for field ids which are not present in a data file must be resolved according to the following rules:
+
+* Return the value from partition metadata if an [Identity Transform](#partition-transforms) exists for the field and the partition value is present in the `partition` struct on `data_file` object in the manifest. This allows for metadata only migrations of Hive tables.
+* Use `schema.name-mapping.default` metadata to map field id to columns without field id as described below and use the column if it is present.
+* Return the default value if it has a defined `initial-default` (See [Default values](#default-values) section for more details).
+* Return `null` in all other cases.
 
 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
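
The resolution order added to the Column Projection section can be sketched in Python as below. This is only an illustration, not the Iceberg library API: the function names (`resolve_missing_field`, `project_row`), the dictionary-based inputs, and the per-row (rather than per-column) handling are all assumptions made for the example. It also reproduces the projection example from the spec text.

```python
from typing import Any, Mapping, Sequence, Tuple


def resolve_missing_field(
    field_id: int,
    identity_partition_values: Mapping[int, Any],
    name_mapping: Mapping[int, Sequence[str]],
    file_columns_by_name: Mapping[str, Any],
    initial_defaults: Mapping[int, Any],
) -> Any:
    """Resolve a value for a field id that is absent from the data file.

    All parameters are hypothetical stand-ins for table metadata:
    - identity_partition_values: field id -> value taken from the
      `partition` struct for identity-transformed source fields
    - name_mapping: field id -> candidate column names (a stand-in for
      `schema.name-mapping.default`)
    - file_columns_by_name: columns in the file that carry no field id
    - initial_defaults: field id -> `initial-default`, if one is defined
    """
    # Rule 1: identity partition value recorded in the manifest.
    if field_id in identity_partition_values:
        return identity_partition_values[field_id]
    # Rule 2: name mapping to a column written without a field id.
    for name in name_mapping.get(field_id, ()):
        if name in file_columns_by_name:
            return file_columns_by_name[name]
    # Rule 3: the field's initial-default, when defined.
    if field_id in initial_defaults:
        return initial_defaults[field_id]
    # Rule 4: null in all other cases.
    return None


def project_row(
    file_row_by_id: Mapping[int, Any],
    projection: Sequence[Tuple[int, str]],
    **metadata: Any,
) -> dict:
    """Project one row by field id, in the order of the projection schema."""
    return {
        name: file_row_by_id[fid]
        if fid in file_row_by_id
        else resolve_missing_field(fid, **metadata)
        for fid, name in projection
    }


# File written with `1: a int, 2: b string, 3: c double`,
# read with projection `3: measurement, 2: name, 4: a`.
row = project_row(
    {1: 7, 2: "x", 3: 1.5},
    [(3, "measurement"), (2, "name"), (4, "a")],
    identity_partition_values={},
    name_mapping={},
    file_columns_by_name={},
    initial_defaults={},
)
print(row)
```

With no partition value, name mapping, or default available for field id 4, the projected `a` column falls through to `null` (`None` here), matching the example in the spec text.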
