wmoustafa commented on a change in pull request #4301:
URL: https://github.com/apache/iceberg/pull/4301#discussion_r840467740
##########
File path: format/spec.md
##########
@@ -193,6 +193,17 @@ Notes:
For details on how to serialize a schema to JSON, see Appendix C.
+#### Default value
+
+Default value can be assigned to a column when the column is added to an
Iceberg table as part of the schema evolution. They are tracked at the level of
a nested field inside a struct, thus it can be used for both top-level columns
and nested columns. Iceberg tracks two default values internally:
`initial-default` and `write-default`. The `initial-default` is used to read
rows belonging to files that lack the column (i.e. the files were written
before the column is added); the `write-default` value will be used for the
automatically populating the column if user later inserts new rows without
specifying the column.
Review comment:
A couple of suggestions for the names: `file-to-record default` and
`record-to-file default`, or `read-time default` and `write-time default`. I
feel `file-to-record default` and `record-to-file default` are expressive,
symmetric, and accurately capture the function of each default. We might also
explain that the former cannot be changed because it is used to read existing
files, while the latter can be changed throughout the lifecycle of a table
because it is used to write new files.
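
To make the asymmetry concrete, here is a minimal sketch in Python. The class `FieldDefaults` and its attribute names are hypothetical, chosen only to mirror the names suggested above; this is not Iceberg's API, just one way to model "fixed once set" versus "changeable".

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FieldDefaults:
    """Hypothetical holder for a field's two defaults (not an Iceberg class)."""

    file_to_record: Any  # set when the field is added; used to read old files
    record_to_file: Any  # fills in omitted values on write; may change later

    def __setattr__(self, name, value):
        # Allow the first assignment (made by __init__), reject any later one.
        if name == "file_to_record" and "file_to_record" in self.__dict__:
            raise AttributeError("file-to-record default cannot be changed")
        super().__setattr__(name, value)


defaults = FieldDefaults(file_to_record=0, record_to_file=0)
defaults.record_to_file = 10  # allowed: only affects rows written from now on

try:
    defaults.file_to_record = 10  # rejected: would change how old files read
    changed = True
except AttributeError:
    changed = False
```

Under this model the read-side default is effectively part of the field's definition, while the write-side default is ordinary mutable metadata.
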
We might also start by explaining the semantics of default values from a
_contract_ point of view:
> A default value associated with a field is used to:
> (1) populate the field's value for all records that were written before
the field is introduced
> (2) populate the field's value for any records that will be written after
the field is introduced, when such records do not supply that field's value.
> All fields introduced at table creation time (i.e., not part of schema
evolution) only leverage the second use case. Fields introduced as part of a
schema evolution can leverage both use cases.
>
> Changes to a default value apply to future records only (i.e., records
written after the default value has changed). To prevent retroactive changes to
the values of records that were written before the field was introduced (and
whose files therefore lack the field), each field is internally associated
with two default values ... (then we can continue with the section above).
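
The contract above can be sketched as a small simulation. Everything here is an assumption made for illustration: files are modeled as lists of row dicts, the column is called `c`, and `write_row` / `read_row` are hypothetical helpers, not Iceberg functions.

```python
def write_row(row, write_default):
    """Materialize the write default into the row if the writer omitted c."""
    out = dict(row)
    out.setdefault("c", write_default)
    return out


def read_row(row, initial_default):
    """Supply c at read time for rows whose file predates the column."""
    return row["c"] if "c" in row else initial_default


old_file = [{"id": 1}]  # written before column c was added

# Column c is added with default 0: both defaults start out equal.
initial_default, write_default = 0, 0
new_file = [write_row({"id": 2}, write_default)]

# The default is later changed to 10: only the write default moves.
write_default = 10
newer_file = [write_row({"id": 3}, write_default)]

values = [
    read_row(row, initial_default)
    for data_file in (old_file, new_file, newer_file)
    for row in data_file
]
# values == [0, 0, 10]
```

Rows 1 and 2 keep reading as 0 after the default changes to 10: row 1 because the read-side default never moved, row 2 because its value was written into the file at insert time. Only row 3, written after the change, picks up the new default.
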
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]