rzhang10 commented on a change in pull request #4301:
URL: https://github.com/apache/iceberg/pull/4301#discussion_r825081342
##########
File path: format/spec.md
##########
@@ -193,10 +193,38 @@ Notes:
For details on how to serialize a schema to JSON, see Appendix C.
+#### Default value
+Default values can be assigned to top-level columns or nested fields. Default
+values are used during schema evolution when adding a new column. The default
+value is used when reading rows from data files written before the schema
+evolution, which therefore lack the column or nested field.
Review comment:
> Here are some more concrete evolution rules with defaults:
>
> * When creating a required field, an initial default value must be set.
>   Neither field values nor the default may be `null`.
> * When creating an optional field, an initial default value can be set. If
>   the initial default value is not set, it is `null`.
> * The initial default value may only be set when adding a field (or
>   through an incompatible change in the API)
> * The write default value for a field starts as the initial default value,
>   and is considered set if the field has an initial default
> * The write default value may be changed
> * When writing, the write default must be written into data files if a
>   value for the column is not supplied
> * When writing a field with no write default, the column must be supplied
>   or the write must fail
>
> Does that make it clear?
@rdblue
I just want to ask a clarification question so that I can make sure I
understand the discussion above correctly.
To summarize, you mean there should be two underlying `default value`
concepts: an `initial default value` and a `write default value`. The
`initial default value` affects old rows that were already written without
the column, and the `write default value` affects future rows.
And we only expose one `default value` API to users, but that API actually
manages the two underlying default value concepts; the split is just hidden
from users.
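To make sure I'm picturing the two concepts right, here is a rough sketch of
how I imagine them fitting together. The class and method names below are
purely illustrative, not the actual Iceberg API:

```java
// Illustrative sketch only -- a hypothetical holder for a field's two defaults.
class FieldDefaults<T> {
  // Set once when the field is added and never changed afterwards. Readers
  // use this value for data files written before the field existed.
  private final T initialDefault;

  // Starts equal to the initial default and may be changed later. Writers
  // materialize this value when no value is supplied for the field.
  private T writeDefault;

  FieldDefaults(T initialDefault) {
    this.initialDefault = initialDefault;
    this.writeDefault = initialDefault;
  }

  T readDefault() {
    return initialDefault;
  }

  T writeDefault() {
    return writeDefault;
  }

  void updateWriteDefault(T newDefault) {
    this.writeDefault = newDefault;
  }
}
```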
Say I have the following scenario: there is a dataset with some rows already
written, call them `R`. A user later adds a new column with default value
`d1`, and then changes the default value to `d2` (I assume that changes the
`write default` value?). Is it true that, according to this spec, when the
user reads the dataset at this point, the `R` rows **must** read `d1` for
that column, regardless of whether an actual full rewrite happened between
the change from `d1` to `d2`? Implementation-wise, I think we want to do the
rewriting lazily, not rewrite eagerly when we set a default value.
Is my above understanding correct?
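To spell out the behavior I'm asking about, here is the scenario in terms of
the hypothetical `FieldDefaults` sketch above (again, illustrative only, not
the actual API):

```java
public class DefaultScenario {
  public static void main(String[] args) {
    // Rows R exist before the column does, so no default is involved yet.

    // The column is added with default d1: both the initial default and the
    // write default are d1 at this point.
    FieldDefaults<String> col = new FieldDefaults<>("d1");

    // The default is later changed to d2: only the write default changes; the
    // initial default stays d1 and no data files are rewritten.
    col.updateWriteDefault("d2");

    // Reading the R rows fills in the initial default, d1, lazily at read time.
    System.out.println("R rows read: " + col.readDefault());    // prints d1

    // New rows written without a value for the column get the write default, d2.
    System.out.println("new rows get: " + col.writeDefault());  // prints d2
  }
}
```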