rzhang10 commented on a change in pull request #4301:
URL: https://github.com/apache/iceberg/pull/4301#discussion_r825081342
##########
File path: format/spec.md
##########
@@ -193,10 +193,38 @@ Notes:
For details on how to serialize a schema to JSON, see Appendix C.
+#### Default value
+Default values can be assigned to top-level columns or nested fields. Default
+values are used during schema evolution when adding a new column. The default
+value is used when reading rows from data files written before the schema
+evolution, which therefore lack the column or nested field.
Review comment:
> Here are some more concrete evolution rules with defaults:
>
> * When creating a required field, an initial default value must be set.
>   Neither field values nor the default may be `null`.
> * When creating an optional field, an initial default value can be set. If
>   the initial default value is not set, it is `null`.
> * The initial default value may only be set when adding a field (or
>   through an incompatible change in the API)
> * The write default value for a field starts as the initial default value,
>   and is considered set if the field has an initial default
> * The write default value may be changed
> * When writing, the write default must be written into data files if a
>   value for the column is not supplied
> * When writing a field with no write default, the column must be supplied
>   or the write must fail
>
> Does that make it clear?
@rdblue
I just want to ask a clarification question so that I can make sure I
understand the discussion above correctly.
To summarize, you mean there should be two underlying `default value`
concepts: an `initial default value` and a `write default value`. The
`initial default value` affects old rows that were already written without
the column, and the `write default value` affects future rows.
And we only expose one `default value` API to users, but that API actually
manages the two underlying default value concepts; the split is just hidden
from users.
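To make sure I'm picturing the two concepts right, here is a rough sketch of
how I imagine them fitting together. The class and method names below are
purely illustrative, not the actual Iceberg API:

```java
// Illustrative sketch only -- a hypothetical holder for a field's two defaults.
class FieldDefaults<T> {
  // Set once when the field is added and never changed afterwards. Readers
  // use this value for data files written before the field existed.
  private final T initialDefault;

  // Starts equal to the initial default and may be changed later. Writers
  // materialize this value when no value is supplied for the field.
  private T writeDefault;

  FieldDefaults(T initialDefault) {
    this.initialDefault = initialDefault;
    this.writeDefault = initialDefault;
  }

  T readDefault() {
    return initialDefault;
  }

  T writeDefault() {
    return writeDefault;
  }

  void updateWriteDefault(T newDefault) {
    this.writeDefault = newDefault;
  }
}
```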
Say I have the following scenario: there is a dataset with some rows already
written, call them `R`. A user later adds a new column with default value
`d1`, and then changes the default value to `d2` (I assume that changes the
`write default` value?). Is it true that, according to this spec, when the
user reads the dataset at this point, the `R` rows **must** read `d1` for
that column, regardless of whether an actual full rewrite happened between
the change from `d1` to `d2`? Implementation-wise, I think we want to do the
rewriting lazily, not rewrite eagerly when we set a default value.
Is my above understanding correct?
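To spell out the behavior I'm asking about, here is the scenario in terms of
the hypothetical `FieldDefaults` sketch above (again, illustrative only, not
the actual API):

```java
public class DefaultScenario {
  public static void main(String[] args) {
    // Rows R exist before the column does, so no default is involved yet.

    // The column is added with default d1: both the initial default and the
    // write default are d1 at this point.
    FieldDefaults<String> col = new FieldDefaults<>("d1");

    // The default is later changed to d2: only the write default changes; the
    // initial default stays d1 and no data files are rewritten.
    col.updateWriteDefault("d2");

    // Reading the R rows fills in the initial default, d1, lazily at read time.
    System.out.println("R rows read: " + col.readDefault());    // prints d1

    // New rows written without a value for the column get the write default, d2.
    System.out.println("new rows get: " + col.writeDefault());  // prints d2
  }
}
```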