wmoustafa commented on a change in pull request #4301:
URL: https://github.com/apache/iceberg/pull/4301#discussion_r824410358



##########
File path: format/spec.md
##########
@@ -193,10 +193,38 @@ Notes:
 
 For details on how to serialize a schema to JSON, see Appendix C.
 
+#### Default value
+Default values can be assigned to top-level columns or nested fields. Default
+values are used during schema evolution when adding a new column. The default
+value is used to read rows belonging to the files that lack the column or
+nested field prior to the schema evolution.

Review comment:
       Ok, now that I see the code snippets, the idea is much clearer. I will 
refer to the code snippets as (1), (2), and (3), respectively.
   
   Code snippet (1) is really nice. I think it is closer to what I was 
expressing in option (2) of my last comment. That is:
   * It introduces the column and its schema-evolution default value 
atomically. That is equivalent to a rewrite, except that the value is stored 
in a separate placeholder instead of being physically stored with each record 
(i.e., an actual rewrite).
   * It reserves all future calls of `setColumnDefault()` for future records 
(i.e., the write-side default value) and does not override the value dedicated 
to schema evolution. `setColumnDefault()` can be used to change the write-side 
default value without affecting the schema-evolution default value (see the 
sketch after this list).
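   To make this concrete, here is a minimal sketch of how snippet (1) might 
look against the Java API. The default-value argument to `addColumn()` and the 
`setColumnDefault()` method are assumptions taken from the snippets under 
discussion, not a committed Iceberg API:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.types.Types;

class DefaultValueSketch {
  static void evolveWithDefault(Catalog catalog) {
    Table table = catalog.loadTable(TableIdentifier.of("db", "t"));

    // Introduce the column and its schema-evolution default atomically.
    // Files written before this commit are read as if every row carried 34.
    table.updateSchema()
        .addColumn("points", Types.IntegerType.get(), 34) // hypothetical overload
        .commit();

    // Later calls change only the write-side default used for future records;
    // the schema-evolution default captured above stays fixed.
    table.updateSchema()
        .setColumnDefault("points", 35) // hypothetical method
        .commit();
  }
}
```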
   
   I also agree that (2) is confusing; it is the reason I initially thought 
two values could lead to such APIs.
   
   For (3), I think it is still subject to the rules in (1), in addition to 
the general rules around dropping and re-introducing a column. Another way to 
simulate (3)'s effect without dropping and recreating the column is to issue 
an `UPDATE` statement changing `34` to `35` (a sketch follows below). This has 
the side effect of not distinguishing organic `34` values from `34` values 
produced by schema evolution, but that is how an RDBMS would behave anyway. In 
fact, dropping and recreating a column while also changing default values 
opens many spec questions that are likely beyond the scope of this PR. (For 
example, if you drop an existing column, do you expect to wipe out all the 
values, including those on rows with non-default values, and start over when 
you re-introduce the column? If not, what happens if a rewrite occurs in 
between? Which rows does the new default value override? etc.) I also think 
this would get even more complex if we attempted to apply it to inner fields 
in order to update the default value of an inner field.
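   For reference, a minimal sketch of that `UPDATE` workaround through the 
Spark integration (assuming the Iceberg Spark SQL extensions are enabled; the 
table and column names are illustrative):

```java
import org.apache.spark.sql.SparkSession;

class BumpDefaultSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bump-schema-evolution-default")
        .getOrCreate();

    // Rewrites every row currently holding 34 to 35, whether the 34 was
    // written organically or materialized from the schema-evolution default;
    // as noted above, an RDBMS UPDATE cannot tell the two apart either.
    spark.sql("UPDATE db.t SET points = 35 WHERE points = 34");

    spark.stop();
  }
}
```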
   
   So overall, I think what we have in (1) looks good. To update the default 
value on past rows, the user can work within whatever room the rest of the 
spec allows (dropping the column, if they can bear the general consequences of 
dropping a column), or run an `UPDATE` if they are okay with also updating 
rows that coincidentally carry the default value (but RDBMSs would do the 
same). Finally, if really needed, I think it is okay to introduce (2) to avoid 
jumping through multiple hoops, but it should not be part of the spec. That 
can be done in the future as well, so we do not preemptively loosen the spec 
before users ask for it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


