wmoustafa commented on a change in pull request #4301:
URL: https://github.com/apache/iceberg/pull/4301#discussion_r824410358
##########
File path: format/spec.md
##########
@@ -193,10 +193,38 @@ Notes:
For details on how to serialize a schema to JSON, see Appendix C.
+#### Default value
+Default values can be assigned to top-level columns or nested fields. Default
values are used during schema evolution when adding a new column. The default
value is used when reading rows from data files that were written before the
schema evolution and therefore lack the column or nested field.
Review comment:
OK, now that I see the code snippets, the idea is much clearer. I will
refer to the code snippets as (1), (2), and (3), respectively.
Code snippet (1) is really nice. I think it is closer to what I was
expressing in option (2) in my last comment. That is:
* It introduces the column and its schema-evolution default value
atomically. That would be equivalent to a rewrite, except that we store this
value in a separate placeholder instead of physically storing it with each
record (i.e., an actual rewrite).
* It reserves all future calls of `setColumnDefault()` for future records
(i.e., the write-side default value), and it does not override the value
dedicated to schema evolution. `setColumnDefault()` can be used to change the
write-side default value without affecting the schema-evolution default value
(see the sketch after this list).
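To make those two semantics concrete, here is a minimal sketch. This is a hypothetical API, not the actual Iceberg `UpdateSchema` interface: `setColumnDefault()` is the name from the snippets in this thread, while the `addColumn()` overload taking an initial default, the interface itself, and the column/type names are illustrative.

```java
// Hypothetical sketch only; not the actual Iceberg API.
interface UpdateSchemaSketch {
  // Adds the column and its schema-evolution default atomically, as in
  // snippet (1). Files written before this commit read back the default.
  UpdateSchemaSketch addColumn(String name, String type, Object initialDefault);

  // Changes only the write-side default for future records; the
  // schema-evolution default recorded by addColumn() is untouched.
  UpdateSchemaSketch setColumnDefault(String name, Object writeDefault);

  void commit();
}

class DefaultValueExample {
  static void evolve(UpdateSchemaSketch update) {
    // Rows in files written before this commit read back 34.
    update.addColumn("score", "int", 34).commit();
  }

  static void changeWriteDefault(UpdateSchemaSketch update) {
    // New records written without a value now get 35; files written before
    // the column existed still read back 34.
    update.setColumnDefault("score", 35).commit();
  }
}
```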
I also agree that (2) is confusing; it is the reason I initially thought
that having two values could lead to such APIs.
For (3), I think it is still subject to the rules in (1), in addition to the
general rules around dropping and re-introducing a column. Another way to
simulate (3)'s effect without dropping and recreating the column is to issue an
`UPDATE` statement changing `34` to `35`. This has the side effect of not
distinguishing between an organic `34` and a `34` that came from schema
evolution, but that is how an RDBMS would behave anyway. In fact, dropping and
recreating the column, along with changing default values, opens many spec
questions which might be beyond the scope of this PR. (For example, if you drop
an existing column, do you expect to wipe out all the values (including those
on rows with non-default values) and start over if you re-introduce the column?
If not, what if a rewrite happens in between? Which rows does the new default
value override? etc.) Also, I think this would get even more complex if we
attempted to apply it to inner fields in order to update the default value of
an inner field.
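For reference, the `UPDATE` workaround could look like the following sketch, assuming Spark with Iceberg's row-level `UPDATE` support; the table and column names (`db.tbl`, `score`) are illustrative.

```java
import org.apache.spark.sql.SparkSession;

class UpdateWorkaroundSketch {
  static void bumpPastDefault(SparkSession spark) {
    // Rewrites every row currently carrying 34, including rows where 34 was
    // written organically, not only rows that read 34 via the default.
    spark.sql("UPDATE db.tbl SET score = 35 WHERE score = 34");
  }
}
```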
So overall, I think what we have in (1) looks good. To update the default
value for past rows, the user can work within whatever room the rest of the
spec allows them: dropping the column if they can bear the general
column-dropping consequences, or running an `UPDATE` if they are okay with
updating rows that coincidentally carry the default value (but RDBMSs would do
the same). Finally, if really needed, I think it is okay to introduce (2) to
avoid jumping through multiple hoops, but it should not be part of the spec.
That can be done in the future as well, so we do not preemptively loosen the
spec before users ask for it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]