wmoustafa commented on pull request #2496:
URL: https://github.com/apache/iceberg/pull/2496#issuecomment-1048352760


   Regarding the default value semantics:
   
   I think this discussion can be simplified if we distinguish between two use 
cases of default values: (1) when they are used with an `INSERT INTO` that 
references a subset of columns, and (2) when they are used with schema 
evolution to compensate for missing columns in previous data. We observe that 
the two use cases operate on disjoint subsets of rows. `INSERT INTO` default 
values operate on rows where the column _exists_ in the schema, and schema 
evolution default values operate on rows where the column _does not exist_ in 
the schema. Therefore, we can discuss each separately.
   
   To start, and to simplify the discussion, we can treat the schema evolution use case as the only one (we will relax this assumption later). In fact, unlike traditional database engines, which allow an `INSERT INTO` that references a subset of columns, compute engines in the data lake do not commonly support inserting rows with missing values. In most cases, Hive, Trino, and Spark will throw an error if the `INSERT INTO` statement does not specify values for all columns. If we adopt this assumption (for now), then a column's default value does not matter for subsequent inserts; it only matters for rows inserted before the column existed in the table. So we can have the following events throughout the lifecycle of a table:
   
   1. **Table creation:** default value does not matter for any column defined as part of the table creation.
   2. **Adding a column to an existing table and defining a default value for 
it at the same time:** Default value affects the rows that are present at the 
time of this event (i.e., adding the column and introducing the default value). 
When reading those rows, the default value will be used. Let us call this set 
of rows R1.
   3. **Altering the default value (to d2) of an existing column that has a default value (d1):** The same rows R1 that were assigned the default value d1 will now be assigned d2. Nothing else changes; rows added after R1 are affected by neither d1 nor d2.
   4. **Rewrite:** the current default value is used when rewriting R1, and the default value is now materialized. Subsequent default value changes no longer affect R1 (or whichever subset of R1 was rewritten). If all rows are rewritten, we are back to the “table creation” state.
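   The lifecycle above can be sketched with a toy model (not Iceberg code; the class and method names are hypothetical, chosen only for illustration). Rows that predate a column are patched with the *current* default at read time, an altered default retroactively changes what R1 reads, and a rewrite materializes the value so later default changes no longer affect R1:

   ```python
   # Minimal sketch of read-time default resolution under schema evolution.
   # All names here are illustrative, not Iceberg APIs.
   class Table:
       def __init__(self):
           self.rows = []          # each row is a dict of materialized values
           self.defaults = {}      # column name -> current default value

       def insert(self, row):
           self.rows.append(dict(row))

       def add_column(self, name, default):
           # Step 2: existing rows (R1) are untouched; only metadata changes.
           self.defaults[name] = default

       def alter_default(self, name, new_default):
           # Step 3: only the metadata default changes; R1 now reads new_default.
           self.defaults[name] = new_default

       def rewrite(self):
           # Step 4: materialize the current default into every stored row.
           for row in self.rows:
               for col, d in self.defaults.items():
                   row.setdefault(col, d)

       def read(self):
           # Read-time patching: missing columns resolve to the current default.
           return [{**self.defaults, **row} for row in self.rows]

   t = Table()
   t.insert({"id": 1})                  # R1: written before "status" existed
   t.add_column("status", "d1")
   assert t.read() == [{"id": 1, "status": "d1"}]

   t.alter_default("status", "d2")      # R1 now reads d2; nothing is rewritten
   assert t.read() == [{"id": 1, "status": "d2"}]

   t.rewrite()                          # d2 is materialized into R1
   t.alter_default("status", "d3")      # later changes no longer affect R1
   assert t.read() == [{"id": 1, "status": "d2"}]
   ```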
   
   
   So, putting the “`INSERT INTO` that references a subset of columns” use case aside, the semantics of default values are clear:
   
   - It is safe to introduce them and change them. The current default will 
always be used for the rows missing the respective column, until the first 
rewrite.
   
   Now, to cover the `INSERT INTO` use case, in case it becomes a common one (e.g., Hive 3 is adding support, as Shenoda mentions above), the behavior in this case can be to materialize the default values for the rows that are part of this `INSERT`, since the statement operates on _existing columns_. This still does not affect the semantics of the R1 subset, since that is the subset of rows where the _column does not exist_.
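   A small sketch of this write-time behavior (again hypothetical code, not an Iceberg API): when an `INSERT` omits a column that exists in the schema, the default is baked into the stored row at write time, so later default changes do not touch it — in contrast to R1, which keeps resolving the column at read time:

   ```python
   # Write-time materialization for columns that exist in the schema but
   # are omitted by the INSERT statement. Names are illustrative only.
   def insert_with_defaults(row, schema_defaults):
       # The stored row gets a concrete value for every omitted column.
       return {**schema_defaults, **row}

   defaults = {"status": "d1"}
   stored = insert_with_defaults({"id": 2}, defaults)
   assert stored == {"id": 2, "status": "d1"}

   # Changing the default later does not touch this row: "d1" is materialized.
   defaults["status"] = "d2"
   assert stored["status"] == "d1"
   ```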
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


