wmoustafa commented on pull request #2496: URL: https://github.com/apache/iceberg/pull/2496#issuecomment-1048352760
Regarding the default value semantics: I think this discussion can be simplified if we distinguish between two use cases of default values: (1) when they are used with an `INSERT INTO` that references a subset of columns, and (2) when they are used with schema evolution to compensate for missing columns in previously written data.

The two use cases operate on disjoint subsets of rows: `INSERT INTO` default values apply to rows where the column _exists_ in the schema, while schema evolution default values apply to rows where the column _does not exist_ in the schema. Therefore, we can discuss each separately. To simplify the discussion, let us start by treating schema evolution as the sole use case (we will relax this assumption later). In fact, unlike traditional database engines, which allow an `INSERT INTO` that references a subset of columns, compute engines in the data lake do not commonly support inserting rows with missing values: in most cases, Hive, Trino, and Spark will throw an error if the `INSERT INTO` statement does not specify values for all columns. Under this assumption (for now), a column's default value does not matter for subsequent inserts; it only matters for rows inserted before the column existed in the table.

So we can have the following events throughout the lifecycle of a table:

1. **Table creation:** the default value does not matter for any column defined as part of the table creation.
2. **Adding a column to an existing table and defining a default value for it at the same time:** the default value affects the rows that are present at the time of this event (i.e., when the column is added and the default value is introduced). When reading those rows, the default value is used. Let us call this set of rows R1.
3. **Altering the default value (to d2) of an existing column that has a default value (d1):** the same rows R1 that were assigned the default value d1 will now be assigned d2.
   Nothing else changes; rows added after R1 are not affected by d1 or d2.
4. **Rewrite:** the current default value is used when rewriting R1, and the default value is now materialized. Subsequent default value changes no longer affect R1 (or the subset of rewritten rows from R1). If all rows are rewritten, we are back to the "table creation" state.

So, putting the "`INSERT INTO` that references a subset of columns" use case aside, the semantics of default values are clear:

- It is safe to introduce them and to change them. The current default will always be used for the rows missing the respective column, until the first rewrite.

Now, to cover the `INSERT INTO` use case, in case it becomes a common one (e.g., Hive 3 is adding support, as Shenoda mentions above), the behavior can be to materialize the default values for the rows that are part of the `INSERT`, since the statement operates on _existing columns_. This still does not affect the semantics of the R1 subset, since R1 is the set of rows where the _column does not exist_.
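The lifecycle described above can be sketched in code. This is a minimal illustrative model, not the Iceberg API: the names `DataFile`, `resolve_value`, and `rewrite` are hypothetical, and the model assumes each data file remembers which column ids existed in the schema when it was written.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    """Hypothetical model: a data file records the column ids that existed
    in the schema at write time, plus its rows (column_id -> stored value)."""
    column_ids: set
    rows: list

def resolve_value(data_file, row, column_id, current_defaults):
    """Resolve a column value at read time.

    If the column existed when the file was written, the stored value is
    used and later default changes have no effect. If it did not exist
    (the column was added later via schema evolution), these are the R1
    rows, and the *current* default applies.
    """
    if column_id in data_file.column_ids:
        return row[column_id]           # materialized value wins
    return current_defaults[column_id]  # R1 rows: current default applies

def rewrite(data_file, all_column_ids, current_defaults):
    """Rewriting materializes the current defaults into the rows, so
    subsequent default changes no longer affect them (the table is back
    to the 'table creation' state for those rows)."""
    new_rows = [
        {cid: resolve_value(data_file, row, cid, current_defaults)
         for cid in all_column_ids}
        for row in data_file.rows
    ]
    return DataFile(column_ids=set(all_column_ids), rows=new_rows)
```

For example, a file written with only column 1 yields the current default for a later-added column 2 (d1, then d2 after an `ALTER`), but after a rewrite under d2, changing the default to d3 no longer affects those rows. An `INSERT INTO` referencing a subset of existing columns would take the same materialization path as `rewrite`, storing the default at write time.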
