rdblue commented on code in PR #14004:
URL: https://github.com/apache/iceberg/pull/14004#discussion_r2462226026


##########
format/spec.md:
##########
@@ -1875,6 +1875,25 @@ Some implementations require that GZIP compressed files 
have the suffix `.gz.met
 
 Although the spec allows for including the deleted row itself (in addition to 
the path and position of the row in the data file) in v2 position delete files, 
writing the row is optional and no implementation currently writes it. The 
ability to write and read the row is supported in the Java implementation but 
is deprecated in version 1.11.0.
 
+### Schema Evolution/Type Promotion
+
+Column projection rules are designed so that the table will remain readable 
even if writers use an outdated schema. At the beginning of a transaction, 
writers should load the latest schema (the schema referenced by 
`current-schema-id` from the latest table metadata) and use it for reading and 
writing data. Note that in the common cases of schema evolution (adding 
nullable columns, adding required columns with an `initial-default`, renaming a 
column, dropping a column, or doing type promotion), appending data with 
outdated schemas presents no issues under either SNAPSHOT or SERIALIZABLE 
isolation levels.
+
+However, the less common case of updating default values may need to be 
handled depending on isolation level. Consider two concurrent transactions:
+
+* **T1** modifies the `write-default` on the column.
+* **T2** writes data that uses the old `write-default` of the column changed 
by **T1**.
+
+If **T1** commits before **T2**, then handling of **T2** depends on the 
isolation level.
+
+* **SNAPSHOT**: **T2** may be committed even though it used the old 
`write-default` (this is a permitted serialization anomaly).
+* **SERIALIZABLE**: **T2** must abort.
+
+When a transaction is aborted, the transaction could be retried after updating 
to the new schema and rewriting the data using the new `write-default`. One way 
of ensuring SERIALIZABLE isolation is a two-phase approach when retrying a 
transaction that does an append to the table:

Review Comment:
   Transactions can always be retried from the beginning, so I don't think it 
is valuable to call that out here.
   
   This recommendation is okay, but is a bit too specific. These behaviors are 
recommendations. Engines choose their own definitions of snapshot/serializable 
(or others) and how to align those with default values is up to the engine's 
definition. Engines could say that snapshot isolation must adhere strictly to 
commit order for write defaults. And would the Iceberg community have an issue 
if engines ignored the implications of write defaults entirely?
   
   I think in order to use "snapshot" and "serializable" here, you'd need to 
define them and make sure this is a recommendation. That's why this is in a 
note and not a requirement.
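
To make the behavior under discussion concrete, here is a minimal sketch of the commit-time check the proposed text describes. It assumes an engine that defines SNAPSHOT and SERIALIZABLE the way the diff does. Every name in it (`Isolation`, `PendingAppend`, `may_commit`) is hypothetical and not part of any Iceberg API, and an engine with different isolation definitions could reasonably decide differently.

```python
from dataclasses import dataclass
from enum import Enum


class Isolation(Enum):
    # Hypothetical engine-level isolation levels, as used in the proposed text.
    SNAPSHOT = "snapshot"
    SERIALIZABLE = "serializable"


@dataclass(frozen=True)
class PendingAppend:
    # Hypothetical description of an append waiting to commit.
    schema_id: int            # schema the writer loaded at transaction start
    used_write_default: bool  # True if any written value came from a write-default


def may_commit(append: PendingAppend,
               current_schema_id: int,
               write_default_changed: bool,
               isolation: Isolation) -> bool:
    """Return True if the append may commit without being retried.

    Assumption: `write_default_changed` says whether a concurrent commit
    changed a write-default on a column this append filled in.
    """
    # Writer used the latest schema: nothing to reconcile.
    if append.schema_id == current_schema_id:
        return True
    # Common schema evolution (added, renamed, or dropped columns, type
    # promotion) does not invalidate data written with an outdated schema.
    if not (write_default_changed and append.used_write_default):
        return True
    # The append relied on a write-default that has since changed.
    # Under the proposed text, SNAPSHOT tolerates this and SERIALIZABLE does not.
    return isolation is Isolation.SNAPSHOT


t2 = PendingAppend(schema_id=1, used_write_default=True)
print(may_commit(t2, 2, True, Isolation.SNAPSHOT))
print(may_commit(t2, 2, True, Isolation.SERIALIZABLE))
```

The last branch is the only part that is engine policy. An engine that treats write defaults as strictly ordered under snapshot isolation, or one that ignores them entirely, would change only that line.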



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

