emkornfield commented on code in PR #13936: URL: https://github.com/apache/iceberg/pull/13936#discussion_r2325742735
##########
format/spec.md:
##########
@@ -1861,6 +1861,16 @@ Java writes `-1` for "no current snapshot" with V1 and V2 tables and considers t
 Some implementations require that GZIP compressed files have the suffix `.gz.metadata.json` to be read correctly. The Java reference implementation can additionally read GZIP compressed files with the suffix `metadata.json.gz`.
 
+### Schema evolution and writing with old schemas
+
+Writers must write out all fields with the types specified from a schema present in table metadata. Writers should use the latest schema for writing. Not writing out all columns or not using the latest schema can change the semantics of the data written. The following are possible inconsistencies that can be introduced:
+
+* For all null columns, not writing out the column would cause `initial-default` value would be applied on reading instead of `null`.
+* If `write-default` has been changed then using an out-of-date schema would result in the incorrect value being populated.
+* If a `write` is the result of a partial row update (e.g. `update table set col_y = 'xyz'`) an out-of-date schema would silently drop values.

Review Comment:
   If an old schema is used, then you implicitly end up dropping columns because you can't read columns you don't know about. Thinking more about it, this should be unlikely to happen because you probably would have to replay the transaction anyway. But effectively the sequence would be:

   1. Writer A writes a new schema with added columns, along with new data for the added column.
   2. Writer B uses an old schema (this would have to happen strictly after step 1) and reads the new data, modifying an existing column.
   3. Writer B's update would drop the new data from the added column.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
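The three-step sequence in the review comment can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `rewrite_row`, the schema lists, and the column name `col_new` are hypothetical and are not part of the Iceberg API; rows are plain dictionaries and a "schema" is just a list of column names.

```python
# Hypothetical in-memory model of a copy-on-write row update; not the Iceberg API.

def rewrite_row(row, writer_schema, updates):
    """Read `row` through `writer_schema`, apply `updates`, and return the rewritten row."""
    # A writer can only read (and therefore rewrite) columns its schema knows about.
    projected = {col: row.get(col) for col in writer_schema}
    projected.update(updates)
    return projected

old_schema = ["id", "col_y"]             # schema Writer B still holds
new_schema = ["id", "col_y", "col_new"]  # schema after Writer A adds a column

# Step 1: Writer A evolves the schema and writes data for the added column.
row_after_a = {"id": 1, "col_y": "abc", "col_new": "added by A"}

# Step 2: Writer B, using the old schema, reads the row and modifies col_y.
row_after_b = rewrite_row(row_after_a, old_schema, {"col_y": "xyz"})

# Step 3: the rewritten row no longer carries the added column.
dropped = sorted(set(row_after_a) - set(row_after_b))

# For comparison: the same update made with the latest schema keeps col_new.
row_with_latest = rewrite_row(row_after_a, new_schema, {"col_y": "xyz"})

print("old schema:", row_after_b, "dropped:", dropped)
print("latest schema:", row_with_latest)
```

In this toy model the loss comes entirely from the projection step, which matches the point that a writer cannot preserve columns it does not know about.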
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org