yyanyy opened a new pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096


   This PR adds `current-schema-id` and `schemas` to table metadata. It also 
introduces a wrapper around `Schema` that associates a table schema with an ID. 
   The ID is not added directly to `Schema` because schema creation is currently 
widely used as a convenience for many actions that don't involve a "real" table 
schema. 
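
   The wrapper idea can be sketched roughly as follows. This is a minimal 
illustration, not the code in this PR: `PlainSchema` is a stand-in for 
Iceberg's `Schema`, and the name `SchemaWithId` and its accessors are 
hypothetical.

```java
import java.util.List;
import java.util.Objects;

// Stand-in for Iceberg's Schema (assumption): it only holds column names.
final class PlainSchema {
  private final List<String> columns;

  PlainSchema(List<String> columns) {
    this.columns = List.copyOf(columns);
  }

  List<String> columns() {
    return columns;
  }
}

// Hypothetical wrapper: pairs a schema with the ID that table metadata uses
// to track it, so the schema class itself stays free of any ID.
final class SchemaWithId {
  private final int schemaId;
  private final PlainSchema schema;

  SchemaWithId(int schemaId, PlainSchema schema) {
    this.schemaId = schemaId;
    this.schema = Objects.requireNonNull(schema, "schema");
  }

  int schemaId() {
    return schemaId;
  }

  PlainSchema schema() {
    return schema;
  }
}
```

   With this shape, ad hoc schemas (projections, test fixtures) can still be 
created without an ID, and only schemas stored in table metadata get wrapped.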
   
   Next steps:
   - add `schema-id` to snapshot log/history entries and populate it
   - use history entries and `schemas` to look up the right schema in time 
travel queries; this may require adding `schemas()` to the `Table` API
   - update the spec
   - add schema ID to the `historyTable` metadata table (see open question 1) 
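
   The time travel lookup in the second step could look roughly like the 
sketch below. It assumes each history entry carries a timestamp, a snapshot 
ID, and a schema ID, and that schemas are kept in a map keyed by ID. 
`HistoryEntry`, `SchemaLookup`, and `schemaAsOf` are hypothetical names, and 
schemas are plain strings here for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical history entry that also records the schema ID in effect.
final class HistoryEntry {
  final long timestampMillis;
  final long snapshotId;
  final int schemaId;

  HistoryEntry(long timestampMillis, long snapshotId, int schemaId) {
    this.timestampMillis = timestampMillis;
    this.snapshotId = snapshotId;
    this.schemaId = schemaId;
  }
}

final class SchemaLookup {
  private final List<HistoryEntry> history;      // assumed sorted by timestamp, ascending
  private final Map<Integer, String> schemasById;

  SchemaLookup(List<HistoryEntry> history, Map<Integer, String> schemasById) {
    this.history = new ArrayList<>(history);
    this.schemasById = new HashMap<>(schemasById);
  }

  // Returns the schema of the latest history entry at or before the timestamp.
  String schemaAsOf(long timestampMillis) {
    HistoryEntry match = null;
    for (HistoryEntry entry : history) {
      if (entry.timestampMillis <= timestampMillis) {
        match = entry;
      } else {
        break;
      }
    }
    if (match == null) {
      throw new IllegalArgumentException("No history entry at or before " + timestampMillis);
    }
    String schema = schemasById.get(match.schemaId);
    if (schema == null) {
      throw new IllegalStateException("Unknown schema ID " + match.schemaId);
    }
    return schema;
  }
}
```

   The `IllegalStateException` branch is where a dropped or renumbered schema 
would surface, which is the situation open question 1 describes.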
   
   Open questions:
   1. The current approach writes the newly introduced fields to JSON by 
default, even in v1, which raises a forward/backward compatibility concern: if 
a new writer writes a schema (with ID 0) and then updates it (with ID 1), the 
metadata will store both schema 0 and schema 1, and the current schema ID 
(`current-schema-id`) will be 1. If an old writer then reads and rewrites the 
metadata for any change, it drops schema 0 from the metadata. When a new writer 
picks up the metadata again, the original schema 0 is gone, and schema 1 is now 
assigned ID 0. This could result in schema IDs being inconsistent across 
writers.    
      - Since the ID is introduced in this PR, no metadata table exposes these 
inconsistent schema IDs yet, so this is not a problem for now. However, once we 
add schema ID to the `historyTable` metadata table, ID 0 could mean different 
things at different times in that table. We could potentially work around this 
by exposing the `schemaID` field in `historyTable` only for v2 tables, or by 
mentioning this caveat in the spec. 
      - Alternatively, we can expose these two fields only in v2 tables, and 
have time travel queries in v1 always rely on looking at old table metadata 
files, as implemented in #1508. This could mean that any future change that 
depends on schema ID cannot be introduced in v1. 
   2. Do we want to add a `last-assigned-schema-id` to table metadata? My 
answer would be yes, for a reason similar to the one mentioned in [this 
comment](https://github.com/apache/iceberg/pull/2089#issuecomment-761184851)
   3. I think that currently, when a table is replaced, earlier history 
entries/`snapshotLog` are reset to empty (second-to-last argument 
[here](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L712)).
 Is this expected? Do we want to fix this as a separate issue?
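
   The writer interleaving in open question 1 can be reproduced with a toy 
model. This is only a sketch under stated assumptions: `Meta` and `Writers` 
are hypothetical stand-ins rather than Iceberg classes, schemas are plain 
strings, an old writer is assumed to keep only the legacy single-schema field, 
and a new reader is assumed to assign ID 0 when `schemas` is absent.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for table metadata (not Iceberg's TableMetadata).
final class Meta {
  final String schema;                 // legacy single-schema field, always written
  final Map<Integer, String> schemas;  // new field; empty when an old writer wrote the file
  final int currentSchemaId;           // new field; -1 when absent

  Meta(String schema, Map<Integer, String> schemas, int currentSchemaId) {
    this.schema = schema;
    this.schemas = new LinkedHashMap<>(schemas);
    this.currentSchemaId = currentSchemaId;
  }
}

final class Writers {
  private Writers() {}

  // New writer creating a table: the first schema gets ID 0.
  static Meta newWriterCreate(String schema) {
    Map<Integer, String> schemas = new LinkedHashMap<>();
    schemas.put(0, schema);
    return new Meta(schema, schemas, 0);
  }

  // New reader: if the new fields are missing, the only known schema becomes ID 0.
  static Meta newWriterRead(Meta stored) {
    if (stored.schemas.isEmpty()) {
      return newWriterCreate(stored.schema);
    }
    return stored;
  }

  // New writer updating the schema: keeps old schemas and assigns the next ID.
  static Meta newWriterUpdate(Meta stored, String newSchema) {
    Meta base = newWriterRead(stored);
    Map<Integer, String> schemas = new LinkedHashMap<>(base.schemas);
    int nextId = Collections.max(schemas.keySet()) + 1;
    schemas.put(nextId, newSchema);
    return new Meta(newSchema, schemas, nextId);
  }

  // Old writer: rewrites metadata without the fields it does not know about.
  static Meta oldWriterRewrite(Meta stored) {
    return new Meta(stored.schema, new LinkedHashMap<>(), -1);
  }
}
```

   Running create, update, old-writer rewrite, then new-writer read leaves a 
single schema under ID 0 whose content is what used to be schema 1, so any 
previously recorded reference to ID 0 or ID 1 would no longer resolve to the 
same schema.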




