wypoon commented on pull request #2275:
URL: https://github.com/apache/iceberg/pull/2275#issuecomment-812938555


   @yyanyy thank you for responding to my question. But because I am slow, I 
needed to do some testing to understand for myself the effects of 
https://github.com/apache/iceberg/pull/2096 and this PR.
   I started with Iceberg 0.11.0 and created some tables, inserting data into 
them, and altering the schema. Then I added the commit for 
https://github.com/apache/iceberg/pull/2096 and the changes here on top 
(actually I added the commit for https://github.com/apache/iceberg/pull/2089 
first, so that the backports are clean). I made further updates to my Iceberg 
tables, adding data and altering the schema.
   From what I see, when the table is updated and a new metadata.json file for 
the table is written, the metadata.json file still has a `schema` field, but 
the schema now has a `schema-id` field; it also has a `schemas` field 
containing an array of schemas, and a `current-schema-id` field. The only 
schema known at the time of the switchover is the current schema of the table, 
so if the table change is just change to data, the existing schema is given a 
`schema-id` of 0, and the schemas array contains a single schema; if the table 
change is a schema change, then the old schema is given a `schema-id` of 0, the 
new schema is given a `schema-id` of 1, and the `schemas` array contains two 
schemas (and the `current-schema-id` field is 1). In the `snapshots` field, any 
snapshot created before the switchover does not have a `schema-id`; snapshots 
created after the switchover are given a `schema-id`, corresponding to the 
schema current at the time the snapshot is written.
   When `Snapshot#schemaId()` is called, if the snapshot is written before the 
switchover, null is returned, and if written after the switchover, a non-null 
Integer is returned.
   It is as you wrote.
   With this change, I think it is straightforward to update my PR, 
https://github.com/apache/iceberg/pull/1508. I just need to update the 
following method I'm adding to `BaseTable`
   ```
     public Schema schemaForSnapshot(long snapshotId) {
       TableMetadata current = ops.current();
       // First, check if the snapshot has a schema id associated with it
       Snapshot snapshot = current.snapshot(snapshotId);
       Integer schemaId = snapshot.schemaId();
       if (schemaId != null) {
         return current.schemasById().get(schemaId);
       }
       // Otherwise, read each of the previous metadata files until we find one 
whose current
       // snapshot id is the snapshot id
       ...
     }
   ```
   I rebuilt Iceberg with my change on top of the previous changes, and I was 
able to see that the correct schema is used when viewing any snapshot (time 
travel).
   When this change is merged, I will update my PR. I hope that it can be 
considered then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to