yyanyy commented on pull request #2275:
URL: https://github.com/apache/iceberg/pull/2275#issuecomment-813861859
> @yyanyy thank you for responding to my question. But because I am slow, I
needed to do some testing to understand for myself the effects of #2096 and
this PR.
> I started with Iceberg 0.11.0 and created some tables, inserting data into
them, and altering the schema. Then I added the commit for #2096 and the
changes here on top (actually I added the commit for #2089 first, so that the
backports are clean). I made further updates to my Iceberg tables, adding data
and altering the schema.
> From what I see, when the table is updated and a new metadata.json file
for the table is written, the metadata.json file still has a `schema` field,
but the schema now has a `schema-id` field; it also has a `schemas` field
containing an array of schemas, and a `current-schema-id` field. The only
schema known at the time of the switchover is the current schema of the table,
so if the table change is just change to data, the existing schema is given a
`schema-id` of 0, and the schemas array contains a single schema; if the table
change is a schema change, then the old schema is given a `schema-id` of 0, the
new schema is given a `schema-id` of 1, and the `schemas` array contains two
schemas (and the `current-schema-id` field is 1). In the `snapshots` field, any
snapshot created before the switchover does not have a `schema-id`; snapshots
created after the switchover are given a `schema-id`, corresponding to the
schema current at the time the snapshot is written.
> When `Snapshot#schemaId()` is called, if the snapshot is written before
the switchover, null is returned, and if written after the switchover, a
non-null Integer is returned.
> It is as you wrote.
> With this change, I think it is straightforward to update my PR, #1508. I
just need to update the following method I'm adding to `BaseTable`
>
> ```
> public Schema schemaForSnapshot(long snapshotId) {
> TableMetadata current = ops.current();
> // First, check if the snapshot has a schema id associated with it
> Snapshot snapshot = current.snapshot(snapshotId);
> Integer schemaId = snapshot.schemaId();
> if (schemaId != null) {
> return current.schemasById().get(schemaId);
> }
> // Otherwise, read each of the previous metadata files until we find
one whose current
> // snapshot id is the snapshot id
> ...
> }
> ```
>
> I rebuilt Iceberg with my change on top of the previous changes, and I was
able to see that the correct schema is used when viewing any snapshot (time
travel).
> When this change is merged, I will update my PR. I hope that it can be
considered then.
Thank you for spending time verifying the changes, and described the steps
here!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]