syun64 commented on issue #6978:
URL: https://github.com/apache/iceberg/issues/6978#issuecomment-1451946473
Hi @nastra, I really appreciate you taking the time to look at this PR.
I was a bit confused at first, and it took some time to identify the issue and
relate it to schema evolution, so here's an attempt at clearing up that
confusion for anyone else taking a look at this PR...
Firstly, my understanding is that a snapshot_id means the same thing for the
actual Iceberg table and for its metadata tables: both represent the state of
the table at the same point in time. Time travel on metadata tables based on
the snapshot_id is actually an advertised feature of Iceberg within its docs.
Your observation that the schema of the filesTable didn't evolve and that
that's the cause of the issue is absolutely correct. But I want to distinguish
between the schema_id of the actual Iceberg table, the schema_id of the
metadata table, and the snapshot_ids they share. Here's a table describing the
timeline of events exercised by my test:
| event                  | create table | add row (id, data) | add column, add row (id, data, data2) |
|------------------------|--------------|--------------------|---------------------------------------|
| snapshot id (random)   | 6234         | 9023               | 8234                                  |
| actual table schema id | 0            | 0                  | 1                                     |
| files table schema id  | 0            | 0                  | 0                                     |
The schema ID of the actual table does not change until there is a schema
evolution. This is consistent with @szehon-ho's observation that there is no
issue when you run time travel queries on the metadata table without any
schema evolution, even when you add an extra row and thereby create a new
snapshot ID.
However, once the schema of the actual table evolves, #1508's strong assumption
that only the schema ID of the actual table matters means we use that schema ID
to read the files metadata table, instead of the schema ID the files table had
at that snapshot. Interestingly, we are still able to query the files table for
the first two snapshots in this series of events (6234 and 9023) even after the
table has evolved, because within those snapshots the schema IDs of the actual
table and its metadata tables are consistent.
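
To make the timeline above concrete, here is a toy Python model of it. This is
only a sketch and not Iceberg code: `SnapshotState`,
`schema_id_used_for_files_read` and `mismatched_snapshots` are hypothetical
names made up for illustration, and the snapshot IDs are the made-up ones from
the table. It assumes, as described above, that the read path picks the actual
table's schema ID when reading the files table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotState:
    snapshot_id: int
    table_schema_id: int  # schema ID of the actual table at this snapshot
    files_schema_id: int  # schema ID of the files metadata table at this snapshot


# Timeline copied from the table above.
TIMELINE = [
    SnapshotState(6234, table_schema_id=0, files_schema_id=0),  # create table
    SnapshotState(9023, table_schema_id=0, files_schema_id=0),  # add row (id, data)
    SnapshotState(8234, table_schema_id=1, files_schema_id=0),  # add column, add row
]


def schema_id_used_for_files_read(state: SnapshotState) -> int:
    # Hypothetical stand-in for the assumption discussed above: only the
    # actual table's schema ID is consulted, even for a metadata table read.
    return state.table_schema_id


def mismatched_snapshots(timeline):
    # Snapshots where the schema ID chosen for the read differs from the
    # schema ID the files table really has at that snapshot.
    return [
        s.snapshot_id
        for s in timeline
        if schema_id_used_for_files_read(s) != s.files_schema_id
    ]


print(mismatched_snapshots(TIMELINE))  # [8234]
```

Only the snapshot taken after the schema evolution shows up as a mismatch,
which matches the behaviour described above: time travel to 6234 and 9023
keeps working, and only 8234 picks the wrong schema ID.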
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]