I think I'm -0 for a side-channel approach like this. It solves this immediate problem, but it feels like a band-aid.
Coming back to Ryan's initial response:

> If you want to create new snapshots when the schema changes it is alright to do it, but I don't think that we need to require it in the spec.

One thing that's missing for this is that there's no operation to represent a schema change in the snapshot summary. Maybe we can add a `schema-change` operation for writers that want to do this?

Brian

On Fri, May 16, 2025 at 4:37 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:

> Hi Vladimir,
>
> I understand the problem now. In your example:
>
> v1: CREATE TABLE t (a)
> {}
>
> v2: INSERT INTO t VALUES (1)
> {a=1}
>
> v3: ALTER TABLE t ADD COLUMN b
> {a=1, b=null}
>
> v4: UPDATE t SET b = 2
> {a=1, b=2}
>
> there are only 2 snapshots: at v2 and at v4. Let's call the snapshots s1 (at v2) and s2 (at v4). Note that there is no snapshot at v1 either.
>
> At state v3, the non-time-travel query SELECT * FROM t applies the current schema to the current snapshot s1. However, time-travel queries, whether using snapshot-id or timestamp, look for the snapshot satisfying the criterion (either the snapshot with that snapshot-id, or the snapshot as of that timestamp), and then use the *schema as of that snapshot*. There is no way to specify the schema as of a timestamp. I think your problem can be solved if we allow for that, *without requiring that schema changes produce new snapshots*.
>
> Thus an alternative proposal is: associate a timestamp with each schema, in addition to the schema-id. For time travel by snapshot-id (or tag), there is no change in behavior: we find the snapshot and use the schema for that snapshot. For time travel by timestamp, we find the schema as well as the snapshot as of the timestamp, and then use that schema with the snapshot. Even without a spec change to associate a timestamp with each schema, we can still find the schema as of a timestamp by going through the metadata-log to find the metadata.json as of that timestamp and reading it to find the current-schema-id. But this would be slow, as it involves reading another metadata.json.
>
> - Wing Yew
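A minimal sketch of that metadata-log fallback against the Iceberg Java API might look like the following (the helper name schemaAsOfTime is invented, and it assumes previousFiles() lists metadata-log entries oldest-first; the TableMetadataParser.read call is the extra metadata.json read Wing Yew warns about):

    import org.apache.iceberg.BaseTable;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableMetadata;
    import org.apache.iceberg.TableMetadataParser;

    static Schema schemaAsOfTime(Table table, long timestampMillis) {
      TableMetadata current = ((BaseTable) table).operations().current();
      if (timestampMillis >= current.lastUpdatedMillis()) {
        // The current metadata was already in effect at that time.
        return current.schema();
      }
      // Find the latest metadata.json written at or before the timestamp.
      String location = null;
      for (TableMetadata.MetadataLogEntry entry : current.previousFiles()) {
        if (entry.timestampMillis() <= timestampMillis) {
          location = entry.file();
        }
      }
      if (location == null) {
        throw new IllegalArgumentException("no metadata as of " + timestampMillis);
      }
      // The slow part: a second metadata.json read just to recover the schema.
      TableMetadata past = TableMetadataParser.read(table.io(), location);
      return past.schema();
    }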
> On Tue, May 13, 2025 at 6:59 AM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>
>> In my example, the expected results for "SELECT * FROM t" are:
>>
>> v1: CREATE TABLE t (a)
>> {}
>>
>> v2: INSERT INTO t VALUES (1)
>> {a=1}
>>
>> v3: ALTER TABLE t ADD COLUMN b
>> {a=1, b=null}
>>
>> v4: UPDATE t SET b = 2
>> {a=1, b=2}
>>
>> The problem is that state v3 is reachable only if it is the last operation on the table (i.e., there is no v4 yet). But after v4 is applied, v3 is not reachable any more. This is counter-intuitive, because users will face scenarios like "I created this report with that SQL yesterday and would like to re-execute it on that same data ... column not found error".
>>
>> Time travel in virtually all systems comes with a long list of limitations, so it is expected that certain scenarios will not work. However, the fewer counter-intuitive scenarios, the better. The problem arises because, on the one hand, snapshots track only physical data changes, while on the other hand these physical changes are also associated with logical schemas. And from the user perspective, DDL operations can change data even when they do not change physical files. ADD COLUMN is one example; DROP COLUMN is another - all physical files are the same, but the table state is different. A column data type change is yet another example.
>>
>> My question arises because Iceberg's design already properly ties the logical schema to the physical layout - this is why a Snapshot contains a schemaId. But for some reason, certain schema changes that change data from the user perspective do not produce new snapshots. It seems this could be fixed with a small change to the spec and to the UpdateSchema operation, forcing it to create a new snapshot (a lightweight operation), because the data actually *changes* even though all physical files stay the same - and any reasonable engine already reflects this as long as the schema change is the last operation on the table.
>>
>> Surely, this could be handled at the engine level as well. But engine capabilities are rarely aligned.
>>
>> On Tue, May 13, 2025 at 2:30 AM Brian Hulette <bhule...@apache.org> wrote:
>>
>>> I've also been a bit confused by this edge case. Is {a: 1, b: NULL} the correct result for querying at "time" v3?
>>>
>>> If so, I agree it's a bit finicky for engines to produce that result. It can't be determined from the snapshot-log alone; IIUC, they'll need to look at the metadata-log to find the active schema at that time.
>>>
>>> On Mon, May 12, 2025 at 3:42 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>>> Snapshots are created when data changes, and there is no change to the data tree at "time" v3. If you want to create new snapshots when the schema changes it is alright to do it, but I don't think that we need to require it in the spec.
>>>>
>>>> Also, it isn't clear to me why the time travel query would resolve * as a and b when time traveling. That error shows that there is an inconsistent schema between time(v3) and the current schema (where b comes from). What happens when you run `SELECT * FROM t FOR TIMESTAMP time(v2)`?
>>>>
>>>> Ryan
>>>>
>>>> On Fri, May 9, 2025 at 4:03 AM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> While working on time travel in Trino, I noticed behavior that seems counter-intuitive from the user perspective. Consider the following sequence of SQL queries executed from Trino (it could be any other query engine). Here we create a table, populate it with some data, and then add another column. The third operation is implemented in Trino via Table.updateSchema().addColumn(...).commit().
>>>>>
>>>>> CREATE TABLE t (a);            // v1
>>>>> INSERT INTO t VALUES (1);      // v2
>>>>> ALTER TABLE t ADD COLUMN b;    // v3
>>>>> UPDATE t SET b = 2;            // v4
>>>>>
>>>>> The trick is that the third operation creates a schema but *doesn't* create a snapshot. After that, we end up with the following table metadata:
>>>>>
>>>>> {
>>>>>   snapshots: {
>>>>>     v1: { schemaId: 0 }  // initial create
>>>>>     v2: { schemaId: 0 }  // insert using the old schema
>>>>>     v4: { schemaId: 1 }  // update using the new schema
>>>>>   }
>>>>>   schemaIds: { 0, 1 }
>>>>>   currentSnapshot: v4
>>>>>   currentSchema: 1
>>>>> }
>>>>>
>>>>> What happens next in Trino:
>>>>>
>>>>> SELECT * FROM t;                  // executed on { snapshot=v4, schemaId=1 }, returns { a=1, b=2 } as expected
>>>>> SELECT * FROM t FOR TIMESTAMP v3; // executed on { snapshot=v2, schemaId=0 }, throws exception: column "b" is not found
>>>>>
>>>>> That is, we cannot query the table after the new column is added but before it is updated. This happens because there is no dedicated snapshot for the UpdateSchema action.
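The resolution path described above corresponds roughly to the following sketch against the Iceberg Java API (the helper name schemaForTimestamp is invented; real engines differ in the details):

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.util.SnapshotUtil;

    // Timestamp-based time travel as engines do it today: pick the snapshot
    // as of the timestamp, then use the schema recorded on that snapshot.
    static Schema schemaForTimestamp(Table table, long timestampMillis) {
      long snapshotId = SnapshotUtil.snapshotIdAsOfTime(table, timestampMillis);
      Snapshot snapshot = table.snapshot(snapshotId);
      // At time(v3) this resolves to the v2 snapshot, whose schemaId is still 0,
      // so the projected schema has no column b and SELECT * fails.
      // (schemaId() can be null for tables written by older writers; ignored here.)
      return table.schemas().get(snapshot.schemaId());
    }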
>>>>> To make things even more confusing, if we do not have operation v4, then we are able to query the v3 state, because engines will basically "reconstruct" it as a virtual snapshot using TableMetadata.currentSnapshot and TableMetadata.currentSchema:
>>>>>
>>>>> SELECT * FROM t;                  // executed on { snapshot=v2, schemaId=1 }, returns { a=1, b=null } as expected
>>>>> SELECT * FROM t FOR TIMESTAMP v3; // executed on { snapshot=v2, schemaId=0 }, throws exception: column "b" is not found
>>>>>
>>>>> My question is: shouldn't we always implicitly add new snapshots for all schema changes? Otherwise we cannot read all existing table states, which is confusing. Of course, this could be handled on the engine side, but that would lead to inconsistent behavior between different engines, depending on their internal implementation details.
>>>>>
>>>>> Regards,
>>>>> --
>>>>> *Vladimir Ozerov*
>>
>> --
>> *Vladimir Ozerov*
>> Founder
>> querifylabs.com
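For reference, a writer could already approximate "a snapshot per schema change" without any spec change by committing an empty append right after the schema update. This is a hypothetical workaround, not something Trino does today, and the resulting snapshot summary would record the operation as "append" rather than anything schema-specific - exactly the gap the `schema-change` summary operation proposed above would close:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.types.Types;

    // Hypothetical workaround: pin the new schema to its own snapshot.
    void addColumnWithSnapshot(Table table) {
      table.updateSchema()
          .addColumn("b", Types.LongType.get())
          .commit();               // registers schema 1, but creates no snapshot
      table.newAppend().commit();  // empty append: a snapshot carrying schemaId=1
    }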