Re: Should DDL operations always create new snapshots?

Ryan Blue Mon, 12 May 2025 16:06:47 -0700

Snapshots are created when data changes and there is no change to the data
tree at “time” v3. If you want to create new snapshots when the schema
changes it is alright to do it, but I don’t think that we need to require
it in the spec.


Also, it isn’t clear to me why the time travel query would resolve * as a
and b when time traveling. That error shows that there is an inconsistent
schema between time(v3) and the current schema (where b comes from). What
happens when you run `SELECT * FROM t FOR TIMESTAMP time(v2)`?

Ryan

On Fri, May 9, 2025 at 4:03 AM Vladimir Ozerov <voze...@querifylabs.com>
wrote:

> Hi,
>
> When working on a time-travel in Trino I noticed the behavior which seems
> counter-intuitive from the user perspective. Consider the following
> sequence of SQL queries executed from Trino (could be any other query
> engine). Here we create a table, populate it with some data, and then add
> another column. The third operation is implemented in Trino via
> Table.updateSchema().addColumn(...).commit().
>
> CREATE TABLE t (a);        // v1
> INSERT INTO t VALUES (1);  // v2
> ALTER TABLE t ADD COLUMN b // v3
> UPDATE t SET b = 2;        // v4
>
> The trick is that the third operation creates a schema, but *doesn't*
> create a snapshot. After that we endup with the following table metadata:
> {
>     snapshots: {
>         v1: { schemaId: 0 } // Initial create
>         v2: { schemaId: 0 } // Insert using the old schema
>         v4: { schemaId: 1 } // Update using the new schema
>     }
>     schemaIds: { 0, 1 }
>     currentSnapshot: v4
>     currentSchema: 1
> }
>
> What happens next in Trino:
>
> SELECT * FROM t;                  // Executed on { snapshot=v4, schemaId=1
> }, returns { a=1, b=2 } as expected
> SELECT * FROM t FOR TIMESTAMP v3; // Executed on { snapshot=v2, schemaId=0
> }, throws exception: column "b" is not found
>
> That is, we cannot query the table after the new column is added, but
> before it is updated. This happens because there is no dedicated snapshot
> for UpdateSchema action. To make things even more confusing, if we do not
> have the operation v4 then we will be able to query the v3 state, because
> engines will basically "reconstruct" it as a virtual snapshot using
> TableMetadata.currentSnapshot and TableMetadata.currentSchema:
>
> SELECT * FROM t;                  // Executed on { snapshot=v2, schemaId=1
> }, returns { a=1, b=null } as expected
> SELECT * FROM t FOR TIMESTAMP v3; // Executed on { snapshot=v2, schemaId=0
> }, throws exception: column "b" is not found
>
> My question is: shouldn't we always implicitly add new snapshots for all
> schema changes? Because otherwise we cannot read all existing table states
> which is confusing. Of course, this could be handled on the engine side,
> but this will lead to inconsistent behaviour between different engines
> depending on their internal implementation details.
>
> Regards,
> --
> *Vladimir Ozerov*
>

Re: Should DDL operations always create new snapshots?

Reply via email to