I think I'm -0 for a side-channel approach like this. It solves this
immediate problem, but it feels like a band-aid.

Coming back to Ryan's initial response:
> If you want to create new snapshots when the schema changes it is alright
to do it, but I don’t think that we need to require it in the spec.

One thing that's missing for this is that there's no operation to represent
a schema change in the snapshot summary. Maybe we can add a `schema-change`
operation for writers that want to do this?
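For illustration, such a writer-produced snapshot could look roughly like the sketch below. This is only a sketch: the `schema-change` operation name, the field names, and the helper function are hypothetical, not part of the Iceberg spec.

```python
# Sketch of a hypothetical "schema-change" snapshot: it reuses the parent's
# manifest list (the data tree is unchanged) and records only the new
# schema-id. All names here are illustrative, not from the Iceberg spec.
def schema_change_snapshot(parent, new_schema_id, ts_ms):
    return {
        "snapshot-id": parent["snapshot-id"] + 1,
        "parent-snapshot-id": parent["snapshot-id"],
        "timestamp-ms": ts_ms,
        "schema-id": new_schema_id,
        "manifest-list": parent["manifest-list"],  # no data change
        "summary": {"operation": "schema-change"},
    }

parent = {"snapshot-id": 2, "manifest-list": "s3://t/metadata/snap-2.avro"}
snap = schema_change_snapshot(parent, new_schema_id=1, ts_ms=1747400000000)
```

Because the snapshot points at the parent's manifest list, committing it would be a metadata-only operation.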

Brian

On Fri, May 16, 2025 at 4:37 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
wrote:

> Hi Vladimir,
>
> I understand the problem now. In your example:
>
> v1: CREATE TABLE t (a)
> {}
>
> v2: INSERT INTO t VALUES (1)
> {a=1}
>
> v3: ALTER TABLE t ADD COLUMN b
> {a=1, b=null}
>
> v4: UPDATE t SET b = 2
> {a=1, b=2}
>
> there are only 2 snapshots: at v2 and at v4. Let's call the snapshots s1
> (at v2) and s2 (at v4).
> Note that there is no snapshot at v1 either.
>
> At the state v3, the non-time-travel query, SELECT * FROM t, applies the
> current schema to the current snapshot s1.
> However, time-travel queries, whether using snapshot-id or timestamp,
> look for the snapshot satisfying the criterion (either the snapshot with
> that snapshot-id, or the snapshot as of that timestamp), and then use the
> *schema as of that snapshot*. There is no way to specify the schema as of
> a timestamp. I think your problem can be solved if we allow for that, *without
> requiring that schema changes produce new snapshots*.
>
> Thus an alternative proposal is:
> Associate a timestamp with each schema, in addition to the schema-id.
> For time travel by snapshot-id (or tag), there is no change in behavior.
> We find the snapshot, and use the schema for that snapshot.
> For time travel by timestamp, we find the schema as well as the snapshot
> as of the timestamp. Then we use that schema with the snapshot. Without a
> spec change to associate a timestamp with each schema, we can still find a
> schema as of the timestamp, by going through the metadata-log to find the
> metadata.json as of that timestamp and reading the metadata.json to find
> the current-schema-id. But this would be slow as it involves reading
> another metadata.json.
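The "schema as of a timestamp" lookup in this proposal could be sketched as follows, under the assumption that each schema carries a timestamp; the pair layout and function name are illustrative, not from the spec.

```python
import bisect

def schema_as_of(schemas, ts_ms):
    """Return the schema-id whose timestamp is the latest one <= ts_ms.

    `schemas` is a list of (timestamp-ms, schema-id) pairs sorted by
    timestamp. This mirrors how snapshot-as-of-timestamp lookup works,
    but applied to schemas; the representation is illustrative only.
    """
    idx = bisect.bisect_right([t for t, _ in schemas], ts_ms) - 1
    if idx < 0:
        raise ValueError("no schema as of that timestamp")
    return schemas[idx][1]

# Schema 0 created at t=100 (CREATE TABLE), schema 1 at t=300 (ADD COLUMN).
schemas = [(100, 0), (300, 1)]
```

With this, a query at a timestamp between the INSERT and the ADD COLUMN resolves to schema 0, while a query after the ADD COLUMN resolves to schema 1, independently of which snapshot is selected.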
>
> - Wing Yew
>
> On Tue, May 13, 2025 at 6:59 AM Vladimir Ozerov <voze...@querifylabs.com>
> wrote:
>
>> In my example, the expected results for "SELECT * FROM t" are:
>>
>> v1: CREATE TABLE t (a)
>> {}
>>
>> v2: INSERT INTO t VALUES (1)
>> {a=1}
>>
>> v3: ALTER TABLE t ADD COLUMN b
>> {a=1, b=null}
>>
>> v4: UPDATE t SET b = 2
>> {a=1, b=2}
>>
>> The problem is that the state v3 is reachable only if it is the last
>> operation on the table (i.e., no v4 yet). But after v4 is applied, v3 is
>> not reachable any more. This is counter-intuitive because users will face
>> scenarios like "I created this report with that SQL yesterday and would
>> like to re-execute it again on that same data ... Column not found error".
>>
>> Time travel in virtually all systems comes with a long list of
>> limitations, so it is expected that certain scenarios will not work.
>> However, the fewer counter-intuitive scenarios, the better. The problem
>> arises because on the one hand snapshots track only physical data
>> changes, but on the other hand these physical changes are also associated
>> with logical schemas. And from the user perspective, DDL operations can
>> change data even if they do not change physical files. ADD COLUMN is one
>> example, DROP COLUMN is another - all physical files are the same, but
>> the table state is different. A column data type change is yet another example.
>>
>> My question arises because the Iceberg design already properly ties the
>> logical schema to the physical layout; this is why a Snapshot contains a
>> schemaId. But for some reason, some schema changes that change data from
>> the user perspective do not produce new snapshots. It seems that this
>> could be fixed easily with a small change to the spec and the UpdateSchema
>> operation to force it to create a new snapshot (which is a lightweight
>> op), because the data is actually *changed* even though all physical files
>> stay the same, and any reasonable engine already reflects this as long as
>> it is the last operation on the table.
>>
>> Surely, this could be handled at the engine level as well. But engine
>> capabilities are rarely aligned.
>>
>>
>>
>> On Tue, May 13, 2025 at 2:30 AM Brian Hulette <bhule...@apache.org>
>> wrote:
>>
>>> I've also been a bit confused by this edge case. Is {a: 1, b: NULL} the
>>> correct result for querying at "time" v3?
>>>
>>> If so I agree it's a bit finicky for engines to produce that result. It
>>> can't be determined from the snapshot-log alone, IIUC they'll need to look
>>> at the metadata-log to find the active schema at that time.
>>>
>>> On Mon, May 12, 2025 at 3:42 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>>> Snapshots are created when data changes and there is no change to the
>>>> data tree at “time” v3. If you want to create new snapshots when the schema
>>>> changes it is alright to do it, but I don’t think that we need to require
>>>> it in the spec.
>>>>
>>>> Also, it isn’t clear to me why the time travel query would resolve *
>>>> as a and b when time traveling. That error shows that there is an
>>>> inconsistent schema between time(v3) and the current schema (where b comes
>>>> from). What happens when you run `SELECT * FROM t FOR TIMESTAMP
>>>> time(v2)`?
>>>>
>>>> Ryan
>>>>
>>>> On Fri, May 9, 2025 at 4:03 AM Vladimir Ozerov <voze...@querifylabs.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> When working on a time-travel in Trino I noticed the behavior which
>>>>> seems counter-intuitive from the user perspective. Consider the following
>>>>> sequence of SQL queries executed from Trino (could be any other query
>>>>> engine). Here we create a table, populate it with some data, and then add
>>>>> another column. The third operation is implemented in Trino via
>>>>> Table.updateSchema().addColumn(...).commit().
>>>>>
>>>>> CREATE TABLE t (a);        // v1
>>>>> INSERT INTO t VALUES (1);  // v2
>>>>> ALTER TABLE t ADD COLUMN b // v3
>>>>> UPDATE t SET b = 2;        // v4
>>>>>
>>>>> The trick is that the third operation creates a schema, but *doesn't*
>>>>> create a snapshot. After that we end up with the following table metadata:
>>>>> {
>>>>>     snapshots: {
>>>>>         v1: { schemaId: 0 } // Initial create
>>>>>         v2: { schemaId: 0 } // Insert using the old schema
>>>>>         v4: { schemaId: 1 } // Update using the new schema
>>>>>     }
>>>>>     schemaIds: { 0, 1 }
>>>>>     currentSnapshot: v4
>>>>>     currentSchema: 1
>>>>> }
>>>>>
>>>>> What happens next in Trino:
>>>>>
>>>>> SELECT * FROM t;                  // Executed on { snapshot=v4,
>>>>> schemaId=1 }, returns { a=1, b=2 } as expected
>>>>> SELECT * FROM t FOR TIMESTAMP v3; // Executed on { snapshot=v2,
>>>>> schemaId=0 }, throws exception: column "b" is not found
>>>>>
>>>>> That is, we cannot query the table after the new column is added, but
>>>>> before it is updated. This happens because there is no dedicated snapshot
>>>>> for UpdateSchema action. To make things even more confusing, if we do not
>>>>> have the operation v4 then we will be able to query the v3 state, because
>>>>> engines will basically "reconstruct" it as a virtual snapshot using
>>>>> TableMetadata.currentSnapshot and TableMetadata.currentSchema:
>>>>>
>>>>> SELECT * FROM t;                  // Executed on { snapshot=v2,
>>>>> schemaId=1 }, returns { a=1, b=null } as expected
>>>>> SELECT * FROM t FOR TIMESTAMP v3; // Executed on { snapshot=v2,
>>>>> schemaId=0 }, throws exception: column "b" is not found
>>>>>
>>>>> My question is: shouldn't we always implicitly add new snapshots for
>>>>> all schema changes? Otherwise we cannot read all existing table
>>>>> states, which is confusing. Of course, this could be handled on the
>>>>> engine side, but that would lead to inconsistent behaviour between
>>>>> different engines depending on their internal implementation details.
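The lookup behavior described above can be reproduced with a minimal simulation; the data structures below are illustrative stand-ins for the real Iceberg metadata model, not its actual representation.

```python
# Minimal simulation of timestamp-based time travel as described above.
# Structures are illustrative, not the real Iceberg metadata model.
snapshots = [  # (timestamp-ms, snapshot-id, schema-id)
    (200, "v2", 0),  # INSERT INTO t VALUES (1)
    (400, "v4", 1),  # UPDATE t SET b = 2
]
schemas = {0: ["a"], 1: ["a", "b"]}

def read_as_of(ts_ms, projected):
    """Pick the latest snapshot <= ts_ms and read with *its* schema."""
    candidates = [s for s in snapshots if s[0] <= ts_ms]
    if not candidates:
        raise ValueError("no snapshot as of that timestamp")
    _, snap_id, schema_id = candidates[-1]
    cols = schemas[schema_id]
    missing = [c for c in projected if c not in cols]
    if missing:
        raise ValueError(f"column {missing[0]!r} is not found")
    return snap_id, cols

# At time(v3)=300 the query lands on snapshot v2 with schema 0, so
# projecting column "b" fails even though "b" existed at state v3.
```

Here `read_as_of(300, ["a", "b"])` raises the "column not found" error from the example, while `read_as_of(500, ["a", "b"])` succeeds against v4.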
>>>>>
>>>>> Regards,
>>>>> --
>>>>> *Vladimir Ozerov*
>>>>>
>>>>
>>
>> --
>> *Vladimir Ozerov*
>> Founder
>> querifylabs.com
>>
>