Re: [DISCUSS] row timestamp proposal

Steven Wu Tue, 12 May 2026 13:35:40 -0700

> One thing that I think has changed since the initial proposal is column-level
> updates. Have we considered the interaction between these two features?


Micah, that is a great callout. I think *_last_updated_sequence_number*
and *_last_updated_timestamp* should behave the same way with respect to column
updates: both should reflect the latest snapshot that wrote either the base
file or the newest column file.
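A minimal sketch of that rule, assuming the effective value for a row comes from whichever file (the base data file or one of its column-update files) was added by the newest snapshot. `FileRef` and `ColumnUpdateLineage` are hypothetical names for illustration, not Iceberg APIs.

```java
import java.util.List;

// Hypothetical model of a base data file or a column-update file, tagged with
// the sequence number and timestamp of the snapshot that added it.
record FileRef(String path, long sequenceNumber, long snapshotTimestampMillis) {}

final class ColumnUpdateLineage {
  // Returns the file added by the newest snapshot among the base file and its
  // column-update files. Both _last_updated_sequence_number and
  // _last_updated_timestamp would be inherited from this file.
  static FileRef newest(FileRef base, List<FileRef> columnFiles) {
    FileRef latest = base;
    for (FileRef file : columnFiles) {
      if (file.sequenceNumber() > latest.sequenceNumber()) {
        latest = file;
      }
    }
    return latest;
  }

  public static void main(String[] args) {
    FileRef base = new FileRef("base.parquet", 10, 1_000L);
    FileRef colA = new FileRef("col-a.parquet", 12, 3_000L);
    FileRef colB = new FileRef("col-b.parquet", 11, 2_000L);
    FileRef n = newest(base, List.of(colA, colB));
    System.out.println(n.sequenceNumber() + " " + n.snapshotTimestampMillis()); // 12 3000
  }
}
```

Ordering by sequence number rather than by timestamp is a choice made for the sketch; with monotonic snapshot timestamps the two orders agree.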

On Tue, May 12, 2026 at 1:22 PM Micah Kornfield <[email protected]>
wrote:

>> #2 is more involved and should probably be done after the v4 metadata tree
>> (spec <https://github.com/apache/iceberg/pull/16025> and impl
>> <https://github.com/orgs/apache/projects/605/views/1>) is mostly
>> complete, as we want to plumb inheritance through only for the v4 tables.
>
>
> One thing that I think has changed since the initial proposal is column-level
> updates. Have we considered the interaction between these two features?
>
> Thanks,
> Micah
>
> On Mon, May 11, 2026 at 6:30 PM Steven Wu <[email protected]> wrote:
>
>> Circling back on this topic now that we have consensus on the direction. It
>> essentially has two parts:
>>
>>    1. monotonic snapshot timestamp for v4 tables
>>    2. row timestamp inherited from snapshot timestamp for v4 tables
>>
>>
>> #1 is an isolated and small change. So I created the following PRs:
>> * spec: https://github.com/apache/iceberg/pull/16294
>> * impl: https://github.com/apache/iceberg/pull/16293
>>
>> #2 is more involved and should probably be done after the v4 metadata
>> tree (spec <https://github.com/apache/iceberg/pull/16025> and impl
>> <https://github.com/orgs/apache/projects/605/views/1>) is mostly
>> complete, as we want to plumb inheritance through only for the v4 tables.
>>
>>
>>
>> On Mon, Jan 26, 2026 at 10:05 AM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> Sounds good to me
>>>
>>> On Mon, Jan 26, 2026 at 11:59 AM Anton Okolnychyi <[email protected]>
>>> wrote:
>>>
>>>> Cool, sounds like a plan then? Thanks for answering all the questions,
>>>> Steven!
>>>>
>>>> On Thu, Jan 22, 2026 at 6:29 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> For row timestamp inheritance to work, I would need to implement the
>>>>> plumbing. So I would expect existing rows to have null values, because
>>>>> the inheritance plumbing did not exist when they were written. This would
>>>>> be consistent with the upgrade behavior for V3 row lineage:
>>>>> https://iceberg.apache.org/spec/#row-lineage-for-upgraded-tables.
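The read-side rule described above can be sketched as follows, assuming a row uses its persisted value when one exists and otherwise inherits the timestamp of the snapshot that added its file; files written before the V4 upgrade carry no such timestamp, so their rows read as null. `DataFileInfo` and `RowTimestampInheritance` are hypothetical.

```java
import java.util.OptionalLong;

// Hypothetical file-level metadata. addedSnapshotTimestamp is empty for files
// that were written before the table was upgraded to V4.
record DataFileInfo(String path, OptionalLong addedSnapshotTimestamp) {}

final class RowTimestampInheritance {
  // persisted is the value materialized in the data file (for example after
  // compaction), or null when the row should inherit from file metadata.
  static Long rowTimestamp(Long persisted, DataFileInfo file) {
    if (persisted != null) {
      return persisted;
    }
    if (file.addedSnapshotTimestamp().isPresent()) {
      return file.addedSnapshotTimestamp().getAsLong();
    }
    return null; // pre-upgrade file: nothing to inherit from
  }

  public static void main(String[] args) {
    DataFileInfo legacy = new DataFileInfo("old.parquet", OptionalLong.empty());
    DataFileInfo v4 = new DataFileInfo("new.parquet", OptionalLong.of(5_000L));
    System.out.println(rowTimestamp(null, legacy)); // null
    System.out.println(rowTimestamp(null, v4));     // 5000
    System.out.println(rowTimestamp(4_000L, v4));   // 4000
  }
}
```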
>>>>>
>>>>> On Thu, Jan 22, 2026 at 4:09 PM Anton Okolnychyi <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Also, do we have a concrete plan for how to handle tables that would
>>>>>> be upgraded to V4? What timestamp will we assign to existing rows?
>>>>>>
>>>>>> On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> If we ignore temporal queries that need strict snapshot boundaries
>>>>>>> and can't be solved completely using row timestamps in case of
>>>>>>> mutations, you mentioned other use cases where row timestamps may be
>>>>>>> helpful, like TTL and auditing. We can debate whether using
>>>>>>> CURRENT_TIMESTAMP() is enough for them, but I don't really see a point,
>>>>>>> given that we already have row lineage in V3 and the storage overhead
>>>>>>> for one more field isn't likely to be noticeable. One of the problems
>>>>>>> with CURRENT_TIMESTAMP() is the required action by the user. Having a
>>>>>>> reliable row timestamp populated automatically is likely to be better,
>>>>>>> so +1.
>>>>>>>
>>>>>>> On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:
>>>>>>>
>>>>>>>> Joining with snapshot history also has significant complexity. It
>>>>>>>> requires retaining the entire snapshot history, probably with trimmed
>>>>>>>> snapshot metadata. There are concerns about the size of the snapshot
>>>>>>>> history for tables with frequent commits (like streaming ingestion).
>>>>>>>> Do we maintain the unbounded trimmed snapshot history in the same
>>>>>>>> table metadata, which could affect the metadata.json size? Or do we
>>>>>>>> store it separately somewhere (like in the catalog), which would
>>>>>>>> require the complexity of multi-entity transactions in the catalog?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I've gone back and forth on the inherited columns. I think the
>>>>>>>>> thing which keeps coming back to me is that I don't
>>>>>>>>> like that the only way to determine the timestamp associated with
>>>>>>>>> a row update/creation is to do a join back
>>>>>>>>> against table metadata. While that's doable, it feels
>>>>>>>>> user-unfriendly.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Anton, you are right that row-level deletes will be a problem
>>>>>>>>>> for some of the mentioned use cases (like incremental processing).
>>>>>>>>>> I have clarified the applicability of some use cases to "tables
>>>>>>>>>> with inserts and updates only".
>>>>>>>>>>
>>>>>>>>>> Right now, we are only tracking modification/commit time (not
>>>>>>>>>> insertion time) in case of updates.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think there is clear consensus that making snapshot timestamps
>>>>>>>>>>> strictly increasing is a positive thing. I am also +1.
>>>>>>>>>>>
>>>>>>>>>>> - How will row timestamps allow us to reliably implement
>>>>>>>>>>> incremental consumption independent of the snapshot retention,
>>>>>>>>>>> given that rows can be added AND removed in a particular time
>>>>>>>>>>> frame? How can we capture all changes by just looking at the
>>>>>>>>>>> latest snapshot?
>>>>>>>>>>> - Some use cases in the doc need the insertion time and some
>>>>>>>>>>> need the last modification time. Do we plan to support both?
>>>>>>>>>>> - What do we expect the behavior to be in UPDATE and MERGE
>>>>>>>>>>> operations?
>>>>>>>>>>>
>>>>>>>>>>> To be clear: I am not opposed to this change, I just want to make
>>>>>>>>>>> sure I understand all the use cases that we aim to address and
>>>>>>>>>>> what would be required in engines.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 for improving how commit timestamps are assigned so that they
>>>>>>>>>>>> are monotonic, since this requirement has come up in multiple
>>>>>>>>>>>> discussions, like notifications, multi-table transactions, time
>>>>>>>>>>>> travel accuracy and row timestamps. It would be good to have a
>>>>>>>>>>>> single consistent way to represent and assign timestamps that
>>>>>>>>>>>> could be leveraged across multiple features.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, to add my perspective on that discussion, I think my
>>>>>>>>>>>>> primary concern is that people expect timestamps to be monotonic,
>>>>>>>>>>>>> and if they aren't, then a `_last_update_timestamp` field just
>>>>>>>>>>>>> makes the problem worse. But it is _nice_ to have row-level
>>>>>>>>>>>>> timestamps. So I would be okay if we revisit how we assign commit
>>>>>>>>>>>>> timestamps and improve it so that you get monotonic behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We had an offline discussion with Ryan. I revised the
>>>>>>>>>>>>>> proposal as follows.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. V4 would require writers to generate *monotonic* snapshot
>>>>>>>>>>>>>> timestamps. The proposal doc has a section that describes a
>>>>>>>>>>>>>> recommended implementation using Lamport timestamps.
>>>>>>>>>>>>>> 2. Expose a *last_update_timestamp* metadata column that
>>>>>>>>>>>>>> inherits from the snapshot timestamp.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is a relatively low-friction change that can fix the
>>>>>>>>>>>>>> time travel problem and enable use cases like latency tracking,
>>>>>>>>>>>>>> temporal queries, TTL, and auditing.
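One way to read "monotonic snapshot timestamps using Lamport timestamps" is sketched below, assuming the committer knows the parent snapshot's timestamp: take the local wall clock, but never go below the parent's timestamp plus one. This is an illustrative assumption, not the algorithm from the proposal doc, and `MonotonicSnapshotClock` is hypothetical.

```java
import java.util.function.LongSupplier;

final class MonotonicSnapshotClock {
  // Lamport-style rule: the new timestamp is the wall clock reading, bumped to
  // parent + 1 if the local clock is behind (e.g. after backward drift).
  static long nextTimestampMillis(long parentTimestampMillis, LongSupplier wallClockMillis) {
    return Math.max(wallClockMillis.getAsLong(), parentTimestampMillis + 1);
  }

  public static void main(String[] args) {
    long parent = 1_000L;
    System.out.println(nextTimestampMillis(parent, () -> 1_500L)); // 1500: clock is ahead
    System.out.println(nextTimestampMillis(parent, () -> 900L));   // 1001: clock drifted back
    System.out.println(nextTimestampMillis(parent, System::currentTimeMillis));
  }
}
```

If the value is computed when a commit is applied to its parent, a retry against a newer parent would recompute it, which is what would keep the snapshot log strictly increasing under this assumption.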
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is no accuracy requirement on the timestamp values. In
>>>>>>>>>>>>>> practice, modern servers with NTP have fairly reliable wall
>>>>>>>>>>>>>> clocks. For example, the Java library implements this validation
>>>>>>>>>>>>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>,
>>>>>>>>>>>>>> which rejects snapshot timestamps that drift backward by more
>>>>>>>>>>>>>> than one minute. I don't think we have heard many complaints
>>>>>>>>>>>>>> about commit failures caused by that clock drift validation.
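For comparison, the kind of tolerance check referenced above can be sketched as follows. This is a simplified illustration in the spirit of that validation, not a copy of `TableMetadata`: consecutive snapshot log timestamps may step backward by at most one minute. It also shows why the check is weaker than monotonicity, since a 30-second backward step still passes. `SnapshotLogValidation` is hypothetical.

```java
import java.util.List;

final class SnapshotLogValidation {
  static final long ONE_MINUTE_MILLIS = 60_000L;

  // Accepts a snapshot log (timestamps in commit order) unless some entry is
  // more than one minute earlier than the entry before it.
  static boolean withinDriftTolerance(List<Long> timestampsMillis) {
    for (int i = 1; i < timestampsMillis.size(); i++) {
      long delta = timestampsMillis.get(i) - timestampsMillis.get(i - 1);
      if (delta < -ONE_MINUTE_MILLIS) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(withinDriftTolerance(List.of(100_000L, 70_000L)));  // true: 30s backward
    System.out.println(withinDriftTolerance(List.of(200_000L, 100_000L))); // false: 100s backward
  }
}
```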
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Would appreciate feedback on the revised proposal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP()
>>>>>>>>>>>>>>> is usually evaluated quite early in engines, so we could
>>>>>>>>>>>>>>> theoretically have another expression closer to the commit
>>>>>>>>>>>>>>> time. You are right, though, that it won't be the actual commit
>>>>>>>>>>>>>>> time, given that we have to write it into the files. Also, I
>>>>>>>>>>>>>>> don't think generating a timestamp for a row as it is being
>>>>>>>>>>>>>>> written is going to be beneficial. To sum up, expression-based
>>>>>>>>>>>>>>> defaults would allow us to capture the time the transaction or
>>>>>>>>>>>>>>> write starts, but not the actual commit time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Russell, if the goal is to know what happened to the table
>>>>>>>>>>>>>>> in a given time frame, isn't the changelog scan the way to go?
>>>>>>>>>>>>>>> It would assign commit ordinals based on lineage and include
>>>>>>>>>>>>>>> row-level diffs. How would you be able to determine changes with
>>>>>>>>>>>>>>> row timestamps by just looking at the latest snapshot?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It does seem promising to make snapshot timestamps strictly
>>>>>>>>>>>>>>> increasing to avoid ambiguity during time travel.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > Whether or not "t" is an atomic clock time is not as
>>>>>>>>>>>>>>>> > important as the query between time bounds making sense.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure I get it then. If we want monotonically
>>>>>>>>>>>>>>>> increasing times, but they don't have to be real times, then
>>>>>>>>>>>>>>>> how do you know what notion of "time" you care about for these
>>>>>>>>>>>>>>>> filters? Or to put it another way, how do you know that your
>>>>>>>>>>>>>>>> "before" and "after" times are reasonable? If the boundaries of
>>>>>>>>>>>>>>>> these time queries can move around a bit, by how much?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It seems to me that row IDs can play an important role here
>>>>>>>>>>>>>>>> because you have the order guarantee that we seem to want for
>>>>>>>>>>>>>>>> this use case: if snapshot A was committed before snapshot B,
>>>>>>>>>>>>>>>> then the rows from A have row IDs that are always less than the
>>>>>>>>>>>>>>>> row IDs of B. The problem is that we don't know where those row
>>>>>>>>>>>>>>>> IDs start and end once A and B are no longer tracked. Using a
>>>>>>>>>>>>>>>> "timestamp" seems to work, but I still worry that without
>>>>>>>>>>>>>>>> reliable timestamps that correspond with some guarantee to real
>>>>>>>>>>>>>>>> timestamps, we are creating a feature that seems reliable but
>>>>>>>>>>>>>>>> isn't.
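The ordering property of row IDs can be illustrated with a simplified model, assuming each commit reserves a contiguous range starting at the table's next row ID (the V3 spec actually assigns `first_row_id` per data file and derives a row's ID from its position in the file). `RowIdRanges` is hypothetical. The model also makes the stated problem concrete: the ranges live in snapshot metadata, so once A and B expire, nothing records where A's range ends and B's begins.

```java
import java.util.ArrayList;
import java.util.List;

final class RowIdRanges {
  // Row IDs [firstRowId, firstRowId + addedRows) reserved by one commit.
  record Range(String snapshot, long firstRowId, long addedRows) {
    long lastRowId() {
      return firstRowId + addedRows - 1;
    }
  }

  // Assigns ranges in commit order; the table's next row ID only moves forward.
  static List<Range> assign(long nextRowId, List<String> snapshots, List<Long> addedRows) {
    List<Range> ranges = new ArrayList<>();
    for (int i = 0; i < snapshots.size(); i++) {
      ranges.add(new Range(snapshots.get(i), nextRowId, addedRows.get(i)));
      nextRowId += addedRows.get(i);
    }
    return ranges;
  }

  public static void main(String[] args) {
    List<Range> r = assign(0L, List.of("A", "B"), List.of(3L, 2L));
    // Every row ID from A (0..2) is below every row ID from B (3..4).
    System.out.println(r.get(0).lastRowId() < r.get(1).firstRowId()); // true
  }
}
```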
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm somewhat open to the idea of introducing a snapshot
>>>>>>>>>>>>>>>> timestamp that the catalog guarantees is monotonically
>>>>>>>>>>>>>>>> increasing. But if we did that, wouldn't we still need to know
>>>>>>>>>>>>>>>> the association between these timestamps and snapshots after the
>>>>>>>>>>>>>>>> snapshot metadata expires? My mental model is that this would be
>>>>>>>>>>>>>>>> used to look for data that arrived, say, 3 weeks ago on Dec 24th.
>>>>>>>>>>>>>>>> Since the snapshot metadata is no longer around, we could use the
>>>>>>>>>>>>>>>> row timestamp to find those rows. But how do we know that the
>>>>>>>>>>>>>>>> snapshot timestamps correspond to the actual timestamp range of
>>>>>>>>>>>>>>>> Dec 24th? Is it just "close enough" as long as we don't have
>>>>>>>>>>>>>>>> out-of-order timestamps? This is what I mean by needing to keep
>>>>>>>>>>>>>>>> track of the association between timestamps and snapshots after
>>>>>>>>>>>>>>>> the metadata expires. It seems like you either need to keep track
>>>>>>>>>>>>>>>> of what the catalog's clock was for events you care about, or you
>>>>>>>>>>>>>>>> don't really care about exact timestamps.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The key goal here is the ability to answer the question
>>>>>>>>>>>>>>>>> "what happened to the table in some time window (before < t < after)?"
>>>>>>>>>>>>>>>>> Whether or not "t" is an atomic clock time is not as
>>>>>>>>>>>>>>>>> important as the query between time bounds making sense.
>>>>>>>>>>>>>>>>> Downstream applications (from what I know) are mostly
>>>>>>>>>>>>>>>>> sensitive to getting discrete and well-defined answers to
>>>>>>>>>>>>>>>>> this question, like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1 < t < 2 should be exclusive of
>>>>>>>>>>>>>>>>> 2 < t < 3 should be exclusive of
>>>>>>>>>>>>>>>>> 3 < t < 4
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And the union of these should be the same as the query
>>>>>>>>>>>>>>>>> asking for 1 < t < 4
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Currently this is not possible because we have no
>>>>>>>>>>>>>>>>> guarantee of ordering in our timestamps:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Snapshots:         A  -> B  -> C
>>>>>>>>>>>>>>>>> Sequence numbers:  50 -> 51 -> 52
>>>>>>>>>>>>>>>>> Timestamps:        3  -> 1  -> 2
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This makes time travel always a little wrong to start with.
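The example above can be made concrete with a small model of timestamp-based time travel. It assumes an as-of lookup that scans the snapshot log in commit order and keeps the last entry whose timestamp is at or before `t`, which is one common way to resolve such a query; `AsOfLookup` is hypothetical.

```java
import java.util.List;

final class AsOfLookup {
  record LogEntry(String snapshot, long sequenceNumber, long timestamp) {}

  // Returns the last snapshot (in commit order) whose timestamp is <= t,
  // or null if no snapshot qualifies.
  static String snapshotAsOf(List<LogEntry> log, long t) {
    String result = null;
    for (LogEntry entry : log) {
      if (entry.timestamp() <= t) {
        result = entry.snapshot();
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<LogEntry> log = List.of(
        new LogEntry("A", 50, 3),
        new LogEntry("B", 51, 1),
        new LogEntry("C", 52, 2));
    // "As of t = 1" resolves to B, whose table state already includes A's
    // changes even though A's recorded timestamp (3) is later than t.
    System.out.println(snapshotAsOf(log, 1)); // B
  }
}
```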
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The Java implementation only allows one minute of negative
>>>>>>>>>>>>>>>>> time on commit, so we actually kind of do have this as a
>>>>>>>>>>>>>>>>> "light monotonicity" requirement, but as noted above there
>>>>>>>>>>>>>>>>> is no spec requirement for this. While we do have sequence
>>>>>>>>>>>>>>>>> numbers and row IDs, we still don't have a stable way of
>>>>>>>>>>>>>>>>> associating these with a consistent time in an
>>>>>>>>>>>>>>>>> engine-independent way.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ideally we just want to have one consistent way of
>>>>>>>>>>>>>>>>> answering the question "what did the table look like at time t?",
>>>>>>>>>>>>>>>>> which I think we get by adding a new field that is a
>>>>>>>>>>>>>>>>> timestamp, set by the catalog close to commit time,
>>>>>>>>>>>>>>>>> that always goes up.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm not sure we can really do this with an engine
>>>>>>>>>>>>>>>>> expression, since engines won't know when the data is actually
>>>>>>>>>>>>>>>>> committed at the time they write files.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This seems like a lot of new complexity in the format. I
>>>>>>>>>>>>>>>>>> would like us to explore whether we can build the considered
>>>>>>>>>>>>>>>>>> use cases on top of expression-based defaults instead.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We already plan to support CURRENT_TIMESTAMP() and
>>>>>>>>>>>>>>>>>> similar functions that are part of the SQL standard definition
>>>>>>>>>>>>>>>>>> for default values. This would provide us a way to know the
>>>>>>>>>>>>>>>>>> relative row order. True, this will usually represent the start
>>>>>>>>>>>>>>>>>> of the operation. We may define COMMIT_TIMESTAMP() or a similar
>>>>>>>>>>>>>>>>>> expression for the actual commit time, if there are use cases
>>>>>>>>>>>>>>>>>> that need that. Plus, we may explore an approach similar to
>>>>>>>>>>>>>>>>>> MySQL that allows users to reset the default value on update.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think this is the right step forward. Our current
>>>>>>>>>>>>>>>>>>> "timestamp" definition is too ambiguous to be useful, so
>>>>>>>>>>>>>>>>>>> establishing a well-defined and monotonic timestamp could be
>>>>>>>>>>>>>>>>>>> really great. I also like the ability for rows to know this
>>>>>>>>>>>>>>>>>>> value without having to rely on snapshot information, which can
>>>>>>>>>>>>>>>>>>> be expired.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have revised the row timestamp proposal with the
>>>>>>>>>>>>>>>>>>>> following changes.
>>>>>>>>>>>>>>>>>>>> * a new commit_timestamp field in snapshot metadata
>>>>>>>>>>>>>>>>>>>> that has nanosecond precision.
>>>>>>>>>>>>>>>>>>>> * this optional field is only set by the REST catalog
>>>>>>>>>>>>>>>>>>>> server
>>>>>>>>>>>>>>>>>>>> * it needs to be monotonic (e.g. implemented using
>>>>>>>>>>>>>>>>>>>> Lamport timestamp)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> For long-running streaming jobs that commit
>>>>>>>>>>>>>>>>>>>>> periodically, it is difficult to establish a constant value of
>>>>>>>>>>>>>>>>>>>>> current_timestamp across all writer tasks for each commit
>>>>>>>>>>>>>>>>>>>>> cycle. I guess streaming writers may just need to write the
>>>>>>>>>>>>>>>>>>>>> wall clock time when appending a row to a data file as the
>>>>>>>>>>>>>>>>>>>>> default value of current_timestamp.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I don't think that every row would have a different
>>>>>>>>>>>>>>>>>>>>>> value. That would be up to the engine, but I would expect
>>>>>>>>>>>>>>>>>>>>>> engines to insert `CURRENT_TIMESTAMP` into the plan and then
>>>>>>>>>>>>>>>>>>>>>> replace it with a constant, resulting in a consistent value
>>>>>>>>>>>>>>>>>>>>>> for all rows.
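A minimal sketch of the engine behavior described above, assuming the engine evaluates `CURRENT_TIMESTAMP` once per query and reuses the constant for every row it writes. `CurrentTimestampDefault` is hypothetical and not taken from any engine.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class CurrentTimestampDefault {
  // Evaluates the clock once (as if folded into the plan) and uses that
  // constant as the default value for every inserted row.
  static List<Instant> fillDefaults(int rowCount, Supplier<Instant> clock) {
    Instant queryTimestamp = clock.get();
    List<Instant> values = new ArrayList<>(rowCount);
    for (int i = 0; i < rowCount; i++) {
      values.add(queryTimestamp);
    }
    return values;
  }

  public static void main(String[] args) {
    List<Instant> values = fillDefaults(3, Instant::now);
    System.out.println(values.get(0).equals(values.get(2))); // true
  }
}
```

The constant reflects when the query was evaluated, not when the snapshot was committed, which is the gap debated in the surrounding replies.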
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> You're right that this would not necessarily be the
>>>>>>>>>>>>>>>>>>>>>> commit time. But neither is the commit timestamp from
>>>>>>>>>>>>>>>>>>>>>> Iceberg's snapshot. I'm not sure how we are going to define
>>>>>>>>>>>>>>>>>>>>>> "good enough" for this purpose. I think at least
>>>>>>>>>>>>>>>>>>>>>> `CURRENT_TIMESTAMP` has reliable and known behavior when you
>>>>>>>>>>>>>>>>>>>>>> look at how it is handled in engines. And if you want the
>>>>>>>>>>>>>>>>>>>>>> Iceberg timestamp, then periodically query the snapshots table
>>>>>>>>>>>>>>>>>>>>>> to keep track of them in a table you can join to. I don't
>>>>>>>>>>>>>>>>>>>>>> think this rises to the need for a table feature unless we can
>>>>>>>>>>>>>>>>>>>>>> guarantee that it is correct.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> > Postgres `current_timestamp` captures the
>>>>>>>>>>>>>>>>>>>>>>> > transaction start time [1, 2]. Should we extend the same
>>>>>>>>>>>>>>>>>>>>>>> > semantic to Iceberg: all rows added in the same snapshot
>>>>>>>>>>>>>>>>>>>>>>> > should have the same timestamp value?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Since Postgres current_timestamp captures the
>>>>>>>>>>>>>>>>>>>>>>> transaction start time, all rows added in the same insert
>>>>>>>>>>>>>>>>>>>>>>> transaction would have the same value as the transaction
>>>>>>>>>>>>>>>>>>>>>>> timestamp with the column definition above.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If we extend a similar semantic to Iceberg, should all rows
>>>>>>>>>>>>>>>>>>>>>>> added in the same Iceberg transaction/snapshot have the same
>>>>>>>>>>>>>>>>>>>>>>> timestamp?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Ryan, as I understand your comment about using the
>>>>>>>>>>>>>>>>>>>>>>> current_timestamp expression as the column default value, you
>>>>>>>>>>>>>>>>>>>>>>> were thinking that the engine would set the column value to
>>>>>>>>>>>>>>>>>>>>>>> the wall clock time when appending a row to a data file,
>>>>>>>>>>>>>>>>>>>>>>> right? Then almost every row would have a different timestamp
>>>>>>>>>>>>>>>>>>>>>>> value.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The `current_timestamp` expression may not always carry
>>>>>>>>>>>>>>>>>>>>>>>> the right semantic for the use cases. E.g., latency tracking
>>>>>>>>>>>>>>>>>>>>>>>> is interested in when records are added / committed to the
>>>>>>>>>>>>>>>>>>>>>>>> table, not when the record was appended to an uncommitted
>>>>>>>>>>>>>>>>>>>>>>>> data file in the processing engine. Record creation and
>>>>>>>>>>>>>>>>>>>>>>>> Iceberg commit can be minutes or even hours apart.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> A row timestamp inherited from the snapshot timestamp has
>>>>>>>>>>>>>>>>>>>>>>>> no overhead at the initial commit and very little storage
>>>>>>>>>>>>>>>>>>>>>>>> overhead during file rewrite. A per-row current_timestamp
>>>>>>>>>>>>>>>>>>>>>>>> would have a distinct value for every row and more storage
>>>>>>>>>>>>>>>>>>>>>>>> overhead.
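The storage argument can be illustrated by counting distinct values in the timestamp column of a compacted file, a rough proxy for how well dictionary or run-length encoding would compress it. The data is synthetic, the sketch assumes the compacted file holds rows from several snapshots, and `TimestampCardinality` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

final class TimestampCardinality {
  // Builds a synthetic timestamp column for a file compacted from several
  // snapshots. Inherited values share one timestamp per snapshot; per-row
  // values give each row its own (synthetic) wall-clock reading.
  static List<Long> compactedColumn(int snapshots, int rowsPerSnapshot, boolean inherited) {
    List<Long> column = new ArrayList<>();
    for (int s = 0; s < snapshots; s++) {
      long snapshotTs = (long) s * rowsPerSnapshot;
      for (int i = 0; i < rowsPerSnapshot; i++) {
        column.add(inherited ? snapshotTs : snapshotTs + i);
      }
    }
    return column;
  }

  // Number of distinct values in the column.
  static int distinctValues(List<Long> column) {
    return new HashSet<>(column).size();
  }

  public static void main(String[] args) {
    System.out.println(distinctValues(compactedColumn(3, 1_000, true)));  // 3
    System.out.println(distinctValues(compactedColumn(3, 1_000, false))); // 3000
  }
}
```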
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> OLTP databases deal with small row-level
>>>>>>>>>>>>>>>>>>>>>>>> transactions. Postgres `current_timestamp` captures the
>>>>>>>>>>>>>>>>>>>>>>>> transaction start time [1, 2]. Should we extend the same
>>>>>>>>>>>>>>>>>>>>>>>> semantic to Iceberg: all rows added in the same snapshot
>>>>>>>>>>>>>>>>>>>>>>>> should have the same timestamp value?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this
>>>>>>>>>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>> To support the created_by timestamp, we would
>>>>>>>>>>>>>>>>>>>>>>>>>> need to implement the following row lineage behavior:
>>>>>>>>>>>>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot
>>>>>>>>>>>>>>>>>>>>>>>>>> timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>> * During rewrite (like compaction), it should be
>>>>>>>>>>>>>>>>>>>>>>>>>> persisted into data files.
>>>>>>>>>>>>>>>>>>>>>>>>>> * During update, it needs to be carried over from
>>>>>>>>>>>>>>>>>>>>>>>>>> the previous row. This is similar to the row_id carry-over
>>>>>>>>>>>>>>>>>>>>>>>>>> for row updates.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the short hand.  These are not the same:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 1.  Insertion time - time the row was inserted.
>>>>>>>>>>>>>>>>>>>>>>>>> 2.  Created by - The system that created the record.
>>>>>>>>>>>>>>>>>>>>>>>>> 3.  Updated by - The system that last updated the
>>>>>>>>>>>>>>>>>>>>>>>>> record.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Depending on the exact use case, these might or
>>>>>>>>>>>>>>>>>>>>>>>>> might not have utility. I'm just wondering if there will be
>>>>>>>>>>>>>>>>>>>>>>>>> more examples like this in the future.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The created_by column would likely incur significantly
>>>>>>>>>>>>>>>>>>>>>>>>>> higher storage overhead than the updated_by column. As rows
>>>>>>>>>>>>>>>>>>>>>>>>>> are updated over time, the cardinality of this column in data
>>>>>>>>>>>>>>>>>>>>>>>>>> files can be high. Hence, the created_by column may not
>>>>>>>>>>>>>>>>>>>>>>>>>> compress well. The row_id column has a similar problem. One
>>>>>>>>>>>>>>>>>>>>>>>>>> side effect of enabling row lineage by default for V3 tables
>>>>>>>>>>>>>>>>>>>>>>>>>> is the storage overhead of the row_id column after compaction,
>>>>>>>>>>>>>>>>>>>>>>>>>> especially for narrow tables with few columns.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I agree. I think this analysis also shows that
>>>>>>>>>>>>>>>>>>>>>>>>> some consumers of Iceberg might not necessarily want to have
>>>>>>>>>>>>>>>>>>>>>>>>> all these columns, so we might want to make them configurable
>>>>>>>>>>>>>>>>>>>>>>>>> rather than mandating them for all tables. Ryan's thought on
>>>>>>>>>>>>>>>>>>>>>>>>> default values seems like it would solve the issues I was
>>>>>>>>>>>>>>>>>>>>>>>>> raising.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden
>>>>>>>>>>>>>>>>>>>>>>>>>> > to application developers. While some databases require an
>>>>>>>>>>>>>>>>>>>>>>>>>> > explicit column in the schema, those databases provide
>>>>>>>>>>>>>>>>>>>>>>>>>> > triggers to auto set the column value. For Iceberg, the
>>>>>>>>>>>>>>>>>>>>>>>>>> > snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Since the use cases don't require an exact
>>>>>>>>>>>>>>>>>>>>>>>>>> timestamp, this seems like the best solution to get what
>>>>>>>>>>>>>>>>>>>>>>>>>> people want (an insertion timestamp) with clear and
>>>>>>>>>>>>>>>>>>>>>>>>>> well-defined behavior. Since `current_timestamp` is defined by
>>>>>>>>>>>>>>>>>>>>>>>>>> the SQL spec, it makes sense to me that we could use it and
>>>>>>>>>>>>>>>>>>>>>>>>>> have reasonable behavior.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I've talked with Anton about this before and
>>>>>>>>>>>>>>>>>>>>>>>>>> maybe he'll jump in on this thread. I think that we may need
>>>>>>>>>>>>>>>>>>>>>>>>>> to extend default values to include default value expressions,
>>>>>>>>>>>>>>>>>>>>>>>>>> like `current_timestamp`, which is allowed by the SQL spec.
>>>>>>>>>>>>>>>>>>>>>>>>>> That would solve the problem as well as some others (like
>>>>>>>>>>>>>>>>>>>>>>>>>> `current_date` or `current_user`) and would not create a
>>>>>>>>>>>>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp feature in
>>>>>>>>>>>>>>>>>>>>>>>>>> the format.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> > Also some environments may have stronger clock
>>>>>>>>>>>>>>>>>>>>>>>>>> > service, like Spanner TrueTime service.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Even in cases like this, commit retries can
>>>>>>>>>>>>>>>>>>>>>>>>>> reorder commits and make timestamps out of order. I don't
>>>>>>>>>>>>>>>>>>>>>>>>>> think that we should be making guarantees or even exposing
>>>>>>>>>>>>>>>>>>>>>>>>>> metadata that people might mistake as having those guarantees.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the concern about reliable timestamps,
>>>>>>>>>>>>>>>>>>>>>>>>>>> we are not proposing using timestamps for ordering. With NTP in
>>>>>>>>>>>>>>>>>>>>>>>>>>> modern computers, they are generally reliable enough for the
>>>>>>>>>>>>>>>>>>>>>>>>>>> intended use cases. Also, some environments may have a stronger
>>>>>>>>>>>>>>>>>>>>>>>>>>> clock service, like the Spanner TrueTime service
>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> > joining to timestamps from the snapshots
>>>>>>>>>>>>>>>>>>>>>>>>>>> > metadata table.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot history, which is
>>>>>>>>>>>>>>>>>>>>>>>>>>> often retained for only a few days for performance reasons.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>>>>>>>>>>>> > `current_timestamp`) rather than relying on an implicit one from
>>>>>>>>>>>>>>>>>>>>>>>>>>> > table metadata.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> An explicit timestamp column places more burden on application
>>>>>>>>>>>>>>>>>>>>>>>>>>> developers. While some databases require an explicit column in the
>>>>>>>>>>>>>>>>>>>>>>>>>>> schema, those databases provide triggers to set the column value
>>>>>>>>>>>>>>>>>>>>>>>>>>> automatically. For Iceberg, the snapshot timestamp is the closest
>>>>>>>>>>>>>>>>>>>>>>>>>>> equivalent to the trigger timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Also, a timestamp set during computation (like streaming ingestion or
>>>>>>>>>>>>>>>>>>>>>>>>>>> a relatively long batch computation) doesn't capture the time when
>>>>>>>>>>>>>>>>>>>>>>>>>>> the rows/files are added to the Iceberg table, which happens in
>>>>>>>>>>>>>>>>>>>>>>>>>>> batches.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a
>>>>>>>>>>>>>>>>>>>>>>>>>>> > longer history of snapshot timestamps, like storing a catalog's event
>>>>>>>>>>>>>>>>>>>>>>>>>>> > log for long-term access to timestamp info
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> This is not really consumable, since it requires joining the regular
>>>>>>>>>>>>>>>>>>>>>>>>>>> table query with the catalog event log. I would also imagine the
>>>>>>>>>>>>>>>>>>>>>>>>>>> catalog event log is capped at a shorter retention (maybe a few
>>>>>>>>>>>>>>>>>>>>>>>>>>> months) than data retention (which could be a few years).
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at the row
>>>>>>>>>>>>>>>>>>>>>>>>>>>> level. Timestamps in metadata that would be carried down to the row
>>>>>>>>>>>>>>>>>>>>>>>>>>>> level already confuse people who expect them to be useful or
>>>>>>>>>>>>>>>>>>>>>>>>>>>> reliable, rather than for debugging. I think extending this to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> row level would only make the problem worse.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> You can already get this information by projecting the last updated
>>>>>>>>>>>>>>>>>>>>>>>>>>>> sequence number, which is reliable, and joining to timestamps from
>>>>>>>>>>>>>>>>>>>>>>>>>>>> the snapshots metadata table. Of course, the drawback there is
>>>>>>>>>>>>>>>>>>>>>>>>>>>> losing the timestamp information when snapshots expire, but since it
>>>>>>>>>>>>>>>>>>>>>>>>>>>> isn't reliable anyway I'd be fine with that.
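A minimal Python sketch of the project-and-join approach, using hypothetical in-memory stand-ins: a dict playing the role of the snapshots metadata table (keyed by sequence number here for simplicity, which is an assumption about the join key) and rows projected with `_last_updated_sequence_number`. A row whose snapshot has expired gets no timestamp, which is the drawback described above.

```python
from typing import Optional

# Hypothetical stand-in for the snapshots metadata table:
# sequence number -> snapshot timestamp (ms). Snapshots 1 and 2 have expired.
snapshots = {
    3: 1_700_000_300_000,
    4: 1_700_000_400_000,
}

# Hypothetical rows projected with their last updated sequence number.
rows = [
    {"id": 1, "_last_updated_sequence_number": 1},
    {"id": 2, "_last_updated_sequence_number": 3},
    {"id": 3, "_last_updated_sequence_number": 4},
]


def row_timestamp(row) -> Optional[int]:
    """Left join: look up the snapshot timestamp; None if the snapshot expired."""
    return snapshots.get(row["_last_updated_sequence_number"])


for row in rows:
    print(row["id"], row_timestamp(row))
# 1 None
# 2 1700000300000
# 3 1700000400000
```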
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance, are probably
>>>>>>>>>>>>>>>>>>>>>>>>>>>> better served by embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit one from
>>>>>>>>>>>>>>>>>>>>>>>>>>>> table metadata. And for those use cases, you could also keep a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> longer history of snapshot timestamps, like storing a catalog's event
>>>>>>>>>>>>>>>>>>>>>>>>>>>> log for long-term access to timestamp info. I think that would be
>>>>>>>>>>>>>>>>>>>>>>>>>>>> better than storing it at the row level.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose a row
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> timestamp concept that can help with many use cases, like temporal
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> queries, latency tracking, TTL, auditing, and compliance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves very
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> similarly to the *_last_updated_sequence_number* column for row
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lineage.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    values are persisted in the data files.
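A minimal Python sketch of the two rules in the list above, assuming the new column follows the same null-means-inherit convention as `_last_updated_sequence_number` in row lineage. `Row`, `effective_timestamp`, and `rewrite` are hypothetical names for illustration, not Iceberg APIs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    id: int
    # None means no value is persisted in the data file, so the reader inherits one.
    last_updated_timestamp_ms: Optional[int] = None


def effective_timestamp(row: Row, file_snapshot_ts_ms: int) -> int:
    """Read path: a null value inherits the timestamp of the snapshot that added the file."""
    if row.last_updated_timestamp_ms is None:
        return file_snapshot_ts_ms
    return row.last_updated_timestamp_ms


def rewrite(rows: List[Row], file_snapshot_ts_ms: int) -> List[Row]:
    """Rewrite (e.g. compaction): persist the effective values into the new data file."""
    return [Row(r.id, effective_timestamp(r, file_snapshot_ts_ms)) for r in rows]


# Rows appended by a snapshot committed at t=1000 carry no stored value.
appended = [Row(1), Row(2)]
print([effective_timestamp(r, 1_000) for r in appended])  # [1000, 1000]

# Compaction commits a new snapshot at t=2000, but the persisted values are kept.
compacted = rewrite(appended, 1_000)
print([effective_timestamp(r, 2_000) for r in compacted])  # [1000, 1000]
```

Persisting the values during rewrite is what keeps compaction from resetting every row's timestamp to the compaction snapshot's time.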
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>