+1 for improving how commit timestamps are assigned so that they are
monotonic. This requirement has come up across multiple discussions, such
as notifications, multi-table transactions, time travel accuracy, and row
timestamps. It would be good to have a single, consistent way to represent
and assign timestamps that could be leveraged across multiple features.

On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:

> Yeah, to add my perspective on that discussion, I think my primary concern
> is that people expect timestamps to be monotonic and if they aren't then a
> `_last_update_timestamp` field just makes the problem worse. But it is
> _nice_ to have row-level timestamps. So I would be okay if we revisit how
> we assign commit timestamps and improve it so that you get monotonic
> behavior.
>
> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:
>
>> We had an offline discussion with Ryan. I revised the proposal as follows.
>>
>> 1. V4 would require writers to generate *monotonic* snapshot timestamps.
>> The proposal doc has a section that describes a recommended implementation
>> using Lamport timestamps (a minimal sketch follows after this list).
>> 2. Expose a *last_update_timestamp* metadata column that inherits its
>> value from the snapshot timestamp.
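>>
>> For illustration only (hypothetical method and parameter names), one way
>> such a Lamport-style assignment could look in Java:
>>
>>   // Pick a timestamp strictly greater than the parent snapshot's,
>>   // falling back to the wall clock whenever it is already ahead.
>>   long snapshotTimestampMs(long parentSnapshotTimestampMs) {
>>     return Math.max(System.currentTimeMillis(), parentSnapshotTimestampMs + 1);
>>   }
>>
>> This stays monotonic per table even if a writer's clock drifts backward,
>> at the cost of running slightly ahead of the wall clock after commits
>> made under clock skew.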
>>
>> This is a relatively low-friction change that can fix the time travel
>> problem and enable use cases like latency tracking, temporal queries,
>> TTL, and auditing.
>>
>> There is no accuracy requirement on the timestamp values. In practice,
>> modern servers with NTP have pretty reliable wall clocks. E.g., the Java
>> library implements a validation
>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
>> that protects against backward clock drift of up to one minute for
>> snapshot timestamps. I don't think we have heard many complaints of
>> commit failures due to that clock drift validation.
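>>
>> A simplified sketch of that check (not the exact library code; variable
>> names are illustrative):
>>
>>   // Reject a new snapshot timestamp that is more than one minute
>>   // behind the previous snapshot's timestamp.
>>   if (lastSnapshotTimestampMs - newSnapshotTimestampMs > 60_000L) {
>>     throw new IllegalArgumentException(
>>         "Invalid snapshot timestamp: before last snapshot");
>>   }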
>>
>> Would appreciate feedback on the revised proposal.
>>
>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>
>> Thanks,
>> Steven
>>
>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is usually
>>> evaluated quite early in engines so we could theoretically have another
>>> expression closer to the commit time. You are right, though, it won't be
>>> the actual commit time given that we have to write it into the files. Also,
>>> I don't think generating a timestamp for a row as it is being written is
>>> going to be beneficial. To sum up, expression-based defaults would allow us
>>> to capture the time the transaction or write starts, but not the actual
>>> commit time.
>>>
>>> Russell, if the goal is to know what happened to the table in a given
>>> time frame, isn't the changelog scan the way to go? It would assign commit
>>> ordinals based on lineage and include row-level diffs. How would you be
>>> able to determine changes with row timestamps by just looking at the latest
>>> snapshot?
>>>
>>> It does seem promising to make snapshot timestamps strictly increasing
>>> to avoid ambiguity during time travel.
>>>
>>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>>>
>>>> > Whether or not "t" is an atomic clock time is not as important as the
>>>> query between time bounds making sense.
>>>>
>>>> I'm not sure I get it then. If we want monotonically increasing times,
>>>> but they don't have to be real times, then how do you know what notion of
>>>> "time" you care about for these filters? Or to put it another way, how do
>>>> you know that your "before" and "after" times are reasonable? If the
>>>> boundaries of these time queries can move around a bit, by how much?
>>>>
>>>> It seems to me that row IDs can play an important role here because you
>>>> have the order guarantee that we seem to want for this use case: if
>>>> snapshot A was committed before snapshot B, then the rows from A have row
>>>> IDs that are always less than the row IDs of B. The problem is that we
>>>> don't know where those row IDs start and end once A and B are no longer
>>>> tracked. Using a "timestamp" seems to work, but I still worry that without
>>>> reliable timestamps that correspond with some guarantee to real timestamps,
>>>> we are creating a feature that seems reliable but isn't.
>>>>
>>>> I'm somewhat open to the idea of introducing a snapshot timestamp that
>>>> the catalog guarantees is monotonically increasing. But if we did that,
>>>> wouldn't we still need to know the association between these timestamps and
>>>> snapshots after the snapshot metadata expires? My mental model is that this
>>>> would be used to look for data that arrived, say, 3 weeks ago on Dec 24th.
>>>> Since the snapshot metadata is no longer around, we could use the row
>>>> timestamp to find those rows. But how do we know that the snapshot
>>>> timestamps correspond to the actual timestamp range of Dec 24th? Is it just
>>>> "close enough" as long as we don't have out of order timestamps? This is
>>>> what I mean by needing to keep track of the association between timestamps
>>>> and snapshots after the metadata expires. Seems like you either need to
>>>> keep track of what the catalog's clock was for events you care about, or
>>>> you don't really care about exact timestamps.
>>>>
>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>> [email protected]> wrote:
>>>>
>>>>> The key goal here is the ability to answer the question "what happened
>>>>> to the table in some time window (before < t < after)?"
>>>>> Whether or not "t" is an atomic clock time is not as important as the
>>>>> query between time bounds making sense.
>>>>> Downstream applications (from what I know) are mostly sensitive to
>>>>> getting discrete and well-defined answers to this question, like:
>>>>>
>>>>> 1 < t < 2 should be exclusive of
>>>>> 2 < t < 3, which should be exclusive of
>>>>> 3 < t < 4.
>>>>>
>>>>> And the union of these should be the same as the query asking for
>>>>> 1 < t < 4.
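>>>>>
>>>>> For concreteness, the kind of window query being asked for might look
>>>>> roughly like this in Iceberg's Java API (hypothetical:
>>>>> _last_updated_timestamp_ms is the metadata column from the proposal,
>>>>> table is an org.apache.iceberg.Table, Expressions is
>>>>> org.apache.iceberg.expressions.Expressions, and t1/t2 are the window
>>>>> bounds in milliseconds):
>>>>>
>>>>>   // Plan a scan over rows last updated in [t1, t2); adjacent windows
>>>>>   // built this way are disjoint and union cleanly.
>>>>>   CloseableIterable<FileScanTask> tasks = table.newScan()
>>>>>       .filter(Expressions.and(
>>>>>           Expressions.greaterThanOrEqual("_last_updated_timestamp_ms", t1),
>>>>>           Expressions.lessThan("_last_updated_timestamp_ms", t2)))
>>>>>       .planFiles();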
>>>>>
>>>>> Currently this is not possible because we have no guarantee of
>>>>> ordering in our timestamps:
>>>>>
>>>>> Snapshots:
>>>>> A -> B -> C
>>>>> Sequence numbers:
>>>>> 50 -> 51 -> 52
>>>>> Timestamps:
>>>>> 3 -> 1 -> 2
>>>>>
>>>>> This makes time travel always a little wrong to start with.
>>>>>
>>>>> The Java implementation only allows one minute of negative time on
>>>>> commit, so we actually kind of do have a "light monotonicity"
>>>>> requirement, but as noted above there is no spec requirement for
>>>>> this. While we do have sequence numbers and row IDs, we still don't
>>>>> have a stable way of associating these with a consistent time in an
>>>>> engine-independent way.
>>>>>
>>>>> Ideally we just want one consistent way of answering the question
>>>>> "what did the table look like at time t", which I think we get by
>>>>> adding a new timestamp field, set by the catalog close to commit time,
>>>>> that always goes up.
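>>>>>
>>>>> As a sketch of what "set by the catalog close to commit time, always
>>>>> goes up" could mean server-side (hypothetical class and method names,
>>>>> using java.util.concurrent.atomic.AtomicLong):
>>>>>
>>>>>   class CommitClock {
>>>>>     private final AtomicLong lastIssuedMs = new AtomicLong();
>>>>>
>>>>>     // Issue a stamp strictly greater than any previously issued
>>>>>     // value, tracking the wall clock whenever it is ahead.
>>>>>     long nextCommitTimestampMs() {
>>>>>       return lastIssuedMs.updateAndGet(prev ->
>>>>>           Math.max(prev + 1, System.currentTimeMillis()));
>>>>>     }
>>>>>   }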
>>>>>
>>>>> I'm not sure we can really do this with an engine expression, since
>>>>> engines won't know, while writing the files, when the data will
>>>>> actually be committed.
>>>>>
>>>>>
>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> This seems like a lot of new complexity in the format. I would like
>>>>>> us to explore whether we can build the considered use cases on top of
>>>>>> expression-based defaults instead.
>>>>>>
>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar functions
>>>>>> that are part of the SQL standard definition for default values. This
>>>>>> would provide us with a way to know the relative row order. True, this
>>>>>> will usually represent the start of the operation. We may define
>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual commit time,
>>>>>> if there are use cases that need that. Plus, we may explore an approach
>>>>>> similar to MySQL that allows users to reset the default value on update.
>>>>>>
>>>>>> - Anton
>>>>>>
>>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think this is the right step forward. Our current "timestamp"
>>>>>>> definition is too ambiguous to be useful, so establishing
>>>>>>> a well-defined and monotonic timestamp could be really great. I also
>>>>>>> like the ability for rows to know this value without
>>>>>>> having to rely on snapshot information, which can be expired.
>>>>>>>
>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have revised the row timestamp proposal with the following
>>>>>>>> changes.
>>>>>>>> * a new commit_timestamp field in snapshot metadata that has
>>>>>>>> nanosecond precision
>>>>>>>> * this optional field is only set by the REST catalog server
>>>>>>>> * it needs to be monotonic (e.g., implemented using Lamport
>>>>>>>> timestamps)
>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Steven
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>
>>>>>>>>> For long-running streaming jobs that commit periodically, it is
>>>>>>>>> difficult to establish a constant value of current_timestamp across
>>>>>>>>> all writer tasks for each commit cycle. I guess streaming writers may
>>>>>>>>> just need to write the wall clock time when appending a row to a data
>>>>>>>>> file for the default value of current_timestamp.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I don't think that every row would have a different value. That
>>>>>>>>>> would be up to the engine, but I would expect engines to insert
>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a
>>>>>>>>>> constant, resulting in a consistent value for all rows.
>>>>>>>>>>
>>>>>>>>>> You're right that this would not necessarily be the commit time.
>>>>>>>>>> But neither is the commit timestamp from Iceberg's snapshot. I'm not
>>>>>>>>>> sure how we are going to define "good enough" for this purpose. I
>>>>>>>>>> think at least `CURRENT_TIMESTAMP` has reliable and known behavior
>>>>>>>>>> when you look at how it is handled in engines. And if you want the
>>>>>>>>>> Iceberg timestamp, then use a periodic query of the snapshots table
>>>>>>>>>> to keep track of them in a table you can join to. I don't think this
>>>>>>>>>> rises to the need for a table feature unless we can guarantee that
>>>>>>>>>> it is correct.
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> > Postgres `current_timestamp` captures the transaction start
>>>>>>>>>>> time [1, 2]. Should we extend the same semantic to Iceberg: all
>>>>>>>>>>> rows added in the same snapshot should have the same timestamp
>>>>>>>>>>> value?
>>>>>>>>>>>
>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>
>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)
>>>>>>>>>>>
>>>>>>>>>>> Since Postgres current_timestamp captures the transaction start
>>>>>>>>>>> time, all rows added in the same insert transaction would have
>>>>>>>>>>> the same value as the transaction timestamp, given the column
>>>>>>>>>>> definition above.
>>>>>>>>>>>
>>>>>>>>>>> If we extend a similar semantic to Iceberg, should all rows added
>>>>>>>>>>> in the same Iceberg transaction/snapshot have the same timestamp?
>>>>>>>>>>>
>>>>>>>>>>> Ryan, regarding your comment about using the current_timestamp
>>>>>>>>>>> expression as a column default value: you were thinking that the
>>>>>>>>>>> engine would set the column value to the wall clock time when
>>>>>>>>>>> appending a row to a data file, right? Almost every row would have
>>>>>>>>>>> a different timestamp value.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The `current_timestamp` expression may not always carry the right
>>>>>>>>>>>> semantics for these use cases. E.g., latency tracking is interested
>>>>>>>>>>>> in when records are added / committed to the table, not when the
>>>>>>>>>>>> record was appended to an uncommitted data file in the processing
>>>>>>>>>>>> engine. Record creation and Iceberg commit can be minutes or even
>>>>>>>>>>>> hours apart.
>>>>>>>>>>>>
>>>>>>>>>>>> A row timestamp inherited from the snapshot timestamp has no
>>>>>>>>>>>> overhead at the initial commit and very minimal storage overhead
>>>>>>>>>>>> during file rewrites. A per-row current_timestamp would have a
>>>>>>>>>>>> distinct value for every row and higher storage overhead.
>>>>>>>>>>>>
>>>>>>>>>>>> OLTP databases deal with small row-level transactions. Postgres
>>>>>>>>>>>> `current_timestamp` captures the transaction start time [1, 2].
>>>>>>>>>>>> Should we extend the same semantic to Iceberg: all rows added in
>>>>>>>>>>>> the same snapshot should have the same timestamp value?
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>>>>>>>>>>> To support the created_by timestamp, we would need to
>>>>>>>>>>>>>> implement the following row lineage behavior:
>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>> * During rewrite (like compaction), it should be persisted
>>>>>>>>>>>>>> into data files.
>>>>>>>>>>>>>> * During update, it needs to be carried over from the
>>>>>>>>>>>>>> previous row. This is similar to the row_id carry-over for
>>>>>>>>>>>>>> row updates.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for the shorthand. These are not the same:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Insertion time - the time the row was inserted.
>>>>>>>>>>>>> 2. Created by - the system that created the record.
>>>>>>>>>>>>> 3. Updated by - the system that last updated the record.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Depending on the exact use case, these might or might not have
>>>>>>>>>>>>> utility. I'm just wondering if there will be more examples like
>>>>>>>>>>>>> this in the future.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The created_by column would likely incur significantly higher
>>>>>>>>>>>>>> storage overhead compared to the updated_by column. As rows are
>>>>>>>>>>>>>> updated over time, the cardinality of this column in data files
>>>>>>>>>>>>>> can be high. Hence, the created_by column may not compress well.
>>>>>>>>>>>>>> This is a similar problem for the row_id column. One side effect
>>>>>>>>>>>>>> of enabling row lineage by default for V3 tables is the storage
>>>>>>>>>>>>>> overhead of the row_id column after compaction, especially for
>>>>>>>>>>>>>> narrow tables with few columns.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree. I think this analysis also shows that some consumers
>>>>>>>>>>>>> of Iceberg might not necessarily want to have all these columns,
>>>>>>>>>>>>> so we might want to make them configurable, rather than mandating
>>>>>>>>>>>>> them for all tables. Ryan's thought on default values seems like
>>>>>>>>>>>>> it would solve the issues I was raising.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to
>>>>>>>>>>>>>> application developers. While some databases require an
>>>>>>>>>>>>>> explicit column in the schema, those databases provide triggers
>>>>>>>>>>>>>> to auto-set the column value. For Iceberg, the snapshot
>>>>>>>>>>>>>> timestamp is the closest to the trigger timestamp.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since the use cases don't require an exact timestamp, this
>>>>>>>>>>>>>> seems like the best solution to get what people want (an
>>>>>>>>>>>>>> insertion timestamp) that has clear and well-defined behavior.
>>>>>>>>>>>>>> Since `current_timestamp` is defined by the SQL spec, it makes
>>>>>>>>>>>>>> sense to me that we could use it and have reasonable behavior.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've talked with Anton about this before and maybe he'll jump
>>>>>>>>>>>>>> in on this thread. I think that we may need to extend default
>>>>>>>>>>>>>> values to include default value expressions, like
>>>>>>>>>>>>>> `current_timestamp`, which is allowed by the SQL spec. That
>>>>>>>>>>>>>> would solve the problem as well as some others (like
>>>>>>>>>>>>>> `current_date` or `current_user`) and would not create a
>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp feature in
>>>>>>>>>>>>>> the format.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Also some environments may have stronger clock service,
>>>>>>>>>>>>>> like Spanner TrueTime service.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder commits
>>>>>>>>>>>>>> and make timestamps out of order. I don't think that we should
>>>>>>>>>>>>>> be making guarantees or even exposing metadata that people
>>>>>>>>>>>>>> might mistake as having those guarantees.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regarding the concern for reliable timestamps, we are not
>>>>>>>>>>>>>>> proposing using timestamps for ordering. With NTP in modern
>>>>>>>>>>>>>>> computers, they are generally reliable enough for the intended
>>>>>>>>>>>>>>> use cases. Also, some environments may have a stronger clock
>>>>>>>>>>>>>>> service, like the Spanner TrueTime service
>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> >  joining to timestamps from the snapshots metadata table.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot history,
>>>>>>>>>>>>>>> which is often retained for a few days due to performance 
>>>>>>>>>>>>>>> reasons.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > embedding a timestamp in DML (like `current_timestamp`)
>>>>>>>>>>>>>>> rather than relying on an implicit one from table metadata.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to application
>>>>>>>>>>>>>>> developers. While some databases require an explicit column in
>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto-set the
>>>>>>>>>>>>>>> column value. For Iceberg, the snapshot timestamp is the
>>>>>>>>>>>>>>> closest to the trigger timestamp.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, the timestamp set during computation (like streaming
>>>>>>>>>>>>>>> ingestion or relatively long batch computation) doesn't capture
>>>>>>>>>>>>>>> the time the rows/files are added to the Iceberg table in a
>>>>>>>>>>>>>>> batch fashion.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > And for those use cases, you could also keep a longer
>>>>>>>>>>>>>>> history of snapshot timestamps, like storing a catalog's event
>>>>>>>>>>>>>>> log for long-term access to timestamp info.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is not really consumable by joining a regular table
>>>>>>>>>>>>>>> query with the catalog event log. I would also imagine the
>>>>>>>>>>>>>>> catalog event log is capped at a shorter retention (maybe a few
>>>>>>>>>>>>>>> months) compared to data retention (which could be a few years).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at the
>>>>>>>>>>>>>>>> row level. Timestamps in metadata that would be carried down
>>>>>>>>>>>>>>>> to the row level already confuse people that expect them to be
>>>>>>>>>>>>>>>> useful or reliable, rather than for debugging. I think
>>>>>>>>>>>>>>>> extending this to the row level would only make the problem
>>>>>>>>>>>>>>>> worse.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can already get this information by projecting the last
>>>>>>>>>>>>>>>> updated sequence number, which is reliable, and joining to
>>>>>>>>>>>>>>>> timestamps from the snapshots metadata table. Of course, the
>>>>>>>>>>>>>>>> drawback there is losing the timestamp information when
>>>>>>>>>>>>>>>> snapshots expire, but since it isn't reliable anyway, I'd be
>>>>>>>>>>>>>>>> fine with that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance, are
>>>>>>>>>>>>>>>> probably better served by embedding a timestamp in DML (like
>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit one
>>>>>>>>>>>>>>>> from table metadata. And for those use cases, you could also
>>>>>>>>>>>>>>>> keep a longer history of snapshot timestamps, like storing a
>>>>>>>>>>>>>>>> catalog's event log for long-term access to timestamp info. I
>>>>>>>>>>>>>>>> think that would be better than storing it at the row level.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose a
>>>>>>>>>>>>>>>>> row timestamp concept that can help with many use cases like
>>>>>>>>>>>>>>>>> temporal queries, latency tracking, TTL, auditing, and
>>>>>>>>>>>>>>>>> compliance.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves
>>>>>>>>>>>>>>>>> very similarly to the *_last_updated_sequence_number* column
>>>>>>>>>>>>>>>>> for row lineage (a reader-side sketch follows the list
>>>>>>>>>>>>>>>>> below).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its values are
>>>>>>>>>>>>>>>>>    persisted in the data files.
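>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As a reader-side sketch of the inheritance rule (hypothetical
>>>>>>>>>>>>>>>>> method name, mirroring _last_updated_sequence_number):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   // Use the persisted value if the data file carries the
>>>>>>>>>>>>>>>>>   // column; otherwise fall back to the timestamp of the
>>>>>>>>>>>>>>>>>   // snapshot that added the file.
>>>>>>>>>>>>>>>>>   long lastUpdatedTimestampMs(Long persisted, long snapshotTimestampMs) {
>>>>>>>>>>>>>>>>>     return persisted != null ? persisted : snapshotTimestampMs;
>>>>>>>>>>>>>>>>>   }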
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
