Ryan and I had an offline discussion, and I have revised the proposal as follows.

1. V4 would require writers to generate *monotonic* snapshot timestamps.
The proposal doc has a section that describes a recommended implementation
using Lamport timestamps (a minimal sketch follows this list).
2. Expose a *last_update_timestamp* metadata column that inherits its value
from the snapshot timestamp.
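
Here is a minimal sketch (my illustration, not code from the proposal doc)
of the Lamport-style idea in item 1: take the wall clock, but never go
backwards relative to the last value the catalog issued. A real
implementation would also need to persist the last issued value across
catalog restarts.

  import java.util.concurrent.atomic.AtomicLong;

  public class MonotonicTimestampGenerator {
    // Last issued timestamp in nanoseconds since the epoch; 0 means none yet.
    private final AtomicLong lastIssuedNanos = new AtomicLong(0L);

    /** Returns a strictly increasing nanosecond timestamp for the next commit. */
    public long nextCommitTimestampNanos() {
      long wallClockNanos = System.currentTimeMillis() * 1_000_000L;
      // Lamport-style rule: jump to the wall clock if it is ahead, otherwise
      // tick just past the last issued value.
      return lastIssuedNanos.updateAndGet(prev -> Math.max(wallClockNanos, prev + 1));
    }
  }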

This is a relatively low-friction change that can fix the time travel
problem and enable use cases like latency tracking, temporal queries, TTL,
and auditing.
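
As an illustration of the TTL use case (my own sketch; the column name and
nanosecond precision come from the revised proposal and may change), a TTL
job could decide whether a row has expired from its last_update_timestamp
value alone, without consulting snapshot history:

  import java.time.Duration;
  import java.time.Instant;

  public class RowTtlCheck {
    // Decide whether a row has outlived its TTL, given the proposed
    // last_update_timestamp value in nanoseconds since the epoch.
    static boolean isExpired(long lastUpdateTimestampNanos, Duration ttl) {
      Instant lastUpdate = Instant.ofEpochSecond(
          lastUpdateTimestampNanos / 1_000_000_000L,
          lastUpdateTimestampNanos % 1_000_000_000L);
      return Instant.now().isAfter(lastUpdate.plus(ttl));
    }
  }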

There is no accuracy requirement on the timestamp values. In practice,
modern servers with NTP have pretty reliable wall clocks. E.g., the Java
library already implements a validation
<https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
that protects against backward clock drift of up to one minute for snapshot
timestamps. I don't think we have heard many complaints about commit
failures due to that clock drift validation.
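
For reference, here is a simplified sketch of the kind of check that
validation performs (not the actual Iceberg code; see the link above for
the real implementation): reject a commit whose snapshot timestamp is more
than one minute behind the last known snapshot timestamp.

  public class SnapshotTimestampValidation {
    private static final long ONE_MINUTE_MS = 60_000L;

    // Tolerate small backward clock drift, but fail the commit if the new
    // snapshot timestamp is more than one minute behind the last one.
    static void validate(long lastSnapshotTimestampMs, long newSnapshotTimestampMs) {
      if (newSnapshotTimestampMs < lastSnapshotTimestampMs - ONE_MINUTE_MS) {
        throw new IllegalStateException(String.format(
            "Invalid snapshot timestamp %s: before last snapshot timestamp %s",
            newSnapshotTimestampMs, lastSnapshotTimestampMs));
      }
    }
  }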

I would appreciate feedback on the revised proposal:
https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0

Thanks,
Steven

On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]>
wrote:

> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is usually
> evaluated quite early in engines so we could theoretically have another
> expression closer to the commit time. You are right, though, it won't be
> the actual commit time given that we have to write it into the files. Also,
> I don't think generating a timestamp for a row as it is being written is
> going to be beneficial. To sum up, expression-based defaults would allow us
> to capture the time the transaction or write starts, but not the actual
> commit time.
>
> Russell, if the goal is to know what happened to the table in a given time
> frame, isn't the changelog scan the way to go? It would assign commit
> ordinals based on lineage and include row-level diffs. How would you be
> able to determine changes with row timestamps by just looking at the latest
> snapshot?
>
> It does seem promising to make snapshot timestamps strictly increasing to
> avoid ambiguity during time travel.
>
> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>
>> > Whether or not "t" is an atomic clock time is not as important as the
>> query between time bounds making sense.
>>
>> I'm not sure I get it then. If we want monotonically increasing times,
>> but they don't have to be real times, then how do you know what notion of
>> "time" you care about for these filters? Or to put it another way, how do
>> you know that your "before" and "after" times are reasonable? If the
>> boundaries of these time queries can move around a bit, by how much?
>>
>> It seems to me that row IDs can play an important role here because you
>> have the order guarantee that we seem to want for this use case: if
>> snapshot A was committed before snapshot B, then the rows from A have row
>> IDs that are always less than the row IDs of B. The problem is that we
>> don't know where those row IDs start and end once A and B are no longer
>> tracked. Using a "timestamp" seems to work, but I still worry that without
>> reliable timestamps that correspond with some guarantee to real timestamps,
>> we are creating a feature that seems reliable but isn't.
>>
>> I'm somewhat open to the idea of introducing a snapshot timestamp that
>> the catalog guarantees is monotonically increasing. But if we did that,
>> wouldn't we still need to know the association between these timestamps and
>> snapshots after the snapshot metadata expires? My mental model is that this
>> would be used to look for data that arrived, say, 3 weeks ago on Dec 24th.
>> Since the snapshot metadata is no longer around, we could use the row
>> timestamp to find those rows. But how do we know that the snapshot
>> timestamps correspond to the actual timestamp range of Dec 24th? Is it just
>> "close enough" as long as we don't have out of order timestamps? This is
>> what I mean by needing to keep track of the association between timestamps
>> and snapshots after the metadata expires. Seems like you either need to
>> keep track of what the catalog's clock was for events you care about, or
>> you don't really care about exact timestamps.
>>
>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> The key goal here is the ability to answer the question "what happened
>>> to the table in some time window (before < t < after)?"
>>> Whether or not "t" is an atomic clock time is not as important as the
>>> query between time bounds making sense.
>>> Downstream applications (from what I know) are mostly sensitive to
>>> getting discrete and well-defined answers to
>>> this question like:
>>>
>>> 1 < t < 2 should be exclusive of
>>> 2 < t < 3 should be exclusive of
>>> 3 < t < 4
>>>
>>> And the union of these should be the same as the query asking for 1 < t
>>> < 4
>>>
>>> Currently this is not possible because we have no guarantee of ordering
>>> in our timestamps
>>>
>>> Snapshots
>>> A -> B -> C
>>> Sequence numbers
>>> 50 -> 51 ->  52
>>> Timestamp
>>> 3 -> 1 -> 2
>>>
>>> This makes time travel always a little wrong to start with.
>>>
>>> The Java implementation only allows one minute of negative time on
>>> commit, so we actually kind of do have this as a
>>> "light monotonicity" requirement, but as noted above there is no spec
>>> requirement for this.  While we do have sequence
>>> numbers and row IDs, we still don't have a stable way of associating these
>>> with a consistent time in an engine-independent way.
>>>
>>> Ideally we just want to have one consistent way of answering the
>>> question "what did the table look like at time t"
>>> which I think we get by adding in a new field that is a timestamp, set
>>> by the Catalog close to commit time,
>>> that always goes up.
>>>
>>> I'm not sure we can really do this with an engine expression, since the
>>> engine won't know, at the time it writes the files, when the data will
>>> actually be committed.
>>>
>>>
>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <[email protected]>
>>> wrote:
>>>
>>>> This seems like a lot of new complexity in the format. I would like us
>>>> to explore whether we can build the considered use cases on top of
>>>> expression-based defaults instead.
>>>>
>>>> We already plan to support CURRENT_TIMESTAMP() and similar functions
>>>> that are part of the SQL standard definition for default values. This would
>>>> provide us a way to know the relative row order. True, this usually will
>>>> represent the start of the operation. We may define COMMIT_TIMESTAMP() or a
>>>> similar expression for the actual commit time, if there are use cases that
>>>> need that. Plus, we may explore an approach similar to MySQL that allows
>>>> users to reset the default value on update.
>>>>
>>>> - Anton
>>>>
>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <[email protected]>
>>>> wrote:
>>>>
>>>>> I think this is the right step forward. Our current "timestamp"
>>>>> definition is too ambiguous to be useful, so establishing
>>>>> a well-defined and monotonic timestamp could be really great. I also
>>>>> like the ability for rows to know this value without
>>>>> having to rely on snapshot information, which can be expired.
>>>>>
>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have revised the row timestamp proposal with the following changes:
>>>>>> * a new commit_timestamp field in snapshot metadata that has
>>>>>> nanosecond precision
>>>>>> * this optional field is only set by the REST catalog server
>>>>>> * it needs to be monotonic (e.g., implemented using a Lamport timestamp)
>>>>>>
>>>>>>
>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>
>>>>>>> For long-running streaming jobs that commit periodically, it is
>>>>>>> difficult to establish a constant value of current_timestamp across all
>>>>>>> writer tasks for each commit cycle. I guess streaming writers may just
>>>>>>> need to use the wall clock time at the moment a row is appended to a
>>>>>>> data file as the default value of current_timestamp.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:
>>>>>>>
>>>>>>>> I don't think that every row would have a different value. That
>>>>>>>> would be up to the engine, but I would expect engines to insert
>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a constant,
>>>>>>>> resulting in a consistent value for all rows.
>>>>>>>>
>>>>>>>> You're right that this would not necessarily be the commit time.
>>>>>>>> But neither is the commit timestamp from Iceberg's snapshot. I'm not 
>>>>>>>> sure
>>>>>>>> how we are going to define "good enough" for this purpose. I think at 
>>>>>>>> least
>>>>>>>> `CURRENT_TIMESTAMP` has reliable and known behavior when you look at 
>>>>>>>> how it
>>>>>>>> is handled in engines. And if you want the Iceberg timestamp, then use 
>>>>>>>> a
>>>>>>>> periodic query of the snapshots table to keep track of them in a table 
>>>>>>>> you
>>>>>>>> can join to. I don't think this rises to the need for a table feature
>>>>>>>> unless we can guarantee that it is correct.
>>>>>>>>
>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> > Postgres `current_timestamp` captures the transaction start time
>>>>>>>>> [1, 2]. Should we extend the same semantic to Iceberg: all rows added 
>>>>>>>>> in
>>>>>>>>> the same snapshot should have the same timestamp value?
>>>>>>>>>
>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>
>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)
>>>>>>>>>
>>>>>>>>> Since Postgres current_timestamp captures the transaction start
>>>>>>>>> time, all rows added in the same insert transaction would have the 
>>>>>>>>> same
>>>>>>>>> value as the transaction timestamp with the column definition above.
>>>>>>>>>
>>>>>>>>> If we extend a similar semantic to Iceberg, all rows added in the
>>>>>>>>> same Iceberg transaction/snapshot should have the same timestamp?
>>>>>>>>>
>>>>>>>>> Ryan, regarding your comment about using a current_timestamp
>>>>>>>>> expression as the column default value: you were thinking that the
>>>>>>>>> engine would set the column value to the wall clock time when appending
>>>>>>>>> a row to a data file, right? Almost every row would then have a
>>>>>>>>> different timestamp value.
>>>>>>>>>
>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The `current_timestamp` expression may not always carry the right
>>>>>>>>>> semantics for these use cases. E.g., latency tracking is interested in
>>>>>>>>>> when records are added / committed to the table, not when the record
>>>>>>>>>> was appended to an uncommitted data file in the processing engine.
>>>>>>>>>> Record creation and the Iceberg commit can be minutes or even hours
>>>>>>>>>> apart.
>>>>>>>>>>
>>>>>>>>>> A row timestamp inherited from the snapshot timestamp has no overhead
>>>>>>>>>> at the initial commit and very minimal storage overhead during file
>>>>>>>>>> rewrites. A per-row current_timestamp would have distinct values for
>>>>>>>>>> every row and thus more storage overhead.
>>>>>>>>>>
>>>>>>>>>> OLTP databases deal with small row-level transactions. Postgres
>>>>>>>>>> `current_timestamp` captures the transaction start time [1, 2]. 
>>>>>>>>>> Should we
>>>>>>>>>> extend the same semantic to Iceberg: all rows added in the same 
>>>>>>>>>> snapshot
>>>>>>>>>> should have the same timestamp value?
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>> [2]
>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>>>>>>>>> To support the created_by timestamp, we would need to
>>>>>>>>>>>> implement the following row lineage behavior
>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp
>>>>>>>>>>>> * during rewrite (like compaction), it should be persisted into
>>>>>>>>>>>> data files.
>>>>>>>>>>>> * during update, it needs to be carried over from the previous
>>>>>>>>>>>> row. This is similar to the row_id carry over for row updates.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Sorry for the shorthand.  These are not the same:
>>>>>>>>>>>
>>>>>>>>>>> 1.  Insertion time - time the row was inserted.
>>>>>>>>>>> 2.  Created by - The system that created the record.
>>>>>>>>>>> 3.  Updated by - The system that last updated the record.
>>>>>>>>>>>
>>>>>>>>>>> Depending on the exact use case, these might or might not have
>>>>>>>>>>> utility.  I'm just wondering if there will be more examples like
>>>>>>>>>>> this in the future.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The created_by column would likely incur significantly higher
>>>>>>>>>>>> storage overhead compared to the updated_by column. As rows are 
>>>>>>>>>>>> updated
>>>>>>>>>>>> over time, the cardinality of this column in data files can be 
>>>>>>>>>>>> high. Hence,
>>>>>>>>>>>> the created_by column may not compress well. This is a similar 
>>>>>>>>>>>> problem for
>>>>>>>>>>>> the row_id column. One side effect of enabling row lineage by 
>>>>>>>>>>>> default for
>>>>>>>>>>>> V3 tables is the storage overhead of the row_id column after compaction,
>>>>>>>>>>>> especially for narrow tables with few columns.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I agree.  I think this analysis also shows that some consumers
>>>>>>>>>>> of Iceberg might not necessarily want to have all these columns, so 
>>>>>>>>>>> we
>>>>>>>>>>> might want to make them configurable, rather than mandating them 
>>>>>>>>>>> for all
>>>>>>>>>>> tables. Ryan's thought on default values seems like it would solve 
>>>>>>>>>>> the
>>>>>>>>>>> issues I was raising.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Micah
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > An explicit timestamp column adds more burden to application
>>>>>>>>>>>> developers. While some databases require an explicit column in the 
>>>>>>>>>>>> schema,
>>>>>>>>>>>> those databases provide triggers to auto set the column value. For 
>>>>>>>>>>>> Iceberg,
>>>>>>>>>>>> the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>>>>>>
>>>>>>>>>>>> Since the use cases don't require an exact timestamp, this
>>>>>>>>>>>> seems like the best solution to get what people want (an insertion
>>>>>>>>>>>> timestamp) that has clear and well-defined behavior. Since
>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it makes sense to 
>>>>>>>>>>>> me that
>>>>>>>>>>>> we could use it and have reasonable behavior.
>>>>>>>>>>>>
>>>>>>>>>>>> I've talked with Anton about this before and maybe he'll jump
>>>>>>>>>>>> in on this thread. I think that we may need to extend default 
>>>>>>>>>>>> values to
>>>>>>>>>>>> include default value expressions, like `current_timestamp` that 
>>>>>>>>>>>> is allowed
>>>>>>>>>>>> by the SQL spec. That would solve the problem as well as some 
>>>>>>>>>>>> others (like
>>>>>>>>>>>> `current_date` or `current_user`) and would not create a 
>>>>>>>>>>>> potentially
>>>>>>>>>>>> misleading (and heavyweight) timestamp feature in the format.
>>>>>>>>>>>>
>>>>>>>>>>>> > Also some environments may have stronger clock service, like
>>>>>>>>>>>> Spanner TrueTime service.
>>>>>>>>>>>>
>>>>>>>>>>>> Even in cases like this, commit retries can reorder commits and
>>>>>>>>>>>> make timestamps out of order. I don't think that we should be 
>>>>>>>>>>>> making
>>>>>>>>>>>> guarantees or even exposing metadata that people might mistake as 
>>>>>>>>>>>> having
>>>>>>>>>>>> those guarantees.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding the concern about reliable timestamps, we are not
>>>>>>>>>>>>> proposing to use timestamps for ordering. With NTP in modern
>>>>>>>>>>>>> computers, they are generally reliable enough for the intended use
>>>>>>>>>>>>> cases. Also, some environments may have a stronger clock service,
>>>>>>>>>>>>> like the Spanner
>>>>>>>>>>>>> TrueTime service
>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>
>>>>>>>>>>>>> .
>>>>>>>>>>>>>
>>>>>>>>>>>>> >  joining to timestamps from the snapshots metadata table.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot history,
>>>>>>>>>>>>> which is often retained for a few days due to performance reasons.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > embedding a timestamp in DML (like `current_timestamp`)
>>>>>>>>>>>>> rather than relying on an implicit one from table metadata.
>>>>>>>>>>>>>
>>>>>>>>>>>>> An explicit timestamp column adds more burden to application
>>>>>>>>>>>>> developers. While some databases require an explicit column in 
>>>>>>>>>>>>> the schema,
>>>>>>>>>>>>> those databases provide triggers to auto set the column value. 
>>>>>>>>>>>>> For Iceberg,
>>>>>>>>>>>>> the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, the timestamp set during computation (like streaming
>>>>>>>>>>>>> ingestion or a relatively long batch computation) doesn't capture
>>>>>>>>>>>>> the time the
>>>>>>>>>>>>> rows/files are added to the Iceberg table in a batch fashion.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > And for those use cases, you could also keep a longer
>>>>>>>>>>>>> history of snapshot timestamps, like storing a catalog's event 
>>>>>>>>>>>>> log for
>>>>>>>>>>>>> long-term access to timestamp info
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is not really consumable by joining a regular table
>>>>>>>>>>>>> query with the catalog event log. I would also imagine the catalog
>>>>>>>>>>>>> event log is
>>>>>>>>>>>>> capped at a shorter retention (maybe a few months) compared to data
>>>>>>>>>>>>> retention
>>>>>>>>>>>>> (which could be a few years).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at the
>>>>>>>>>>>>>> row level. Timestamps in metadata that would be carried down to 
>>>>>>>>>>>>>> the row
>>>>>>>>>>>>>> level already confuse people that expect them to be useful or 
>>>>>>>>>>>>>> reliable,
>>>>>>>>>>>>>> rather than for debugging. I think extending this to the row 
>>>>>>>>>>>>>> level would
>>>>>>>>>>>>>> only make the problem worse.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can already get this information by projecting the last
>>>>>>>>>>>>>> updated sequence number, which is reliable, and joining to 
>>>>>>>>>>>>>> timestamps from
>>>>>>>>>>>>>> the snapshots metadata table. Of course, the drawback there is 
>>>>>>>>>>>>>> losing the
>>>>>>>>>>>>>> timestamp information when snapshots expire, but since it isn't 
>>>>>>>>>>>>>> reliable
>>>>>>>>>>>>>> anyway I'd be fine with that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance, are
>>>>>>>>>>>>>> probably better served by embedding a timestamp in DML (like
>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit one from 
>>>>>>>>>>>>>> table
>>>>>>>>>>>>>> metadata. And for those use cases, you could also keep a longer 
>>>>>>>>>>>>>> history of
>>>>>>>>>>>>>> snapshot timestamps, like storing a catalog's event log for 
>>>>>>>>>>>>>> long-term
>>>>>>>>>>>>>> access to timestamp info. I think that would be better than 
>>>>>>>>>>>>>> storing it at
>>>>>>>>>>>>>> the row level.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose the row
>>>>>>>>>>>>>>> timestamp concept that can help with many use cases like 
>>>>>>>>>>>>>>> temporal queries,
>>>>>>>>>>>>>>> latency tracking, TTL, auditing and compliance.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves
>>>>>>>>>>>>>>> very similarly to the *_last_updated_sequence_number* for
>>>>>>>>>>>>>>> row lineage.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>>>    - During rewrite (like compaction), its values are
>>>>>>>>>>>>>>>    persisted in the data files.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
