Anton, you are right that row-level deletes will be a problem for some
of the mentioned use cases (like incremental processing). I have
clarified in the doc that some use cases apply only to "tables with
inserts and updates only".

Right now, we are only tracking the modification/commit time (not the
insertion time) in the case of updates.

On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]>
wrote:

> I think there is clear consensus that making snapshot timestamps strictly
> increasing is a positive thing. I am also +1.
>
> - How will row timestamps allow us to reliably implement incremental
> consumption independent of the snapshot retention given that rows can be
> added AND removed in a particular time frame? How can we capture all
> changes by just looking at the latest snapshot?
> - Some use cases in the doc need the insertion time and some need the last
> modification time. Do we plan to support both?
> - What do we expect the behavior to be in UPDATE and MERGE operations?
>
> To be clear: I am not opposed to this change, just want to make sure I
> understand all use cases that we aim to address and what would be required
> in engines.
>
> On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <[email protected]>
> wrote:
>
>> +1 for improving how the commit timestamps are assigned so that they are
>> monotonic, since this requirement has emerged across multiple discussions,
>> like notifications, multi-table transactions, time travel accuracy, and row
>> timestamps. It would be good to have a single consistent way to represent
>> and assign timestamps that could be leveraged across multiple features.
>>
>> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:
>>
>>> Yeah, to add my perspective on that discussion, I think my primary
>>> concern is that people expect timestamps to be monotonic and if they aren't
>>> then a `_last_update_timestamp` field just makes the problem worse. But it
>>> is _nice_ to have row-level timestamps. So I would be okay if we revisit
>>> how we assign commit timestamps and improve it so that you get monotonic
>>> behavior.
>>>
>>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:
>>>
>>>> We had an offline discussion with Ryan. I revised the proposal as
>>>> follows.
>>>>
>>>> 1. V4 would require writers to generate *monotonic* snapshot
>>>> timestamps. The proposal doc has a section that describes a recommended
>>>> implementation using Lamport timestamps.
>>>> 2. Expose a *last_update_timestamp* metadata column that inherits from
>>>> the snapshot timestamp.
>>>>
>>>> This is a relatively low-friction change that can fix the time travel
>>>> problem and enable use cases like latency tracking, temporal queries,
>>>> TTL, and auditing.
>>>>
>>>> There is no accuracy requirement on the timestamp values. In practice,
>>>> modern servers with NTP have pretty reliable wall clocks. E.g., the Java
>>>> library implements this validation
>>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
>>>> that protects against backward clock drift of up to one minute for snapshot
>>>> timestamps. I don't think we have heard many complaints of commit failures
>>>> due to that clock drift validation.
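>>>>
>>>> For illustration, here is a minimal writer-side sketch of the
>>>> Lamport-style assignment (the names are made up for this example; they
>>>> are not from the proposal or the Java library):
>>>>
>>>>   static long monotonicSnapshotTimestampMillis(long parentTimestampMillis) {
>>>>     long now = System.currentTimeMillis();
>>>>     // Mirror the existing tolerance: reject commits whose local clock
>>>>     // is more than one minute behind the parent snapshot's timestamp.
>>>>     if (parentTimestampMillis - now > 60_000L) {
>>>>       throw new IllegalStateException(
>>>>           "Local clock is more than one minute behind the last snapshot");
>>>>     }
>>>>     // Lamport-style bump: never assign a timestamp at or below the
>>>>     // parent's, so snapshot timestamps are strictly increasing even
>>>>     // when the local wall clock drifts slightly backwards.
>>>>     return Math.max(now, parentTimestampMillis + 1);
>>>>   }
>>>>
>>>> Whether this logic lives in the writer or in the catalog, and the exact
>>>> drift tolerance, would follow whatever the proposal doc specifies.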
>>>>
>>>> Would appreciate feedback on the revised proposal.
>>>>
>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]>
>>>> wrote:
>>>>
>>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is
>>>>> usually evaluated quite early in engines so we could theoretically have
>>>>> another expression closer to the commit time. You are right, though, it
>>>>> won't be the actual commit time given that we have to write it into the
>>>>> files. Also, I don't think generating a timestamp for a row as it is being
>>>>> written is going to be beneficial. To sum up, expression-based defaults
>>>>> would allow us to capture the time the transaction or write starts, but 
>>>>> not
>>>>> the actual commit time.
>>>>>
>>>>> Russell, if the goal is to know what happened to the table in a given
>>>>> time frame, isn't the changelog scan the way to go? It would assign commit
>>>>> ordinals based on lineage and include row-level diffs. How would you be
>>>>> able to determine changes with row timestamps by just looking at the 
>>>>> latest
>>>>> snapshot?
>>>>>
>>>>> It does seem promising to make snapshot timestamps strictly increasing
>>>>> to avoid ambiguity during time travel.
>>>>>
>>>>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> > Whether or not "t" is an atomic clock time is not as important as
>>>>>> the query between time bounds making sense.
>>>>>>
>>>>>> I'm not sure I get it then. If we want monotonically increasing
>>>>>> times, but they don't have to be real times, then how do you know what
>>>>>> notion of "time" you care about for these filters? Or to put it another
>>>>>> way, how do you know that your "before" and "after" times are reasonable?
>>>>>> If the boundaries of these time queries can move around a bit, by how 
>>>>>> much?
>>>>>>
>>>>>> It seems to me that row IDs can play an important role here because
>>>>>> you have the order guarantee that we seem to want for this use case: if
>>>>>> snapshot A was committed before snapshot B, then the rows from A have row
>>>>>> IDs that are always less than the row IDs of B. The problem is that we
>>>>>> don't know where those row IDs start and end once A and B are no longer
>>>>>> tracked. Using a "timestamp" seems to work, but I still worry that 
>>>>>> without
>>>>>> reliable timestamps that correspond with some guarantee to real 
>>>>>> timestamps,
>>>>>> we are creating a feature that seems reliable but isn't.
>>>>>>
>>>>>> I'm somewhat open to the idea of introducing a snapshot timestamp
>>>>>> that the catalog guarantees is monotonically increasing. But if we did
>>>>>> that, wouldn't we still need to know the association between these
>>>>>> timestamps and snapshots after the snapshot metadata expires? My mental
>>>>>> model is that this would be used to look for data that arrived, say, 3
>>>>>> weeks ago on Dec 24th. Since the snapshots metadata is no longer around 
>>>>>> we
>>>>>> could use the row timestamp to find those rows. But how do we know that 
>>>>>> the
>>>>>> snapshot timestamps correspond to the actual timestamp range of Dec 24th?
>>>>>> Is it just "close enough" as long as we don't have out of order 
>>>>>> timestamps?
>>>>>> This is what I mean by needing to keep track of the association between
>>>>>> timestamps and snapshots after the metadata expires. Seems like you 
>>>>>> either
>>>>>> need to keep track of what the catalog's clock was for events you care
>>>>>> about, or you don't really care about exact timestamps.
>>>>>>
>>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> The key goal here is the ability to answer the question "what
>>>>>>> happened to the table in some time window (before < t < after)?"
>>>>>>> Whether or not "t" is an atomic clock time is not as important as
>>>>>>> the query between time bounds making sense.
>>>>>>> Downstream applications (from what I know) are mostly sensitive to
>>>>>>> getting discrete and well-defined answers to
>>>>>>> this question, like:
>>>>>>>
>>>>>>> 1 < t < 2 should be exclusive of
>>>>>>> 2 < t < 3 should be exclusive of
>>>>>>> 3 < t < 4
>>>>>>>
>>>>>>> And the union of these should be the same as the query asking for 1
>>>>>>> < t < 4
>>>>>>>
>>>>>>> Currently this is not possible because we have no guarantee of
>>>>>>> ordering in our timestamps:
>>>>>>>
>>>>>>> Snapshots:
>>>>>>> A -> B -> C
>>>>>>> Sequence numbers:
>>>>>>> 50 -> 51 -> 52
>>>>>>> Timestamps:
>>>>>>> 3 -> 1 -> 2
>>>>>>>
>>>>>>> This makes time travel always a little wrong to start with.
>>>>>>>
>>>>>>> The Java implementation only allows one minute of negative time on
>>>>>>> commit, so we actually kind of do have this as a
>>>>>>> "light monotonicity" requirement, but as noted above there is no spec
>>>>>>> requirement for this.  While we do have sequence
>>>>>>> numbers and row IDs, we still don't have a stable way of associating
>>>>>>> these with a consistent time in an engine-independent way.
>>>>>>>
>>>>>>> Ideally we just want to have one consistent way of answering the
>>>>>>> question "what did the table look like at time t",
>>>>>>> which I think we get by adding a new field that is a timestamp,
>>>>>>> set by the Catalog close to commit time,
>>>>>>> that always goes up.
>>>>>>>
>>>>>>> I'm not sure we can really do this with an engine expression, since
>>>>>>> the engine won't know when the data is actually committed
>>>>>>> at the time it writes the files?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> This seems like a lot of new complexity in the format. I would like
>>>>>>>> us to explore whether we can build the considered use cases on top of
>>>>>>>> expression-based defaults instead.
>>>>>>>>
>>>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar
>>>>>>>> functions that are part of the SQL standard definition for default 
>>>>>>>> values.
>>>>>>>> This would give us a way to know the relative row order. True, this
>>>>>>>> will usually represent the start of the operation. We may define
>>>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual commit time,
>>>>>>>> if there are use cases that need that. Plus, we may explore an approach
>>>>>>>> similar to MySQL's, which allows users to reset the default value on
>>>>>>>> update.
>>>>>>>>
>>>>>>>> - Anton
>>>>>>>>
>>>>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think this is the right step forward. Our current "timestamp"
>>>>>>>>> definition is too ambiguous to be useful, so establishing
>>>>>>>>> a well-defined and monotonic timestamp could be really great. I
>>>>>>>>> also like the ability for rows to know this value without
>>>>>>>>> having to rely on snapshot information, which can be expired.
>>>>>>>>>
>>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I have revised the row timestamp proposal with the following
>>>>>>>>>> changes:
>>>>>>>>>> * a new commit_timestamp field in snapshot metadata that has
>>>>>>>>>> nanosecond precision
>>>>>>>>>> * this optional field is only set by the REST catalog server
>>>>>>>>>> * it needs to be monotonic (e.g. implemented using a Lamport
>>>>>>>>>> timestamp)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>>>
>>>>>>>>>>> For long-running streaming jobs that commit periodically, it is
>>>>>>>>>>> difficult to establish a constant value of current_timestamp across
>>>>>>>>>>> all writer tasks for each commit cycle. I guess streaming writers may
>>>>>>>>>>> just need to write the wall clock time when appending a row to a data
>>>>>>>>>>> file as the default value of current_timestamp.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don't think that every row would have a different value. That
>>>>>>>>>>>> would be up to the engine, but I would expect engines to insert
>>>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a 
>>>>>>>>>>>> constant,
>>>>>>>>>>>> resulting in a consistent value for all rows.
>>>>>>>>>>>>
>>>>>>>>>>>> You're right that this would not necessarily be the commit
>>>>>>>>>>>> time. But neither is the commit timestamp from Iceberg's snapshot. 
>>>>>>>>>>>> I'm not
>>>>>>>>>>>> sure how we are going to define "good enough" for this purpose. I 
>>>>>>>>>>>> think at
>>>>>>>>>>>> least `CURRENT_TIMESTAMP` has reliable and known behavior when you 
>>>>>>>>>>>> look at
>>>>>>>>>>>> how it is handled in engines. And if you want the Iceberg 
>>>>>>>>>>>> timestamp, then
>>>>>>>>>>>> use a periodic query of the snapshots table to keep track of them
>>>>>>>>>>>> in a
>>>>>>>>>>>> table you can join to. I don't think this rises to the need for a 
>>>>>>>>>>>> table
>>>>>>>>>>>> feature unless we can guarantee that it is correct.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > Postgres `current_timestamp` captures the transaction start
>>>>>>>>>>>>> time [1, 2]. Should we extend the same semantic to Iceberg: all 
>>>>>>>>>>>>> rows added
>>>>>>>>>>>>> in the same snapshot should have the same timestamp value?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>>>
>>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since Postgres current_timestamp captures the transaction
>>>>>>>>>>>>> start time, all rows added in the same insert transaction would 
>>>>>>>>>>>>> have the
>>>>>>>>>>>>> same value as the transaction timestamp with the column 
>>>>>>>>>>>>> definition above.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we extend a similar semantic to Iceberg, all rows added in
>>>>>>>>>>>>> the same Iceberg transaction/snapshot should have the same 
>>>>>>>>>>>>> timestamp?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan, regarding your comment about using the current_timestamp
>>>>>>>>>>>>> expression as a column default value: were you thinking that the
>>>>>>>>>>>>> engine would set the column value to the wall clock time when
>>>>>>>>>>>>> appending a row to a data file? In that case, almost every row
>>>>>>>>>>>>> would have a different timestamp value.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The `current_timestamp` expression may not always carry the right
>>>>>>>>>>>>>> semantics for these use cases. E.g., latency tracking is interested
>>>>>>>>>>>>>> in when records are added/committed to the table, not when the
>>>>>>>>>>>>>> record was appended to an uncommitted data file in the processing
>>>>>>>>>>>>>> engine. Record creation and the Iceberg commit can be minutes or
>>>>>>>>>>>>>> even hours apart.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A row timestamp inherited from the snapshot timestamp has no
>>>>>>>>>>>>>> overhead at the initial commit and very minimal storage overhead
>>>>>>>>>>>>>> during file rewrites. A per-row current_timestamp would have
>>>>>>>>>>>>>> distinct values for every row and more storage overhead.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OLTP databases deal with small row-level transactions.
>>>>>>>>>>>>>> Postgres `current_timestamp` captures the transaction start time 
>>>>>>>>>>>>>> [1, 2].
>>>>>>>>>>>>>> Should we extend the same semantic to Iceberg: all rows added in 
>>>>>>>>>>>>>> the same
>>>>>>>>>>>>>> snapshot should have the same timestamp value?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>>>>>>>>>>>>> To support the created_by timestamp, we would need to
>>>>>>>>>>>>>>>> implement the following row lineage behavior:
>>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>>>> * During rewrite (like compaction), it should be persisted
>>>>>>>>>>>>>>>> into data files.
>>>>>>>>>>>>>>>> * During update, it needs to be carried over from the
>>>>>>>>>>>>>>>> previous row. This is similar to the row_id carry-over for row
>>>>>>>>>>>>>>>> updates.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sorry for the shorthand.  These are not the same:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1.  Insertion time - the time the row was inserted.
>>>>>>>>>>>>>>> 2.  Created by - the system that created the record.
>>>>>>>>>>>>>>> 3.  Updated by - the system that last updated the record.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Depending on the exact use case, these might or might not
>>>>>>>>>>>>>>> have utility.  I'm just wondering if there will be more examples
>>>>>>>>>>>>>>> like this in the future.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The created_by column would likely incur significantly higher
>>>>>>>>>>>>>>>> storage overhead compared to the updated_by column. As rows
>>>>>>>>>>>>>>>> are updated over time, the cardinality of this column in data
>>>>>>>>>>>>>>>> files can be high. Hence, the created_by column may not compress
>>>>>>>>>>>>>>>> well. This is a similar problem for the row_id column. One side
>>>>>>>>>>>>>>>> effect of enabling row lineage by default for V3 tables is the
>>>>>>>>>>>>>>>> storage overhead of the row_id column after compaction,
>>>>>>>>>>>>>>>> especially for narrow tables with few columns.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree.  I think this analysis also shows that some
>>>>>>>>>>>>>>> consumers of Iceberg might not necessarily want to have all 
>>>>>>>>>>>>>>> these columns,
>>>>>>>>>>>>>>> so we might want to make them configurable, rather than 
>>>>>>>>>>>>>>> mandating them for
>>>>>>>>>>>>>>> all tables. Ryan's thought on default values seems like it 
>>>>>>>>>>>>>>> would solve the
>>>>>>>>>>>>>>> issues I was raising.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set the 
>>>>>>>>>>>>>>>> column value.
>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the 
>>>>>>>>>>>>>>>> trigger timestamp.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since the use cases don't require an exact timestamp, this
>>>>>>>>>>>>>>>> seems like the best solution to get what people want (an 
>>>>>>>>>>>>>>>> insertion
>>>>>>>>>>>>>>>> timestamp) that has clear and well-defined behavior. Since
>>>>>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it makes sense 
>>>>>>>>>>>>>>>> to me that
>>>>>>>>>>>>>>>> we could use it and have reasonable behavior.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've talked with Anton about this before and maybe he'll
>>>>>>>>>>>>>>>> jump in on this thread. I think that we may need to extend 
>>>>>>>>>>>>>>>> default values
>>>>>>>>>>>>>>>> to include default value expressions, like `current_timestamp` 
>>>>>>>>>>>>>>>> that is
>>>>>>>>>>>>>>>> allowed by the SQL spec. That would solve the problem as well 
>>>>>>>>>>>>>>>> as some
>>>>>>>>>>>>>>>> others (like `current_date` or `current_user`) and would not 
>>>>>>>>>>>>>>>> create a
>>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp feature in 
>>>>>>>>>>>>>>>> the format.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > Also some environments may have stronger clock service,
>>>>>>>>>>>>>>>> like Spanner TrueTime service.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder commits
>>>>>>>>>>>>>>>> and make timestamps out of order. I don't think that we should 
>>>>>>>>>>>>>>>> be making
>>>>>>>>>>>>>>>> guarantees or even exposing metadata that people might mistake 
>>>>>>>>>>>>>>>> as having
>>>>>>>>>>>>>>>> those guarantees.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regarding the concern about reliable timestamps, we are not
>>>>>>>>>>>>>>>>> proposing to use timestamps for ordering. With NTP in modern
>>>>>>>>>>>>>>>>> computers, they are generally reliable enough for the intended
>>>>>>>>>>>>>>>>> use cases. Also, some environments may have a stronger clock
>>>>>>>>>>>>>>>>> service, like the Spanner TrueTime service
>>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>
>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> >  joining to timestamps from the snapshots metadata
>>>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot history,
>>>>>>>>>>>>>>>>> which is often retained for a few days due to performance 
>>>>>>>>>>>>>>>>> reasons.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like `current_timestamp`)
>>>>>>>>>>>>>>>>> rather than relying on an implicit one from table metadata.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set the 
>>>>>>>>>>>>>>>>> column value.
>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the 
>>>>>>>>>>>>>>>>> trigger timestamp.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, the timestamp set during computation (like streaming
>>>>>>>>>>>>>>>>> ingestion or a relatively long batch computation) doesn't
>>>>>>>>>>>>>>>>> capture the time the rows/files are added to the Iceberg table
>>>>>>>>>>>>>>>>> in a batch fashion.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a longer
>>>>>>>>>>>>>>>>> history of snapshot timestamps, like storing a catalog's 
>>>>>>>>>>>>>>>>> event log for
>>>>>>>>>>>>>>>>> long-term access to timestamp info
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is not really consumable by joining a regular table
>>>>>>>>>>>>>>>>> query with the catalog event log. I would also imagine the
>>>>>>>>>>>>>>>>> catalog event log is capped at a shorter retention (maybe a few
>>>>>>>>>>>>>>>>> months) compared to data retention (which could be a few years).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at
>>>>>>>>>>>>>>>>>> the row level. Timestamps in metadata that would be carried 
>>>>>>>>>>>>>>>>>> down to the row
>>>>>>>>>>>>>>>>>> level already confuse people that expect them to be useful 
>>>>>>>>>>>>>>>>>> or reliable,
>>>>>>>>>>>>>>>>>> rather than for debugging. I think extending this to the row 
>>>>>>>>>>>>>>>>>> level would
>>>>>>>>>>>>>>>>>> only make the problem worse.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can already get this information by projecting the
>>>>>>>>>>>>>>>>>> last updated sequence number, which is reliable, and joining 
>>>>>>>>>>>>>>>>>> to timestamps
>>>>>>>>>>>>>>>>>> from the snapshots metadata table. Of course, the drawback 
>>>>>>>>>>>>>>>>>> there is losing
>>>>>>>>>>>>>>>>>> the timestamp information when snapshots expire, but since 
>>>>>>>>>>>>>>>>>> it isn't
>>>>>>>>>>>>>>>>>> reliable anyway I'd be fine with that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance, are
>>>>>>>>>>>>>>>>>> probably better served by embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit one 
>>>>>>>>>>>>>>>>>> from table
>>>>>>>>>>>>>>>>>> metadata. And for those use cases, you could also keep a 
>>>>>>>>>>>>>>>>>> longer history of
>>>>>>>>>>>>>>>>>> snapshot timestamps, like storing a catalog's event log for 
>>>>>>>>>>>>>>>>>> long-term
>>>>>>>>>>>>>>>>>> access to timestamp info. I think that would be better than 
>>>>>>>>>>>>>>>>>> storing it at
>>>>>>>>>>>>>>>>>> the row level.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose the
>>>>>>>>>>>>>>>>>>> row timestamp concept that can help with many use cases like
>>>>>>>>>>>>>>>>>>> temporal queries, latency tracking, TTL, auditing and
>>>>>>>>>>>>>>>>>>> compliance.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column
>>>>>>>>>>>>>>>>>>> behaves very similarly to the
>>>>>>>>>>>>>>>>>>> *_last_updated_sequence_number* for row lineage.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its values are
>>>>>>>>>>>>>>>>>>>    persisted in the data files.
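>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To illustrate the intended inheritance, a rough reader-side
>>>>>>>>>>>>>>>>>>> sketch (names are made up here, analogous to how
>>>>>>>>>>>>>>>>>>> _last_updated_sequence_number is resolved):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   // A row written by the current snapshot has no persisted
>>>>>>>>>>>>>>>>>>>   // value and inherits the snapshot timestamp; a rewritten
>>>>>>>>>>>>>>>>>>>   // row keeps the value persisted in its data file.
>>>>>>>>>>>>>>>>>>>   static long lastUpdatedTimestampMs(Long persisted, long snapshotTsMs) {
>>>>>>>>>>>>>>>>>>>     return persisted != null ? persisted : snapshotTsMs;
>>>>>>>>>>>>>>>>>>>   }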
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
