Re: [DISCUSS] row timestamp proposal

Steven Wu Fri, 16 Jan 2026 14:30:12 -0800

Joining with snapshot history also has significant complexity. It requires
retaining the entire snapshot history with probably trimmed snapshot
metadata. There are concerns on the size of the snapshot history for tables
with frequent commits (like streaming ingestion). Do we maintain the
unbounded trimmed snapshot history in the same table metadata, which could
affect table metadata.json size? or store it separately somewhere (like in
catalog), which would require the complexity of multi-entity transaction in
catalog?



On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <[email protected]>
wrote:

> I've gone back and forth on the inherited columns. I think the thing which
> keeps coming back to me is that I don't
> like that the only way to determine the timestamp associated with a row
> update/creation is to do a join back
> against table metadata. While that's doable, It feels user unfriendly.
>
>
>
> On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]> wrote:
>
>> Anton, you are right that the row-level deletes will be a problem for
>> some of the mentioned use cases (like incremental processing). I have
>> clarified the applicability of some use cases to "tables with inserts and
>> updates only".
>>
>> Right now, we are only tracking modification/commit time (not insertion
>> time) in case of updates.
>>
>> On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> I think there is clear consensus that making snapshot timestamps
>>> strictly increasing is a positive thing. I am also +1.
>>>
>>> - How will row timestamps allow us to reliably implement incremental
>>> consumption independent of the snapshot retention given that rows can be
>>> added AND removed in a particular time frame? How can we capture all
>>> changes by just looking at the latest snapshot?
>>> - Some use cases in the doc need the insertion time and some need the
>>> last modification time. Do we plan to support both?
>>> - What do we expect the behavior to be in UPDATE and MERGE operations?
>>>
>>> To be clear: I am not opposed to this change, just want to make sure I
>>> understand all use cases that we aim to address and what would be required
>>> in engines.
>>>
>>> чт, 15 січ. 2026 р. о 17:01 Maninder Parmar <
>>> [email protected]> пише:
>>>
>>>> +1 for improving how the commit timestamps are assigned monotonically
>>>> since this requirement has emerged over multiple discussions like
>>>> notifications, multi-table transactions, time travel accuracy and row
>>>> timestamps. It would be good to have a single consistent way to represent
>>>> and assign timestamps that could be leveraged across multiple features.
>>>>
>>>> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> Yeah, to add my perspective on that discussion, I think my primary
>>>>> concern is that people expect timestamps to be monotonic and if they 
>>>>> aren't
>>>>> then a `_last_update_timestamp` field just makes the problem worse. But it
>>>>> is _nice_ to have row-level timestamps. So I would be okay if we revisit
>>>>> how we assign commit timestamps and improve it so that you get monotonic
>>>>> behavior.
>>>>>
>>>>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> We had an offline discussion with Ryan. I revised the proposal as
>>>>>> follows.
>>>>>>
>>>>>> 1. V4 would require writers to generate *monotonic* snapshot
>>>>>> timestamps. The proposal doc has a section that describes a recommended
>>>>>> implementation using lamport timestamps.
>>>>>> 2. Expose *last_update_timestamp* metadata column that inherits from
>>>>>> snapshot timestamp
>>>>>>
>>>>>> This is a relatively low-friction change that can fix the time travel
>>>>>> problem and enable use cases like latency tracking, temporal query, TTL,
>>>>>> auditing.
>>>>>>
>>>>>> There is no accuracy requirement on the timestamp values. In
>>>>>> practice, modern servers with NTP have pretty reliable wall clocks. E.g.,
>>>>>> Java library implemented this validation
>>>>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
>>>>>>  that
>>>>>> protects against backward clock drift up to one minute for snapshot
>>>>>> timestamps. Don't think we have heard many complaints of commit failure 
>>>>>> due
>>>>>> to that clock drift validation.
>>>>>>
>>>>>> Would appreciate feedback on the revised proposal.
>>>>>>
>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is
>>>>>>> usually evaluated quite early in engines so we could theoretically have
>>>>>>> another expression closer to the commit time. You are right, though, it
>>>>>>> won't be the actual commit time given that we have to write it into the
>>>>>>> files. Also, I don't think generating a timestamp for a row as it is 
>>>>>>> being
>>>>>>> written is going to be beneficial. To sum up, expression-based defaults
>>>>>>> would allow us to capture the time the transaction or write starts, but 
>>>>>>> not
>>>>>>> the actual commit time.
>>>>>>>
>>>>>>> Russell, if the goal is to know what happened to the table in a
>>>>>>> given time frame, isn't the changelog scan the way to go? It would 
>>>>>>> assign
>>>>>>> commit ordinals based on lineage and include row-level diffs. How would 
>>>>>>> you
>>>>>>> be able to determine changes with row timestamps by just looking at the
>>>>>>> latest snapshot?
>>>>>>>
>>>>>>> It does seem promising to make snapshot timestamps strictly
>>>>>>> increasing to avoid ambiguity during time travel.
>>>>>>>
>>>>>>> вт, 13 січ. 2026 р. о 16:33 Ryan Blue <[email protected]> пише:
>>>>>>>
>>>>>>>> > Whether or not "t" is an atomic clock time is not as important as
>>>>>>>> the query between time bounds making sense.
>>>>>>>>
>>>>>>>> I'm not sure I get it then. If we want monotonically increasing
>>>>>>>> times, but they don't have to be real times then how do you know what
>>>>>>>> notion of "time" you care about for these filters? Or to put it another
>>>>>>>> way, how do you know that your "before" and "after" times are 
>>>>>>>> reasonable?
>>>>>>>> If the boundaries of these time queries can move around a bit, by how 
>>>>>>>> much?
>>>>>>>>
>>>>>>>> It seems to me that row IDs can play an important role here because
>>>>>>>> you have the order guarantee that we seem to want for this use case: if
>>>>>>>> snapshot A was committed before snapshot B, then the rows from A have 
>>>>>>>> row
>>>>>>>> IDs that are always less than the rows IDs of B. The problem is that we
>>>>>>>> don't know where those row IDs start and end once A and B are no longer
>>>>>>>> tracked. Using a "timestamp" seems to work, but I still worry that 
>>>>>>>> without
>>>>>>>> reliable timestamps that correspond with some guarantee to real 
>>>>>>>> timestamps,
>>>>>>>> we are creating a feature that seems reliable but isn't.
>>>>>>>>
>>>>>>>> I'm somewhat open to the idea of introducing a snapshot timestamp
>>>>>>>> that the catalog guarantees is monotonically increasing. But if we did
>>>>>>>> that, wouldn't we still need to know the association between these
>>>>>>>> timestamps and snapshots after the snapshot metadata expires? My mental
>>>>>>>> model is that this would be used to look for data that arrived, say, 3
>>>>>>>> weeks ago on Dec 24th. Since the snapshots metadata is no longer 
>>>>>>>> around we
>>>>>>>> could use the row timestamp to find those rows. But how do we know 
>>>>>>>> that the
>>>>>>>> snapshot timestamps correspond to the actual timestamp range of Dec 
>>>>>>>> 24th?
>>>>>>>> Is it just "close enough" as long as we don't have out of order 
>>>>>>>> timestamps?
>>>>>>>> This is what I mean by needing to keep track of the association between
>>>>>>>> timestamps and snapshots after the metadata expires. Seems like you 
>>>>>>>> either
>>>>>>>> need to keep track of what the catalog's clock was for events you care
>>>>>>>> about, or you don't really care about exact timestamps.
>>>>>>>>
>>>>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> The key goal here is the ability to answer the question "what
>>>>>>>>> happened to the table in some time window. (before < t < after)?"
>>>>>>>>> Whether or not "t" is an atomic clock time is not as important as
>>>>>>>>> the query between time bounds making sense.
>>>>>>>>> Downstream applications (from what I know) are mostly sensitive to
>>>>>>>>> getting discrete and well defined answers to
>>>>>>>>> this question like:
>>>>>>>>>
>>>>>>>>> 1 < t < 2 should be exclusive of
>>>>>>>>> 2 < t < 3 should be exclusive of
>>>>>>>>> 3 < t < 4
>>>>>>>>>
>>>>>>>>> And the union of these should be the same as the query asking for
>>>>>>>>> 1 < t < 4
>>>>>>>>>
>>>>>>>>> Currently this is not possible because we have no guarantee of
>>>>>>>>> ordering in our timestamps
>>>>>>>>>
>>>>>>>>> Snapshots
>>>>>>>>> A -> B -> C
>>>>>>>>> Sequence numbers
>>>>>>>>> 50 -> 51 ->  52
>>>>>>>>> Timestamp
>>>>>>>>> 3 -> 1 -> 2
>>>>>>>>>
>>>>>>>>> This makes time travel always a little wrong to start with.
>>>>>>>>>
>>>>>>>>> The Java implementation only allows one minute of negative time on
>>>>>>>>> commit so we actually kind of do have this as a
>>>>>>>>> "light monotonicity" requirement but as noted above there is no
>>>>>>>>> spec requirement for this.  While we do have sequence
>>>>>>>>> number and row id, we still don't have a stable way of associating
>>>>>>>>> these with a consistent time in an engine independent way.
>>>>>>>>>
>>>>>>>>> Ideally we just want to have one consistent way of answering the
>>>>>>>>> question "what did the table look like at time t"
>>>>>>>>> which I think we get by adding in a new field that is a timestamp,
>>>>>>>>> set by the Catalog close to commit time,
>>>>>>>>> that always goes up.
>>>>>>>>>
>>>>>>>>> I'm not sure we can really do this with an engine expression since
>>>>>>>>> they won't know when the data is actually committed
>>>>>>>>> when writing files?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> This seems like a lot of new complexity in the format. I would
>>>>>>>>>> like us to explore whether we can build the considered use cases on 
>>>>>>>>>> top of
>>>>>>>>>> expression-based defaults instead.
>>>>>>>>>>
>>>>>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar
>>>>>>>>>> functions that are part of the SQL standard definition for default 
>>>>>>>>>> values.
>>>>>>>>>> This would provide us a way to know the relative row order. True, 
>>>>>>>>>> this
>>>>>>>>>> usually will represent the start of the operation. We may define
>>>>>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual commit 
>>>>>>>>>> time, if
>>>>>>>>>> there are use cases that need that. Plus, we may explore an approach
>>>>>>>>>> similar to MySQL that allows users to reset the default value on 
>>>>>>>>>> update.
>>>>>>>>>>
>>>>>>>>>> - Anton
>>>>>>>>>>
>>>>>>>>>> вт, 13 січ. 2026 р. о 11:04 Russell Spitzer <
>>>>>>>>>> [email protected]> пише:
>>>>>>>>>>
>>>>>>>>>>> I think this is the right step forward. Our current "timestamp"
>>>>>>>>>>> definition is too ambiguous to be useful so establishing
>>>>>>>>>>> a well defined and monotonic timestamp could be really great. I
>>>>>>>>>>> also like the ability for row's to know this value without
>>>>>>>>>>> having to rely on snapshot information which can be expired.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I have revised the row timestamp proposal with the following
>>>>>>>>>>>> changes.
>>>>>>>>>>>> * a new commit_timestamp field in snapshot metadata that has
>>>>>>>>>>>> nano-second precision.
>>>>>>>>>>>> * this optional field is only set by the REST catalog server
>>>>>>>>>>>> * it needs to be monotonic (e.g. implemented using Lamport
>>>>>>>>>>>> timestamp)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Steven
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For long-running streaming jobs that commit periodically, it
>>>>>>>>>>>>> is difficult to establish the constant value of current_timestamp 
>>>>>>>>>>>>> across
>>>>>>>>>>>>> all writer tasks for each commit cycle. I guess streaming writers 
>>>>>>>>>>>>> may just
>>>>>>>>>>>>> need to write the wall clock time when appending a row to a data 
>>>>>>>>>>>>> file for
>>>>>>>>>>>>> the default value of current_timestamp.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think that every row would have a different value.
>>>>>>>>>>>>>> That would be up to the engine, but I would expect engines to 
>>>>>>>>>>>>>> insert
>>>>>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a 
>>>>>>>>>>>>>> constant,
>>>>>>>>>>>>>> resulting in a consistent value for all rows.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You're right that this would not necessarily be the commit
>>>>>>>>>>>>>> time. But neither is the commit timestamp from Iceberg's 
>>>>>>>>>>>>>> snapshot. I'm not
>>>>>>>>>>>>>> sure how we are going to define "good enough" for this purpose. 
>>>>>>>>>>>>>> I think at
>>>>>>>>>>>>>> least `CURRENT_TIMESTAMP` has reliable and known behavior when 
>>>>>>>>>>>>>> you look at
>>>>>>>>>>>>>> how it is handled in engines. And if you want the Iceberg 
>>>>>>>>>>>>>> timestamp, then
>>>>>>>>>>>>>> use a periodic query of the snapshot stable to keep track of 
>>>>>>>>>>>>>> them in a
>>>>>>>>>>>>>> table you can join to. I don't think this rises to the need for 
>>>>>>>>>>>>>> a table
>>>>>>>>>>>>>> feature unless we can guarantee that it is correct.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > Postgres `current_timestamp` captures the
>>>>>>>>>>>>>>> transaction start time [1, 2]. Should we extend the same 
>>>>>>>>>>>>>>> semantic to
>>>>>>>>>>>>>>> Iceberg: all rows added in the same snapshot should have the 
>>>>>>>>>>>>>>> same timestamp
>>>>>>>>>>>>>>> value?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT
>>>>>>>>>>>>>>> CURRENT_TIMESTAMP)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since Postgres current_timestamp captures the transaction
>>>>>>>>>>>>>>> start time, all rows added in the same insert transaction would 
>>>>>>>>>>>>>>> have the
>>>>>>>>>>>>>>> same value as the transaction timestamp with the column 
>>>>>>>>>>>>>>> definition above.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we extend a similar semantic to Iceberg, all rows added
>>>>>>>>>>>>>>> in the same Iceberg transaction/snapshot should have the same 
>>>>>>>>>>>>>>> timestamp?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ryan, I understand your comment for using current_timestamp
>>>>>>>>>>>>>>> expression as column default value, you were thinking that the 
>>>>>>>>>>>>>>> engine would
>>>>>>>>>>>>>>> set the column value to the wall clock time when appending a 
>>>>>>>>>>>>>>> row to a data
>>>>>>>>>>>>>>> file, right? every row would almost have a different timestamp 
>>>>>>>>>>>>>>> value.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> `current_timestamp` expression may not always carry the
>>>>>>>>>>>>>>>> right semantic for the use cases. E.g., latency tracking is 
>>>>>>>>>>>>>>>> interested in
>>>>>>>>>>>>>>>> when records are added / committed to the table, not when the 
>>>>>>>>>>>>>>>> record was
>>>>>>>>>>>>>>>> appended to an uncommitted data file in the processing engine.
>>>>>>>>>>>>>>>> Record creation and Iceberg commit can be minutes or even 
>>>>>>>>>>>>>>>> hours apart.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Row timestamp inherited from snapshot timestamp has no
>>>>>>>>>>>>>>>> overhead with the initial commit and has very minimal storage 
>>>>>>>>>>>>>>>> overhead
>>>>>>>>>>>>>>>> during file rewrite. Per-row current_timestamp would have 
>>>>>>>>>>>>>>>> distinct values
>>>>>>>>>>>>>>>> for every row and has more storage overhead.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> OLTP databases deal with small row-level transactions.
>>>>>>>>>>>>>>>> Postgres `current_timestamp` captures the transaction start 
>>>>>>>>>>>>>>>> time [1, 2].
>>>>>>>>>>>>>>>> Should we extend the same semantic to Iceberg: all rows added 
>>>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>>>> snapshot should have the same timestamp value?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this
>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>> To support the created_by timestamp, we would need to
>>>>>>>>>>>>>>>>>> implement the following row lineage behavior
>>>>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp
>>>>>>>>>>>>>>>>>> * during rewrite (like compaction), it should be
>>>>>>>>>>>>>>>>>> persisted into data files.
>>>>>>>>>>>>>>>>>> * during update, it needs to be carried over from the
>>>>>>>>>>>>>>>>>> previous row. This is similar to the row_id carry over for 
>>>>>>>>>>>>>>>>>> row updates.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sorry for the short hand.  These are not the same:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1.  Insertion time - time the row was inserted.
>>>>>>>>>>>>>>>>> 2.  Create by - The system that created the record.
>>>>>>>>>>>>>>>>> 3.  Updated by - The system that last updated the record.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Depending on the exact use-case these might or might not
>>>>>>>>>>>>>>>>> have utility.  I'm just wondering if there will be more 
>>>>>>>>>>>>>>>>> example like this
>>>>>>>>>>>>>>>>> in the future.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> created_by column would incur likely significantly higher
>>>>>>>>>>>>>>>>>> storage overhead compared to the updated_by column. As rows 
>>>>>>>>>>>>>>>>>> are updated
>>>>>>>>>>>>>>>>>> overtime, the cardinality for this column in data files can 
>>>>>>>>>>>>>>>>>> be high. Hence,
>>>>>>>>>>>>>>>>>> the created_by column may not compress well. This is a 
>>>>>>>>>>>>>>>>>> similar problem for
>>>>>>>>>>>>>>>>>> the row_id column. One side effect of enabling row lineage 
>>>>>>>>>>>>>>>>>> by default for
>>>>>>>>>>>>>>>>>> V3 tables is the storage overhead of row_id column after 
>>>>>>>>>>>>>>>>>> compaction
>>>>>>>>>>>>>>>>>> especially for narrow tables with few columns.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I agree.  I think this analysis also shows that some
>>>>>>>>>>>>>>>>> consumers of Iceberg might not necessarily want to have all 
>>>>>>>>>>>>>>>>> these columns,
>>>>>>>>>>>>>>>>> so we might want to make them configurable, rather than 
>>>>>>>>>>>>>>>>> mandating them for
>>>>>>>>>>>>>>>>> all tables. Ryan's thought on default values seems like it 
>>>>>>>>>>>>>>>>> would solve the
>>>>>>>>>>>>>>>>> issues I was raising.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set the 
>>>>>>>>>>>>>>>>>> column value.
>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the 
>>>>>>>>>>>>>>>>>> trigger timestamp.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Since the use cases don't require an exact timestamp,
>>>>>>>>>>>>>>>>>> this seems like the best solution to get what people want 
>>>>>>>>>>>>>>>>>> (an insertion
>>>>>>>>>>>>>>>>>> timestamp) that has clear and well-defined behavior. Since
>>>>>>>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it makes 
>>>>>>>>>>>>>>>>>> sense to me that
>>>>>>>>>>>>>>>>>> we could use it and have reasonable behavior.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've talked with Anton about this before and maybe he'll
>>>>>>>>>>>>>>>>>> jump in on this thread. I think that we may need to extend 
>>>>>>>>>>>>>>>>>> default values
>>>>>>>>>>>>>>>>>> to include default value expressions, like 
>>>>>>>>>>>>>>>>>> `current_timestamp` that is
>>>>>>>>>>>>>>>>>> allowed by the SQL spec. That would solve the problem as 
>>>>>>>>>>>>>>>>>> well as some
>>>>>>>>>>>>>>>>>> others (like `current_date` or `current_user`) and would not 
>>>>>>>>>>>>>>>>>> create a
>>>>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp feature 
>>>>>>>>>>>>>>>>>> in the format.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > Also some environments may have stronger clock service,
>>>>>>>>>>>>>>>>>> like Spanner TrueTime service.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder
>>>>>>>>>>>>>>>>>> commits and make timestamps out of order. I don't think that 
>>>>>>>>>>>>>>>>>> we should be
>>>>>>>>>>>>>>>>>> making guarantees or even exposing metadata that people 
>>>>>>>>>>>>>>>>>> might mistake as
>>>>>>>>>>>>>>>>>> having those guarantees.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regarding the concern for reliable timestamps, we are
>>>>>>>>>>>>>>>>>>> not proposing using timestamps for ordering. With NTP in 
>>>>>>>>>>>>>>>>>>> modern computers,
>>>>>>>>>>>>>>>>>>> they are generally reliable enough for the intended use 
>>>>>>>>>>>>>>>>>>> cases. Also some
>>>>>>>>>>>>>>>>>>> environments may have stronger clock service, like Spanner
>>>>>>>>>>>>>>>>>>> TrueTime service
>>>>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>
>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> >  joining to timestamps from the snapshots metadata
>>>>>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot
>>>>>>>>>>>>>>>>>>> history, which is often retained for a few days due to 
>>>>>>>>>>>>>>>>>>> performance reasons.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit one 
>>>>>>>>>>>>>>>>>>> from table
>>>>>>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set 
>>>>>>>>>>>>>>>>>>> the column value.
>>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the 
>>>>>>>>>>>>>>>>>>> trigger timestamp.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, the timestamp set during computation (like
>>>>>>>>>>>>>>>>>>> streaming ingestion or relative long batch computation) 
>>>>>>>>>>>>>>>>>>> doesn't capture the
>>>>>>>>>>>>>>>>>>> time the rows/files are added to the Iceberg table in a 
>>>>>>>>>>>>>>>>>>> batch fashion.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a longer
>>>>>>>>>>>>>>>>>>> history of snapshot timestamps, like storing a catalog's 
>>>>>>>>>>>>>>>>>>> event log for
>>>>>>>>>>>>>>>>>>> long-term access to timestamp info
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> this is not really consumable by joining the regular
>>>>>>>>>>>>>>>>>>> table query with catalog event log. I would also imagine 
>>>>>>>>>>>>>>>>>>> catalog event log
>>>>>>>>>>>>>>>>>>> is capped at shorter retention (maybe a few months) 
>>>>>>>>>>>>>>>>>>> compared to data
>>>>>>>>>>>>>>>>>>> retention (could be a few years).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at
>>>>>>>>>>>>>>>>>>>> the row level. Timestamps in metadata that would be 
>>>>>>>>>>>>>>>>>>>> carried down to the row
>>>>>>>>>>>>>>>>>>>> level already confuse people that expect them to be useful 
>>>>>>>>>>>>>>>>>>>> or reliable,
>>>>>>>>>>>>>>>>>>>> rather than for debugging. I think extending this to the 
>>>>>>>>>>>>>>>>>>>> row level would
>>>>>>>>>>>>>>>>>>>> only make the problem worse.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> You can already get this information by projecting the
>>>>>>>>>>>>>>>>>>>> last updated sequence number, which is reliable, and 
>>>>>>>>>>>>>>>>>>>> joining to timestamps
>>>>>>>>>>>>>>>>>>>> from the snapshots metadata table. Of course, the drawback 
>>>>>>>>>>>>>>>>>>>> there is losing
>>>>>>>>>>>>>>>>>>>> the timestamp information when snapshots expire, but since 
>>>>>>>>>>>>>>>>>>>> it isn't
>>>>>>>>>>>>>>>>>>>> reliable anyway I'd be fine with that.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance,
>>>>>>>>>>>>>>>>>>>> are probably better served by embedding a timestamp in DML 
>>>>>>>>>>>>>>>>>>>> (like
>>>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit 
>>>>>>>>>>>>>>>>>>>> one from table
>>>>>>>>>>>>>>>>>>>> metadata. And for those use cases, you could also keep a 
>>>>>>>>>>>>>>>>>>>> longer history of
>>>>>>>>>>>>>>>>>>>> snapshot timestamps, like storing a catalog's event log 
>>>>>>>>>>>>>>>>>>>> for long-term
>>>>>>>>>>>>>>>>>>>> access to timestamp info. I think that would be better 
>>>>>>>>>>>>>>>>>>>> than storing it at
>>>>>>>>>>>>>>>>>>>> the row level.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> For V4 spec, I have a small proposal [1] to expose the
>>>>>>>>>>>>>>>>>>>>> row timestamp concept that can help with many use cases 
>>>>>>>>>>>>>>>>>>>>> like temporal
>>>>>>>>>>>>>>>>>>>>> queries, latency tracking, TTL, auditing and compliance.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms * metadata column
>>>>>>>>>>>>>>>>>>>>> behaves very similarly to the
>>>>>>>>>>>>>>>>>>>>> *_last_updated_sequence_number* for row lineage.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot
>>>>>>>>>>>>>>>>>>>>>    timestamp.
>>>>>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its values are
>>>>>>>>>>>>>>>>>>>>>    persisted in the data files.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>

Re: [DISCUSS] row timestamp proposal

Reply via email to