I've gone back and forth on the inherited columns. The thing that keeps
coming back to me is that I don't like that the only way to determine the
timestamp associated with a row update/creation is to do a join back
against table metadata. While that's doable, it feels user-unfriendly.



On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]> wrote:

> Anton, you are right that the row-level deletes will be a problem for some
> of the mentioned use cases (like incremental processing). I have
> clarified the applicability of some use cases to "tables with inserts and
> updates only".
>
> Right now, we are only tracking modification/commit time (not insertion
> time) in case of updates.
>
> On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]>
> wrote:
>
>> I think there is clear consensus that making snapshot timestamps strictly
>> increasing is a positive thing. I am also +1.
>>
>> - How will row timestamps allow us to reliably implement incremental
>> consumption independent of the snapshot retention given that rows can be
>> added AND removed in a particular time frame? How can we capture all
>> changes by just looking at the latest snapshot?
>> - Some use cases in the doc need the insertion time and some need the
>> last modification time. Do we plan to support both?
>> - What do we expect the behavior to be in UPDATE and MERGE operations?
>>
>> To be clear: I am not opposed to this change, just want to make sure I
>> understand all use cases that we aim to address and what would be required
>> in engines.
>>
>> On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <[email protected]>
>> wrote:
>>
>>> +1 for improving how the commit timestamps are assigned monotonically
>>> since this requirement has emerged over multiple discussions like
>>> notifications, multi-table transactions, time travel accuracy and row
>>> timestamps. It would be good to have a single consistent way to represent
>>> and assign timestamps that could be leveraged across multiple features.
>>>
>>> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:
>>>
>>>> Yeah, to add my perspective on that discussion, I think my primary
>>>> concern is that people expect timestamps to be monotonic and if they aren't
>>>> then a `_last_update_timestamp` field just makes the problem worse. But it
>>>> is _nice_ to have row-level timestamps. So I would be okay if we revisit
>>>> how we assign commit timestamps and improve it so that you get monotonic
>>>> behavior.
>>>>
>>>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> We had an offline discussion with Ryan. I revised the proposal as
>>>>> follows.
>>>>>
>>>>> 1. V4 would require writers to generate *monotonic* snapshot
>>>>> timestamps. The proposal doc has a section that describes a recommended
>>>>> implementation using Lamport timestamps (see the sketch below).
>>>>> 2. Expose a *last_update_timestamp* metadata column that inherits its
>>>>> value from the snapshot timestamp.
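>>>>>
>>>>> A minimal sketch of the Lamport-style assignment in (1), assuming the
>>>>> writer tracks the last timestamp it issued; the class and method names
>>>>> are illustrative, not part of the proposal or the Iceberg API:
>>>>>
>>>>> import java.util.concurrent.atomic.AtomicLong;
>>>>>
>>>>> // Monotonic commit timestamps: take the wall clock when it moves
>>>>> // forward, otherwise bump the last issued value by one so ordering
>>>>> // never goes backwards even if the clock drifts.
>>>>> class MonotonicTimestamps {
>>>>>   private final AtomicLong lastIssuedMillis = new AtomicLong(Long.MIN_VALUE);
>>>>>
>>>>>   long nextCommitTimestampMillis() {
>>>>>     long wallClock = System.currentTimeMillis();
>>>>>     return lastIssuedMillis.updateAndGet(last -> Math.max(wallClock, last + 1));
>>>>>   }
>>>>> }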
>>>>>
>>>>> This is a relatively low-friction change that can fix the time travel
>>>>> problem and enable use cases like latency tracking, temporal queries, TTL,
>>>>> and auditing.
>>>>>
>>>>> There is no accuracy requirement on the timestamp values. In practice,
>>>>> modern servers with NTP have pretty reliable wall clocks. E.g., the Java
>>>>> library implements this validation
>>>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
>>>>> that protects against backward clock drift of up to one minute for snapshot
>>>>> timestamps. I don't think we have heard many complaints of commit failures
>>>>> due to that clock drift validation.
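>>>>>
>>>>> Roughly, that validation amounts to the following kind of check (a
>>>>> standalone paraphrase, not the actual Iceberg code; names are
>>>>> illustrative):
>>>>>
>>>>> // Reject a new snapshot timestamp that is more than one minute
>>>>> // earlier than the previous snapshot's timestamp.
>>>>> class TimestampDriftCheck {
>>>>>   private static final long ONE_MINUTE_MS = 60_000L;
>>>>>
>>>>>   static void validate(long lastSnapshotMillis, long newSnapshotMillis) {
>>>>>     if (newSnapshotMillis < lastSnapshotMillis - ONE_MINUTE_MS) {
>>>>>       throw new IllegalArgumentException(
>>>>>           "Snapshot timestamp " + newSnapshotMillis
>>>>>               + " is more than a minute older than " + lastSnapshotMillis);
>>>>>     }
>>>>>   }
>>>>> }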
>>>>>
>>>>> Would appreciate feedback on the revised proposal.
>>>>>
>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is
>>>>>> usually evaluated quite early in engines so we could theoretically have
>>>>>> another expression closer to the commit time. You are right, though, it
>>>>>> won't be the actual commit time given that we have to write it into the
>>>>>> files. Also, I don't think generating a timestamp for a row as it is 
>>>>>> being
>>>>>> written is going to be beneficial. To sum up, expression-based defaults
>>>>>> would allow us to capture the time the transaction or write starts, but 
>>>>>> not
>>>>>> the actual commit time.
>>>>>>
>>>>>> Russell, if the goal is to know what happened to the table in a given
>>>>>> time frame, isn't the changelog scan the way to go? It would assign 
>>>>>> commit
>>>>>> ordinals based on lineage and include row-level diffs. How would you be
>>>>>> able to determine changes with row timestamps by just looking at the 
>>>>>> latest
>>>>>> snapshot?
>>>>>>
>>>>>> It does seem promising to make snapshot timestamps strictly
>>>>>> increasing to avoid ambiguity during time travel.
>>>>>>
>>>>>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> > Whether or not "t" is an atomic clock time is not as important as
>>>>>>> the query between time bounds making sense.
>>>>>>>
>>>>>>> I'm not sure I get it then. If we want monotonically increasing
>>>>>>> times, but they don't have to be real times then how do you know what
>>>>>>> notion of "time" you care about for these filters? Or to put it another
>>>>>>> way, how do you know that your "before" and "after" times are 
>>>>>>> reasonable?
>>>>>>> If the boundaries of these time queries can move around a bit, by how 
>>>>>>> much?
>>>>>>>
>>>>>>> It seems to me that row IDs can play an important role here because
>>>>>>> you have the order guarantee that we seem to want for this use case: if
>>>>>>> snapshot A was committed before snapshot B, then the rows from A have 
>>>>>>> row
>>>>>>> IDs that are always less than the row IDs of B. The problem is that we
>>>>>>> don't know where those row IDs start and end once A and B are no longer
>>>>>>> tracked. Using a "timestamp" seems to work, but I still worry that 
>>>>>>> without
>>>>>>> reliable timestamps that correspond with some guarantee to real 
>>>>>>> timestamps,
>>>>>>> we are creating a feature that seems reliable but isn't.
>>>>>>>
>>>>>>> I'm somewhat open to the idea of introducing a snapshot timestamp
>>>>>>> that the catalog guarantees is monotonically increasing. But if we did
>>>>>>> that, wouldn't we still need to know the association between these
>>>>>>> timestamps and snapshots after the snapshot metadata expires? My mental
>>>>>>> model is that this would be used to look for data that arrived, say, 3
>>>>>>> weeks ago on Dec 24th. Since the snapshots metadata is no longer around 
>>>>>>> we
>>>>>>> could use the row timestamp to find those rows. But how do we know that 
>>>>>>> the
>>>>>>> snapshot timestamps correspond to the actual timestamp range of Dec 
>>>>>>> 24th?
>>>>>>> Is it just "close enough" as long as we don't have out of order 
>>>>>>> timestamps?
>>>>>>> This is what I mean by needing to keep track of the association between
>>>>>>> timestamps and snapshots after the metadata expires. Seems like you 
>>>>>>> either
>>>>>>> need to keep track of what the catalog's clock was for events you care
>>>>>>> about, or you don't really care about exact timestamps.
>>>>>>>
>>>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> The key goal here is the ability to answer the question "what
>>>>>>>> happened to the table in some time window (before < t < after)?"
>>>>>>>> Whether or not "t" is an atomic clock time is not as important as
>>>>>>>> the query between time bounds making sense.
>>>>>>>> Downstream applications (from what I know) are mostly sensitive to
>>>>>>>> getting discrete and well-defined answers to this question, like:
>>>>>>>>
>>>>>>>> 1 < t < 2 should be exclusive of
>>>>>>>> 2 < t < 3 should be exclusive of
>>>>>>>> 3 < t < 4
>>>>>>>>
>>>>>>>> And the union of these should be the same as the query asking for
>>>>>>>> 1 < t < 4.
>>>>>>>>
>>>>>>>> Currently this is not possible because we have no guarantee of
>>>>>>>> ordering in our timestamps:
>>>>>>>>
>>>>>>>> Snapshots
>>>>>>>> A -> B -> C
>>>>>>>> Sequence numbers
>>>>>>>> 50 -> 51 -> 52
>>>>>>>> Timestamps
>>>>>>>> 3 -> 1 -> 2
>>>>>>>>
>>>>>>>> This makes time travel always a little wrong to start with.
>>>>>>>>
>>>>>>>> The Java implementation only allows one minute of negative time on
>>>>>>>> commit, so we actually kind of do have a "light monotonicity"
>>>>>>>> requirement, but as noted above there is no spec requirement for
>>>>>>>> this. While we do have sequence numbers and row IDs, we still don't
>>>>>>>> have a stable way of associating these with a consistent time in an
>>>>>>>> engine-independent way.
>>>>>>>>
>>>>>>>> Ideally we just want to have one consistent way of answering the
>>>>>>>> question "what did the table look like at time t"
>>>>>>>> which I think we get by adding in a new field that is a timestamp,
>>>>>>>> set by the Catalog close to commit time,
>>>>>>>> that always goes up.
>>>>>>>>
>>>>>>>> I'm not sure we can really do this with an engine expression, since
>>>>>>>> engines won't know when the data is actually committed when writing
>>>>>>>> files?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> This seems like a lot of new complexity in the format. I would
>>>>>>>>> like us to explore whether we can build the considered use cases on 
>>>>>>>>> top of
>>>>>>>>> expression-based defaults instead.
>>>>>>>>>
>>>>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar
>>>>>>>>> functions that are part of the SQL standard definition for default 
>>>>>>>>> values.
>>>>>>>>> This would provide us a way to know the relative row order. True, this
>>>>>>>>> usually will represent the start of the operation. We may define
>>>>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual commit 
>>>>>>>>> time, if
>>>>>>>>> there are use cases that need that. Plus, we may explore an approach
>>>>>>>>> similar to MySQL that allows users to reset the default value on 
>>>>>>>>> update.
>>>>>>>>>
>>>>>>>>> - Anton
>>>>>>>>>
>>>>>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I think this is the right step forward. Our current "timestamp"
>>>>>>>>>> definition is too ambiguous to be useful, so establishing
>>>>>>>>>> a well-defined and monotonic timestamp could be really great. I
>>>>>>>>>> also like the ability for rows to know this value without
>>>>>>>>>> having to rely on snapshot information, which can be expired.
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I have revised the row timestamp proposal with the following
>>>>>>>>>>> changes:
>>>>>>>>>>> * a new commit_timestamp field in snapshot metadata that has
>>>>>>>>>>> nanosecond precision
>>>>>>>>>>> * this optional field is only set by the REST catalog server
>>>>>>>>>>> * it needs to be monotonic (e.g. implemented using a Lamport
>>>>>>>>>>> timestamp)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Steven
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>>>>
>>>>>>>>>>>> For long-running streaming jobs that commit periodically, it is
>>>>>>>>>>>> difficult to establish the constant value of current_timestamp 
>>>>>>>>>>>> across all
>>>>>>>>>>>> writer tasks for each commit cycle. I guess streaming writers may 
>>>>>>>>>>>> just need
>>>>>>>>>>>> to write the wall clock time when appending a row to a data file 
>>>>>>>>>>>> for the
>>>>>>>>>>>> default value of current_timestamp.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think that every row would have a different value.
>>>>>>>>>>>>> That would be up to the engine, but I would expect engines to 
>>>>>>>>>>>>> insert
>>>>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a 
>>>>>>>>>>>>> constant,
>>>>>>>>>>>>> resulting in a consistent value for all rows.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're right that this would not necessarily be the commit
>>>>>>>>>>>>> time. But neither is the commit timestamp from Iceberg's 
>>>>>>>>>>>>> snapshot. I'm not
>>>>>>>>>>>>> sure how we are going to define "good enough" for this purpose. I 
>>>>>>>>>>>>> think at
>>>>>>>>>>>>> least `CURRENT_TIMESTAMP` has reliable and known behavior when 
>>>>>>>>>>>>> you look at
>>>>>>>>>>>>> how it is handled in engines. And if you want the Iceberg 
>>>>>>>>>>>>> timestamp, then
>>>>>>>>>>>>> use a periodic query of the snapshots table to keep track of them
>>>>>>>>>>>>> in a
>>>>>>>>>>>>> table you can join to. I don't think this rises to the need for a 
>>>>>>>>>>>>> table
>>>>>>>>>>>>> feature unless we can guarantee that it is correct.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Postgres `current_timestamp` captures the transaction start
>>>>>>>>>>>>>> time [1, 2]. Should we extend the same semantic to Iceberg: all 
>>>>>>>>>>>>>> rows added
>>>>>>>>>>>>>> in the same snapshot should have the same timestamp value?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since Postgres current_timestamp captures the transaction
>>>>>>>>>>>>>> start time, all rows added in the same insert transaction would 
>>>>>>>>>>>>>> have the
>>>>>>>>>>>>>> same value as the transaction timestamp with the column 
>>>>>>>>>>>>>> definition above.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we extend a similar semantic to Iceberg, all rows added in
>>>>>>>>>>>>>> the same Iceberg transaction/snapshot should have the same 
>>>>>>>>>>>>>> timestamp?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ryan, regarding your comment about using the current_timestamp
>>>>>>>>>>>>>> expression as a column default value: you were thinking that the
>>>>>>>>>>>>>> engine would set the column value to the wall clock time when
>>>>>>>>>>>>>> appending a row to a data file, right? Almost every row would
>>>>>>>>>>>>>> have a different timestamp value.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The `current_timestamp` expression may not always carry the
>>>>>>>>>>>>>>> right semantics for these use cases. E.g., latency tracking is
>>>>>>>>>>>>>>> interested in when records are added/committed to the table, not
>>>>>>>>>>>>>>> when the record was appended to an uncommitted data file in the
>>>>>>>>>>>>>>> processing engine. Record creation and Iceberg commit can be
>>>>>>>>>>>>>>> minutes or even hours apart.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A row timestamp inherited from the snapshot timestamp has no
>>>>>>>>>>>>>>> overhead at the initial commit and very minimal storage overhead
>>>>>>>>>>>>>>> during file rewrite. A per-row current_timestamp would have
>>>>>>>>>>>>>>> distinct values for every row and more storage overhead.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OLTP databases deal with small row-level transactions.
>>>>>>>>>>>>>>> Postgres `current_timestamp` captures the transaction start 
>>>>>>>>>>>>>>> time [1, 2].
>>>>>>>>>>>>>>> Should we extend the same semantic to Iceberg: all rows added 
>>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>>> snapshot should have the same timestamp value?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>>>>>>>>>>>>>> To support the created_by timestamp, we would need to
>>>>>>>>>>>>>>>>> implement the following row lineage behavior
>>>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp
>>>>>>>>>>>>>>>>> * during rewrite (like compaction), it should be persisted
>>>>>>>>>>>>>>>>> into data files.
>>>>>>>>>>>>>>>>> * during update, it needs to be carried over from the
>>>>>>>>>>>>>>>>> previous row. This is similar to the row_id carry over for 
>>>>>>>>>>>>>>>>> row updates.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sorry for the shorthand. These are not the same:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1.  Insertion time - the time the row was inserted.
>>>>>>>>>>>>>>>> 2.  Created by - the system that created the record.
>>>>>>>>>>>>>>>> 3.  Updated by - the system that last updated the record.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Depending on the exact use case these might or might not
>>>>>>>>>>>>>>>> have utility. I'm just wondering if there will be more examples
>>>>>>>>>>>>>>>> like this in the future.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The created_by column would likely incur significantly higher
>>>>>>>>>>>>>>>>> storage overhead compared to the updated_by column. As rows are
>>>>>>>>>>>>>>>>> updated over time, the cardinality for this column in data files
>>>>>>>>>>>>>>>>> can be high. Hence, the created_by column may not compress well.
>>>>>>>>>>>>>>>>> This is a similar problem for the row_id column. One side effect
>>>>>>>>>>>>>>>>> of enabling row lineage by default for V3 tables is the storage
>>>>>>>>>>>>>>>>> overhead of the row_id column after compaction, especially for
>>>>>>>>>>>>>>>>> narrow tables with few columns.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree.  I think this analysis also shows that some
>>>>>>>>>>>>>>>> consumers of Iceberg might not necessarily want to have all 
>>>>>>>>>>>>>>>> these columns,
>>>>>>>>>>>>>>>> so we might want to make them configurable, rather than 
>>>>>>>>>>>>>>>> mandating them for
>>>>>>>>>>>>>>>> all tables. Ryan's thought on default values seems like it 
>>>>>>>>>>>>>>>> would solve the
>>>>>>>>>>>>>>>> issues I was raising.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set the 
>>>>>>>>>>>>>>>>> column value.
>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the 
>>>>>>>>>>>>>>>>> trigger timestamp.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since the use cases don't require an exact timestamp, this
>>>>>>>>>>>>>>>>> seems like the best solution to get what people want (an 
>>>>>>>>>>>>>>>>> insertion
>>>>>>>>>>>>>>>>> timestamp) that has clear and well-defined behavior. Since
>>>>>>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it makes 
>>>>>>>>>>>>>>>>> sense to me that
>>>>>>>>>>>>>>>>> we could use it and have reasonable behavior.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've talked with Anton about this before and maybe he'll
>>>>>>>>>>>>>>>>> jump in on this thread. I think that we may need to extend 
>>>>>>>>>>>>>>>>> default values
>>>>>>>>>>>>>>>>> to include default value expressions, like 
>>>>>>>>>>>>>>>>> `current_timestamp` that is
>>>>>>>>>>>>>>>>> allowed by the SQL spec. That would solve the problem as well 
>>>>>>>>>>>>>>>>> as some
>>>>>>>>>>>>>>>>> others (like `current_date` or `current_user`) and would not 
>>>>>>>>>>>>>>>>> create a
>>>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp feature in 
>>>>>>>>>>>>>>>>> the format.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > Also some environments may have stronger clock service,
>>>>>>>>>>>>>>>>> like Spanner TrueTime service.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder
>>>>>>>>>>>>>>>>> commits and make timestamps out of order. I don't think that 
>>>>>>>>>>>>>>>>> we should be
>>>>>>>>>>>>>>>>> making guarantees or even exposing metadata that people might 
>>>>>>>>>>>>>>>>> mistake as
>>>>>>>>>>>>>>>>> having those guarantees.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regarding the concern for reliable timestamps, we are not
>>>>>>>>>>>>>>>>>> proposing using timestamps for ordering. With NTP in modern 
>>>>>>>>>>>>>>>>>> computers, they
>>>>>>>>>>>>>>>>>> are generally reliable enough for the intended use cases. 
>>>>>>>>>>>>>>>>>> Also some
>>>>>>>>>>>>>>>>>> environments may have stronger clock service, like Spanner
>>>>>>>>>>>>>>>>>> TrueTime service
>>>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>
>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> >  joining to timestamps from the snapshots metadata
>>>>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot
>>>>>>>>>>>>>>>>>> history, which is often retained for a few days due to 
>>>>>>>>>>>>>>>>>> performance reasons.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like `current_timestamp`)
>>>>>>>>>>>>>>>>>> rather than relying on an implicit one from table metadata.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set the 
>>>>>>>>>>>>>>>>>> column value.
>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the 
>>>>>>>>>>>>>>>>>> trigger timestamp.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also, a timestamp set during computation (like streaming
>>>>>>>>>>>>>>>>>> ingestion or a relatively long batch computation) doesn't
>>>>>>>>>>>>>>>>>> capture the time the rows/files are added to the Iceberg
>>>>>>>>>>>>>>>>>> table in a batch fashion.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a longer
>>>>>>>>>>>>>>>>>> history of snapshot timestamps, like storing a catalog's 
>>>>>>>>>>>>>>>>>> event log for
>>>>>>>>>>>>>>>>>> long-term access to timestamp info
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is not really consumable by joining a regular table
>>>>>>>>>>>>>>>>>> query with the catalog event log. I would also imagine the
>>>>>>>>>>>>>>>>>> catalog event log is capped at a shorter retention (maybe a
>>>>>>>>>>>>>>>>>> few months) compared to data retention (which could be a few
>>>>>>>>>>>>>>>>>> years).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at
>>>>>>>>>>>>>>>>>>> the row level. Timestamps in metadata that would be carried 
>>>>>>>>>>>>>>>>>>> down to the row
>>>>>>>>>>>>>>>>>>> level already confuse people that expect them to be useful 
>>>>>>>>>>>>>>>>>>> or reliable,
>>>>>>>>>>>>>>>>>>> rather than for debugging. I think extending this to the 
>>>>>>>>>>>>>>>>>>> row level would
>>>>>>>>>>>>>>>>>>> only make the problem worse.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> You can already get this information by projecting the
>>>>>>>>>>>>>>>>>>> last updated sequence number, which is reliable, and 
>>>>>>>>>>>>>>>>>>> joining to timestamps
>>>>>>>>>>>>>>>>>>> from the snapshots metadata table. Of course, the drawback 
>>>>>>>>>>>>>>>>>>> there is losing
>>>>>>>>>>>>>>>>>>> the timestamp information when snapshots expire, but since 
>>>>>>>>>>>>>>>>>>> it isn't
>>>>>>>>>>>>>>>>>>> reliable anyway I'd be fine with that.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance, are
>>>>>>>>>>>>>>>>>>> probably better served by embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit one 
>>>>>>>>>>>>>>>>>>> from table
>>>>>>>>>>>>>>>>>>> metadata. And for those use cases, you could also keep a 
>>>>>>>>>>>>>>>>>>> longer history of
>>>>>>>>>>>>>>>>>>> snapshot timestamps, like storing a catalog's event log for 
>>>>>>>>>>>>>>>>>>> long-term
>>>>>>>>>>>>>>>>>>> access to timestamp info. I think that would be better than 
>>>>>>>>>>>>>>>>>>> storing it at
>>>>>>>>>>>>>>>>>>> the row level.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose the
>>>>>>>>>>>>>>>>>>>> row timestamp concept that can help with many use cases like
>>>>>>>>>>>>>>>>>>>> temporal queries, latency tracking, TTL, auditing, and
>>>>>>>>>>>>>>>>>>>> compliance.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves
>>>>>>>>>>>>>>>>>>>> very similarly to *_last_updated_sequence_number* for row
>>>>>>>>>>>>>>>>>>>> lineage (a small sketch of the inheritance behavior follows
>>>>>>>>>>>>>>>>>>>> the list below).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot
>>>>>>>>>>>>>>>>>>>>    timestamp.
>>>>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its values are
>>>>>>>>>>>>>>>>>>>>    persisted in the data files.
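>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> A minimal sketch of that inheritance rule; the class and
>>>>>>>>>>>>>>>>>>>> method names are illustrative, not part of the proposal or
>>>>>>>>>>>>>>>>>>>> the Iceberg API:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> // A value persisted in the data file (e.g. written during
>>>>>>>>>>>>>>>>>>>> // compaction) wins; otherwise the row inherits the timestamp
>>>>>>>>>>>>>>>>>>>> // of the snapshot that added the file, mirroring how
>>>>>>>>>>>>>>>>>>>> // _last_updated_sequence_number is resolved.
>>>>>>>>>>>>>>>>>>>> class RowTimestamps {
>>>>>>>>>>>>>>>>>>>>   static long lastUpdatedTimestampMs(
>>>>>>>>>>>>>>>>>>>>       Long persistedMs, long snapshotTimestampMs) {
>>>>>>>>>>>>>>>>>>>>     return persistedMs != null ? persistedMs : snapshotTimestampMs;
>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>> }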
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
