Yeah, to add my perspective on that discussion, I think my primary concern is that people expect timestamps to be monotonic, and if they aren't, then a `_last_update_timestamp` field just makes the problem worse. But it is _nice_ to have row-level timestamps. So I would be okay if we revisit how we assign commit timestamps and improve it so that you get monotonic behavior.
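For concreteness, monotonic assignment can be as simple as issuing max(wall clock, last issued + 1). Here is a minimal Lamport-style sketch; the class and method names are hypothetical, not from the proposal doc or the Iceberg codebase:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of Lamport-style commit timestamp assignment: each
// issued timestamp is strictly greater than the previous one, even if the
// wall clock steps backward between commits.
public class MonotonicCommitClock {
  private final AtomicLong lastNanos = new AtomicLong(0);

  /** Returns a strictly increasing commit timestamp in nanoseconds. */
  public long nextCommitTimestampNanos() {
    long wallClockNanos = System.currentTimeMillis() * 1_000_000L; // ms -> ns
    // Atomically take the max of the wall clock and (last issued + 1).
    return lastNanos.updateAndGet(prev -> Math.max(prev + 1, wallClockNanos));
  }
}
```

If the wall clock is healthy, the issued values track real time; if it drifts backward, values advance by one nanosecond per commit until the clock catches up.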
On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:

> We had an offline discussion with Ryan. I revised the proposal as follows.
>
> 1. V4 would require writers to generate *monotonic* snapshot timestamps. The proposal doc has a section that describes a recommended implementation using Lamport timestamps.
> 2. Expose a *last_update_timestamp* metadata column that inherits from the snapshot timestamp.
>
> This is a relatively low-friction change that can fix the time travel problem and enable use cases like latency tracking, temporal queries, TTL, and auditing.
>
> There is no accuracy requirement on the timestamp values. In practice, modern servers with NTP have pretty reliable wall clocks. E.g., the Java library implements this validation <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>, which protects against backward clock drift of up to one minute for snapshot timestamps. I don't think we have heard many complaints of commit failures due to that clock drift validation.
>
> Would appreciate feedback on the revised proposal.
>
> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>
> Thanks,
> Steven
>
> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]> wrote:
>
>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is usually evaluated quite early in engines, so we could theoretically have another expression closer to the commit time. You are right, though: it won't be the actual commit time, given that we have to write it into the files. Also, I don't think generating a timestamp for a row as it is being written is going to be beneficial. To sum up, expression-based defaults would allow us to capture the time the transaction or write starts, but not the actual commit time.
>>
>> Russell, if the goal is to know what happened to the table in a given time frame, isn't the changelog scan the way to go? It would assign commit ordinals based on lineage and include row-level diffs. How would you be able to determine changes with row timestamps by just looking at the latest snapshot?
>>
>> It does seem promising to make snapshot timestamps strictly increasing to avoid ambiguity during time travel.
>>
>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>>
>>> > Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense.
>>>
>>> I'm not sure I get it, then. If we want monotonically increasing times, but they don't have to be real times, then how do you know what notion of "time" you care about for these filters? Or to put it another way, how do you know that your "before" and "after" times are reasonable? If the boundaries of these time queries can move around a bit, by how much?
>>>
>>> It seems to me that row IDs can play an important role here because you have the order guarantee that we seem to want for this use case: if snapshot A was committed before snapshot B, then the rows from A have row IDs that are always less than the row IDs of B. The problem is that we don't know where those row IDs start and end once A and B are no longer tracked. Using a "timestamp" seems to work, but I still worry that without reliable timestamps that correspond with some guarantee to real timestamps, we are creating a feature that seems reliable but isn't.
>>>
>>> I'm somewhat open to the idea of introducing a snapshot timestamp that the catalog guarantees is monotonically increasing. But if we did that, wouldn't we still need to know the association between these timestamps and snapshots after the snapshot metadata expires? My mental model is that this would be used to look for data that arrived, say, 3 weeks ago on Dec 24th. Since the snapshot metadata is no longer around, we could use the row timestamp to find those rows. But how do we know that the snapshot timestamps correspond to the actual timestamp range of Dec 24th? Is it just "close enough" as long as we don't have out-of-order timestamps? This is what I mean by needing to keep track of the association between timestamps and snapshots after the metadata expires. It seems like you either need to keep track of what the catalog's clock was for the events you care about, or you don't really care about exact timestamps.
>>>
>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <[email protected]> wrote:
>>>
>>>> The key goal here is the ability to answer the question "what happened to the table in some time window (before < t < after)?" Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense. Downstream applications (from what I know) are mostly sensitive to getting discrete and well-defined answers to this question, like:
>>>>
>>>> 1 < t < 2 should be exclusive of
>>>> 2 < t < 3 should be exclusive of
>>>> 3 < t < 4
>>>>
>>>> And the union of these should be the same as the query asking for 1 < t < 4.
>>>>
>>>> Currently this is not possible because we have no guarantee of ordering in our timestamps:
>>>>
>>>> Snapshots: A -> B -> C
>>>> Sequence numbers: 50 -> 51 -> 52
>>>> Timestamps: 3 -> 1 -> 2
>>>>
>>>> This makes time travel always a little wrong to start with.
>>>>
>>>> The Java implementation only allows one minute of negative time on commit, so we actually kind of do have this as a "light monotonicity" requirement, but as noted above there is no spec requirement for it. While we do have sequence numbers and row IDs, we still don't have a stable way of associating these with a consistent time in an engine-independent way.
>>>>
>>>> Ideally we just want one consistent way of answering the question "what did the table look like at time t?", which I think we get by adding a new field that is a timestamp, set by the catalog close to commit time, that always goes up.
>>>>
>>>> I'm not sure we can really do this with an engine expression, since engines won't know when the data is actually committed when writing files.
>>>>
>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <[email protected]> wrote:
>>>>
>>>>> This seems like a lot of new complexity in the format. I would like us to explore whether we can build the considered use cases on top of expression-based defaults instead.
>>>>>
>>>>> We already plan to support CURRENT_TIMESTAMP() and similar functions that are part of the SQL standard definition for default values. This would provide us a way to know the relative row order. True, this will usually represent the start of the operation. We may define COMMIT_TIMESTAMP() or a similar expression for the actual commit time, if there are use cases that need that.
>>>>> Plus, we may explore an approach similar to MySQL that allows users to reset the default value on update.
>>>>>
>>>>> - Anton
>>>>>
>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <[email protected]> wrote:
>>>>>
>>>>>> I think this is the right step forward. Our current "timestamp" definition is too ambiguous to be useful, so establishing a well-defined and monotonic timestamp could be really great. I also like the ability for rows to know this value without having to rely on snapshot information, which can be expired.
>>>>>>
>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have revised the row timestamp proposal with the following changes.
>>>>>>> * a new commit_timestamp field in snapshot metadata that has nanosecond precision
>>>>>>> * this optional field is only set by the REST catalog server
>>>>>>> * it needs to be monotonic (e.g. implemented using a Lamport timestamp)
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Steven
>>>>>>>
>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>
>>>>>>>> For long-running streaming jobs that commit periodically, it is difficult to establish a constant value of current_timestamp across all writer tasks for each commit cycle. I guess streaming writers may just need to write the wall clock time when appending a row to a data file as the default value of current_timestamp.
>>>>>>>>
>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I don't think that every row would have a different value. That would be up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP` into the plan and then replace it with a constant, resulting in a consistent value for all rows.
>>>>>>>>>
>>>>>>>>> You're right that this would not necessarily be the commit time. But neither is the commit timestamp from Iceberg's snapshot. I'm not sure how we are going to define "good enough" for this purpose. I think at least `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it is handled in engines. And if you want the Iceberg timestamp, then use a periodic query of the snapshots table to keep track of them in a table you can join to. I don't think this rises to the need for a table feature unless we can guarantee that it is correct.
>>>>>>>>>
>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> > Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?
>>>>>>>>>>
>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>
>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)
>>>>>>>>>>
>>>>>>>>>> Since Postgres current_timestamp captures the transaction start time, all rows added in the same insert transaction would have the same value as the transaction timestamp with the column definition above.
>>>>>>>>>>
>>>>>>>>>> If we extend a similar semantic to Iceberg, should all rows added in the same Iceberg transaction/snapshot have the same timestamp?
>>>>>>>>>>
>>>>>>>>>> Ryan, regarding your comment about using the current_timestamp expression as a column default value: you were thinking that the engine would set the column value to the wall clock time when appending a row to a data file, right? Then almost every row would have a different timestamp value.
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> The `current_timestamp` expression may not always carry the right semantics for the use cases. E.g., latency tracking is interested in when records are added/committed to the table, not when the record was appended to an uncommitted data file in the processing engine. Record creation and the Iceberg commit can be minutes or even hours apart.
>>>>>>>>>>>
>>>>>>>>>>> A row timestamp inherited from the snapshot timestamp has no overhead at the initial commit and very minimal storage overhead during file rewrite. A per-row current_timestamp would have distinct values for every row and more storage overhead.
>>>>>>>>>>>
>>>>>>>>>>> OLTP databases deal with small row-level transactions. Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?
>>>>>>>>>>>
>>>>>>>>>>> [1] https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>> [2] https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal. To support the created_by timestamp, we would need to implement the following row lineage behavior:
>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>> * During rewrite (like compaction), it should be persisted into data files.
>>>>>>>>>>>>> * During update, it needs to be carried over from the previous row. This is similar to the row_id carry-over for row updates.
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for the shorthand. These are not the same:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Insertion time - the time the row was inserted.
>>>>>>>>>>>> 2. Created by - the system that created the record.
>>>>>>>>>>>> 3. Updated by - the system that last updated the record.
>>>>>>>>>>>>
>>>>>>>>>>>> Depending on the exact use case, these might or might not have utility. I'm just wondering if there will be more examples like this in the future.
>>>>>>>>>>>>
>>>>>>>>>>>>> The created_by column would likely incur significantly higher storage overhead compared to the updated_by column. As rows are updated over time, the cardinality of this column in data files can be high. Hence, the created_by column may not compress well. This is a similar problem for the row_id column. One side effect of enabling row lineage by default for V3 tables is the storage overhead of the row_id column after compaction, especially for narrow tables with few columns.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree. I think this analysis also shows that some consumers of Iceberg might not necessarily want all these columns, so we might want to make them configurable rather than mandating them for all tables. Ryan's thought on default values seems like it would solve the issues I was raising.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Micah
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since the use cases don't require an exact timestamp, this seems like the best solution to get what people want (an insertion timestamp) that has clear and well-defined behavior. Since `current_timestamp` is defined by the SQL spec, it makes sense to me that we could use it and have reasonable behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've talked with Anton about this before and maybe he'll jump in on this thread. I think that we may need to extend default values to include default value expressions, like the `current_timestamp` that is allowed by the SQL spec. That would solve this problem as well as some others (like `current_date` or `current_user`) and would not create a potentially misleading (and heavyweight) timestamp feature in the format.
>>>>>>>>>>>>>
>>>>>>>>>>>>> > Also some environments may have a stronger clock service, like the Spanner TrueTime service.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Even in cases like this, commit retries can reorder commits and make timestamps out of order. I don't think that we should be making guarantees or even exposing metadata that people might mistake as having those guarantees.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regarding the concern about reliable timestamps, we are not proposing to use timestamps for ordering. With NTP in modern computers, they are generally reliable enough for the intended use cases.
>>>>>>>>>>>>>> Also, some environments may have a stronger clock service, like the Spanner TrueTime service <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > joining to timestamps from the snapshots metadata table.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As you also mentioned, that depends on the snapshot history, which is often retained for only a few days for performance reasons.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, a timestamp set during computation (like streaming ingestion or a relatively long batch computation) doesn't capture the time the rows/files are added to the Iceberg table in a batch fashion.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is not really consumable by joining a regular table query with the catalog event log. I would also imagine a catalog event log is capped at a shorter retention (maybe a few months) compared to data retention (which could be a few years).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at the row level. Timestamps in metadata that would be carried down to the row level already confuse people who expect them to be useful or reliable rather than for debugging. I think extending this to the row level would only make the problem worse.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You can already get this information by projecting the last updated sequence number, which is reliable, and joining to timestamps from the snapshots metadata table. Of course, the drawback there is losing the timestamp information when snapshots expire, but since it isn't reliable anyway I'd be fine with that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance, are probably better served by embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata. And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info.
>>>>>>>>>>>>>>> I think that would be better than storing it at the row level.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose a row timestamp concept that can help with many use cases like temporal queries, latency tracking, TTL, auditing, and compliance.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves very similarly to the *_last_updated_sequence_number* column for row lineage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>>>> - During rewrite (like compaction), its values are persisted in the data files.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
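To make the intended usage concrete, a time-window scan over the proposed metadata column might look like the sketch below. This assumes Spark with an Iceberg catalog; the table name `db.events` is illustrative, and `_last_updated_timestamp_ms` is the column from the proposal, not part of the current spec.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch: find rows that landed on Dec 24th (UTC) via the
// proposed _last_updated_timestamp_ms metadata column, even after the
// snapshots that wrote them have expired.
public class TimeWindowScan {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("row-timestamp-window")
        .getOrCreate();

    Dataset<Row> dec24 = spark.sql(
        "SELECT * FROM db.events "
            + "WHERE _last_updated_timestamp_ms >= 1766534400000 "  // 2025-12-24T00:00:00Z
            + "AND _last_updated_timestamp_ms < 1766620800000");    // 2025-12-25T00:00:00Z
    dec24.show();
  }
}
```

With monotonic snapshot timestamps, adjacent half-open windows like this are mutually exclusive and their union equals the enclosing window, which is the property Russell describes above.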
