Anton, you are right that row-level deletes will be a problem for some of the mentioned use cases (like incremental processing). I have clarified in the doc that those use cases apply only to tables with inserts and updates (no row-level deletes).
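To make the first point of the revised proposal concrete: the Lamport-style assignment the doc recommends boils down to roughly the following (an illustrative sketch with made-up names, not the doc's exact pseudocode). The only guarantee is that each new snapshot's commit timestamp is strictly greater than its parent's, even if the local clock drifts backwards; no wall-clock accuracy is implied beyond that.

  // Sketch only: Lamport-style monotonic commit timestamps.
  // Use the wall clock when it is ahead of the parent snapshot's timestamp,
  // otherwise advance one tick past the parent. Names are illustrative.
  public class MonotonicCommitTimestamps {

    static long nextCommitTimestampNanos(long parentCommitTimestampNanos) {
      long wallClockNanos = System.currentTimeMillis() * 1_000_000L;
      return Math.max(wallClockNanos, parentCommitTimestampNanos + 1);
    }

    public static void main(String[] args) {
      // Simulate a parent snapshot stamped one minute "in the future",
      // i.e. the local clock has drifted backwards since the parent commit.
      long parentNanos = System.currentTimeMillis() * 1_000_000L + 60_000_000_000L;
      long childNanos = nextCommitTimestampNanos(parentNanos);
      System.out.println(childNanos > parentNanos); // always true
    }
  }
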
Right now, we are only tracking modification/commit time (not insertion time) in case of updates. On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]> wrote: > I think there is clear consensus that making snapshot timestamps strictly > increasing is a positive thing. I am also +1. > > - How will row timestamps allow us to reliably implement incremental > consumption independent of the snapshot retention given that rows can be > added AND removed in a particular time frame? How can we capture all > changes by just looking at the latest snapshot? > - Some use cases in the doc need the insertion time and some need the last > modification time. Do we plan to support both? > - What do we expect the behavior to be in UPDATE and MERGE operations? > > To be clear: I am not opposed to this change, just want to make sure I > understand all use cases that we aim to address and what would be required > in engines. > > чт, 15 січ. 2026 р. о 17:01 Maninder Parmar <[email protected]> > пише: > >> +1 for improving how the commit timestamps are assigned monotonically >> since this requirement has emerged over multiple discussions like >> notifications, multi-table transactions, time travel accuracy and row >> timestamps. It would be good to have a single consistent way to represent >> and assign timestamps that could be leveraged across multiple features. >> >> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote: >> >>> Yeah, to add my perspective on that discussion, I think my primary >>> concern is that people expect timestamps to be monotonic and if they aren't >>> then a `_last_update_timestamp` field just makes the problem worse. But it >>> is _nice_ to have row-level timestamps. So I would be okay if we revisit >>> how we assign commit timestamps and improve it so that you get monotonic >>> behavior. >>> >>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote: >>> >>>> We had an offline discussion with Ryan. I revised the proposal as >>>> follows. >>>> >>>> 1. V4 would require writers to generate *monotonic* snapshot >>>> timestamps. The proposal doc has a section that describes a recommended >>>> implementation using lamport timestamps. >>>> 2. Expose *last_update_timestamp* metadata column that inherits from >>>> snapshot timestamp >>>> >>>> This is a relatively low-friction change that can fix the time travel >>>> problem and enable use cases like latency tracking, temporal query, TTL, >>>> auditing. >>>> >>>> There is no accuracy requirement on the timestamp values. In practice, >>>> modern servers with NTP have pretty reliable wall clocks. E.g., Java >>>> library implemented this validation >>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377> >>>> that >>>> protects against backward clock drift up to one minute for snapshot >>>> timestamps. Don't think we have heard many complaints of commit failure due >>>> to that clock drift validation. >>>> >>>> Would appreciate feedback on the revised proposal. >>>> >>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0 >>>> >>>> Thanks, >>>> Steven >>>> >>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]> >>>> wrote: >>>> >>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is >>>>> usually evaluated quite early in engines so we could theoretically have >>>>> another expression closer to the commit time. 
You are right, though, it >>>>> won't be the actual commit time given that we have to write it into the >>>>> files. Also, I don't think generating a timestamp for a row as it is being >>>>> written is going to be beneficial. To sum up, expression-based defaults >>>>> would allow us to capture the time the transaction or write starts, but >>>>> not >>>>> the actual commit time. >>>>> >>>>> Russell, if the goal is to know what happened to the table in a given >>>>> time frame, isn't the changelog scan the way to go? It would assign commit >>>>> ordinals based on lineage and include row-level diffs. How would you be >>>>> able to determine changes with row timestamps by just looking at the >>>>> latest >>>>> snapshot? >>>>> >>>>> It does seem promising to make snapshot timestamps strictly increasing >>>>> to avoid ambiguity during time travel. >>>>> >>>>> вт, 13 січ. 2026 р. о 16:33 Ryan Blue <[email protected]> пише: >>>>> >>>>>> > Whether or not "t" is an atomic clock time is not as important as >>>>>> the query between time bounds making sense. >>>>>> >>>>>> I'm not sure I get it then. If we want monotonically increasing >>>>>> times, but they don't have to be real times then how do you know what >>>>>> notion of "time" you care about for these filters? Or to put it another >>>>>> way, how do you know that your "before" and "after" times are reasonable? >>>>>> If the boundaries of these time queries can move around a bit, by how >>>>>> much? >>>>>> >>>>>> It seems to me that row IDs can play an important role here because >>>>>> you have the order guarantee that we seem to want for this use case: if >>>>>> snapshot A was committed before snapshot B, then the rows from A have row >>>>>> IDs that are always less than the rows IDs of B. The problem is that we >>>>>> don't know where those row IDs start and end once A and B are no longer >>>>>> tracked. Using a "timestamp" seems to work, but I still worry that >>>>>> without >>>>>> reliable timestamps that correspond with some guarantee to real >>>>>> timestamps, >>>>>> we are creating a feature that seems reliable but isn't. >>>>>> >>>>>> I'm somewhat open to the idea of introducing a snapshot timestamp >>>>>> that the catalog guarantees is monotonically increasing. But if we did >>>>>> that, wouldn't we still need to know the association between these >>>>>> timestamps and snapshots after the snapshot metadata expires? My mental >>>>>> model is that this would be used to look for data that arrived, say, 3 >>>>>> weeks ago on Dec 24th. Since the snapshots metadata is no longer around >>>>>> we >>>>>> could use the row timestamp to find those rows. But how do we know that >>>>>> the >>>>>> snapshot timestamps correspond to the actual timestamp range of Dec 24th? >>>>>> Is it just "close enough" as long as we don't have out of order >>>>>> timestamps? >>>>>> This is what I mean by needing to keep track of the association between >>>>>> timestamps and snapshots after the metadata expires. Seems like you >>>>>> either >>>>>> need to keep track of what the catalog's clock was for events you care >>>>>> about, or you don't really care about exact timestamps. >>>>>> >>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> The key goal here is the ability to answer the question "what >>>>>>> happened to the table in some time window. (before < t < after)?" >>>>>>> Whether or not "t" is an atomic clock time is not as important as >>>>>>> the query between time bounds making sense. 
>>>>>>> Downstream applications (from what I know) are mostly sensitive to >>>>>>> getting discrete and well defined answers to >>>>>>> this question like: >>>>>>> >>>>>>> 1 < t < 2 should be exclusive of >>>>>>> 2 < t < 3 should be exclusive of >>>>>>> 3 < t < 4 >>>>>>> >>>>>>> And the union of these should be the same as the query asking for 1 >>>>>>> < t < 4 >>>>>>> >>>>>>> Currently this is not possible because we have no guarantee of >>>>>>> ordering in our timestamps >>>>>>> >>>>>>> Snapshots >>>>>>> A -> B -> C >>>>>>> Sequence numbers >>>>>>> 50 -> 51 -> 52 >>>>>>> Timestamp >>>>>>> 3 -> 1 -> 2 >>>>>>> >>>>>>> This makes time travel always a little wrong to start with. >>>>>>> >>>>>>> The Java implementation only allows one minute of negative time on >>>>>>> commit so we actually kind of do have this as a >>>>>>> "light monotonicity" requirement but as noted above there is no spec >>>>>>> requirement for this. While we do have sequence >>>>>>> number and row id, we still don't have a stable way of associating >>>>>>> these with a consistent time in an engine independent way. >>>>>>> >>>>>>> Ideally we just want to have one consistent way of answering the >>>>>>> question "what did the table look like at time t" >>>>>>> which I think we get by adding in a new field that is a timestamp, >>>>>>> set by the Catalog close to commit time, >>>>>>> that always goes up. >>>>>>> >>>>>>> I'm not sure we can really do this with an engine expression since >>>>>>> they won't know when the data is actually committed >>>>>>> when writing files? >>>>>>> >>>>>>> >>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> This seems like a lot of new complexity in the format. I would like >>>>>>>> us to explore whether we can build the considered use cases on top of >>>>>>>> expression-based defaults instead. >>>>>>>> >>>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar >>>>>>>> functions that are part of the SQL standard definition for default >>>>>>>> values. >>>>>>>> This would provide us a way to know the relative row order. True, this >>>>>>>> usually will represent the start of the operation. We may define >>>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual commit time, >>>>>>>> if >>>>>>>> there are use cases that need that. Plus, we may explore an approach >>>>>>>> similar to MySQL that allows users to reset the default value on >>>>>>>> update. >>>>>>>> >>>>>>>> - Anton >>>>>>>> >>>>>>>> вт, 13 січ. 2026 р. о 11:04 Russell Spitzer < >>>>>>>> [email protected]> пише: >>>>>>>> >>>>>>>>> I think this is the right step forward. Our current "timestamp" >>>>>>>>> definition is too ambiguous to be useful so establishing >>>>>>>>> a well defined and monotonic timestamp could be really great. I >>>>>>>>> also like the ability for row's to know this value without >>>>>>>>> having to rely on snapshot information which can be expired. >>>>>>>>> >>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> I have revised the row timestamp proposal with the following >>>>>>>>>> changes. >>>>>>>>>> * a new commit_timestamp field in snapshot metadata that has >>>>>>>>>> nano-second precision. >>>>>>>>>> * this optional field is only set by the REST catalog server >>>>>>>>>> * it needs to be monotonic (e.g. 
implemented using Lamport >>>>>>>>>> timestamp) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Steven >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks for the clarification, Ryan. >>>>>>>>>>> >>>>>>>>>>> For long-running streaming jobs that commit periodically, it is >>>>>>>>>>> difficult to establish the constant value of current_timestamp >>>>>>>>>>> across all >>>>>>>>>>> writer tasks for each commit cycle. I guess streaming writers may >>>>>>>>>>> just need >>>>>>>>>>> to write the wall clock time when appending a row to a data file >>>>>>>>>>> for the >>>>>>>>>>> default value of current_timestamp. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> I don't think that every row would have a different value. That >>>>>>>>>>>> would be up to the engine, but I would expect engines to insert >>>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a >>>>>>>>>>>> constant, >>>>>>>>>>>> resulting in a consistent value for all rows. >>>>>>>>>>>> >>>>>>>>>>>> You're right that this would not necessarily be the commit >>>>>>>>>>>> time. But neither is the commit timestamp from Iceberg's snapshot. >>>>>>>>>>>> I'm not >>>>>>>>>>>> sure how we are going to define "good enough" for this purpose. I >>>>>>>>>>>> think at >>>>>>>>>>>> least `CURRENT_TIMESTAMP` has reliable and known behavior when you >>>>>>>>>>>> look at >>>>>>>>>>>> how it is handled in engines. And if you want the Iceberg >>>>>>>>>>>> timestamp, then >>>>>>>>>>>> use a periodic query of the snapshot stable to keep track of them >>>>>>>>>>>> in a >>>>>>>>>>>> table you can join to. I don't think this rises to the need for a >>>>>>>>>>>> table >>>>>>>>>>>> feature unless we can guarantee that it is correct. >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> > Postgres `current_timestamp` captures the transaction start >>>>>>>>>>>>> time [1, 2]. Should we extend the same semantic to Iceberg: all >>>>>>>>>>>>> rows added >>>>>>>>>>>>> in the same snapshot should have the same timestamp value? >>>>>>>>>>>>> >>>>>>>>>>>>> Let me clarify my last comment. >>>>>>>>>>>>> >>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP) >>>>>>>>>>>>> >>>>>>>>>>>>> Since Postgres current_timestamp captures the transaction >>>>>>>>>>>>> start time, all rows added in the same insert transaction would >>>>>>>>>>>>> have the >>>>>>>>>>>>> same value as the transaction timestamp with the column >>>>>>>>>>>>> definition above. >>>>>>>>>>>>> >>>>>>>>>>>>> If we extend a similar semantic to Iceberg, all rows added in >>>>>>>>>>>>> the same Iceberg transaction/snapshot should have the same >>>>>>>>>>>>> timestamp? >>>>>>>>>>>>> >>>>>>>>>>>>> Ryan, I understand your comment for using current_timestamp >>>>>>>>>>>>> expression as column default value, you were thinking that the >>>>>>>>>>>>> engine would >>>>>>>>>>>>> set the column value to the wall clock time when appending a row >>>>>>>>>>>>> to a data >>>>>>>>>>>>> file, right? every row would almost have a different timestamp >>>>>>>>>>>>> value. 
>>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> `current_timestamp` expression may not always carry the right >>>>>>>>>>>>>> semantic for the use cases. E.g., latency tracking is interested >>>>>>>>>>>>>> in when >>>>>>>>>>>>>> records are added / committed to the table, not when the record >>>>>>>>>>>>>> was >>>>>>>>>>>>>> appended to an uncommitted data file in the processing engine. >>>>>>>>>>>>>> Record creation and Iceberg commit can be minutes or even hours >>>>>>>>>>>>>> apart. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Row timestamp inherited from snapshot timestamp has no >>>>>>>>>>>>>> overhead with the initial commit and has very minimal storage >>>>>>>>>>>>>> overhead >>>>>>>>>>>>>> during file rewrite. Per-row current_timestamp would have >>>>>>>>>>>>>> distinct values >>>>>>>>>>>>>> for every row and has more storage overhead. >>>>>>>>>>>>>> >>>>>>>>>>>>>> OLTP databases deal with small row-level transactions. >>>>>>>>>>>>>> Postgres `current_timestamp` captures the transaction start time >>>>>>>>>>>>>> [1, 2]. >>>>>>>>>>>>>> Should we extend the same semantic to Iceberg: all rows added in >>>>>>>>>>>>>> the same >>>>>>>>>>>>>> snapshot should have the same timestamp value? >>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] >>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html >>>>>>>>>>>>>> [2] >>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield < >>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal. >>>>>>>>>>>>>>>> To support the created_by timestamp, we would need to >>>>>>>>>>>>>>>> implement the following row lineage behavior >>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp >>>>>>>>>>>>>>>> * during rewrite (like compaction), it should be persisted >>>>>>>>>>>>>>>> into data files. >>>>>>>>>>>>>>>> * during update, it needs to be carried over from the >>>>>>>>>>>>>>>> previous row. This is similar to the row_id carry over for row >>>>>>>>>>>>>>>> updates. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sorry for the short hand. These are not the same: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1. Insertion time - time the row was inserted. >>>>>>>>>>>>>>> 2. Create by - The system that created the record. >>>>>>>>>>>>>>> 3. Updated by - The system that last updated the record. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Depending on the exact use-case these might or might not >>>>>>>>>>>>>>> have utility. I'm just wondering if there will be more example >>>>>>>>>>>>>>> like this >>>>>>>>>>>>>>> in the future. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> created_by column would incur likely significantly higher >>>>>>>>>>>>>>>> storage overhead compared to the updated_by column. As rows >>>>>>>>>>>>>>>> are updated >>>>>>>>>>>>>>>> overtime, the cardinality for this column in data files can be >>>>>>>>>>>>>>>> high. Hence, >>>>>>>>>>>>>>>> the created_by column may not compress well. This is a similar >>>>>>>>>>>>>>>> problem for >>>>>>>>>>>>>>>> the row_id column. One side effect of enabling row lineage by >>>>>>>>>>>>>>>> default for >>>>>>>>>>>>>>>> V3 tables is the storage overhead of row_id column after >>>>>>>>>>>>>>>> compaction >>>>>>>>>>>>>>>> especially for narrow tables with few columns. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I agree. I think this analysis also shows that some >>>>>>>>>>>>>>> consumers of Iceberg might not necessarily want to have all >>>>>>>>>>>>>>> these columns, >>>>>>>>>>>>>>> so we might want to make them configurable, rather than >>>>>>>>>>>>>>> mandating them for >>>>>>>>>>>>>>> all tables. Ryan's thought on default values seems like it >>>>>>>>>>>>>>> would solve the >>>>>>>>>>>>>>> issues I was raising. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> Micah >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to >>>>>>>>>>>>>>>> application developers. While some databases require an >>>>>>>>>>>>>>>> explicit column in >>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set the >>>>>>>>>>>>>>>> column value. >>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the >>>>>>>>>>>>>>>> trigger timestamp. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Since the use cases don't require an exact timestamp, this >>>>>>>>>>>>>>>> seems like the best solution to get what people want (an >>>>>>>>>>>>>>>> insertion >>>>>>>>>>>>>>>> timestamp) that has clear and well-defined behavior. Since >>>>>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it makes sense >>>>>>>>>>>>>>>> to me that >>>>>>>>>>>>>>>> we could use it and have reasonable behavior. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I've talked with Anton about this before and maybe he'll >>>>>>>>>>>>>>>> jump in on this thread. I think that we may need to extend >>>>>>>>>>>>>>>> default values >>>>>>>>>>>>>>>> to include default value expressions, like `current_timestamp` >>>>>>>>>>>>>>>> that is >>>>>>>>>>>>>>>> allowed by the SQL spec. That would solve the problem as well >>>>>>>>>>>>>>>> as some >>>>>>>>>>>>>>>> others (like `current_date` or `current_user`) and would not >>>>>>>>>>>>>>>> create a >>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp feature in >>>>>>>>>>>>>>>> the format. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> > Also some environments may have stronger clock service, >>>>>>>>>>>>>>>> like Spanner TrueTime service. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder commits >>>>>>>>>>>>>>>> and make timestamps out of order. I don't think that we should >>>>>>>>>>>>>>>> be making >>>>>>>>>>>>>>>> guarantees or even exposing metadata that people might mistake >>>>>>>>>>>>>>>> as having >>>>>>>>>>>>>>>> those guarantees. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu < >>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback! >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Regarding the concern for reliable timestamps, we are not >>>>>>>>>>>>>>>>> proposing using timestamps for ordering. With NTP in modern >>>>>>>>>>>>>>>>> computers, they >>>>>>>>>>>>>>>>> are generally reliable enough for the intended use cases. >>>>>>>>>>>>>>>>> Also some >>>>>>>>>>>>>>>>> environments may have stronger clock service, like Spanner >>>>>>>>>>>>>>>>> TrueTime service >>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency> >>>>>>>>>>>>>>>>> . >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> > joining to timestamps from the snapshots metadata >>>>>>>>>>>>>>>>> table. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot history, >>>>>>>>>>>>>>>>> which is often retained for a few days due to performance >>>>>>>>>>>>>>>>> reasons. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like `current_timestamp`) >>>>>>>>>>>>>>>>> rather than relying on an implicit one from table metadata. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to >>>>>>>>>>>>>>>>> application developers. While some databases require an >>>>>>>>>>>>>>>>> explicit column in >>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set the >>>>>>>>>>>>>>>>> column value. >>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the >>>>>>>>>>>>>>>>> trigger timestamp. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Also, the timestamp set during computation (like streaming >>>>>>>>>>>>>>>>> ingestion or relative long batch computation) doesn't capture >>>>>>>>>>>>>>>>> the time the >>>>>>>>>>>>>>>>> rows/files are added to the Iceberg table in a batch fashion. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a longer >>>>>>>>>>>>>>>>> history of snapshot timestamps, like storing a catalog's >>>>>>>>>>>>>>>>> event log for >>>>>>>>>>>>>>>>> long-term access to timestamp info >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> this is not really consumable by joining the regular table >>>>>>>>>>>>>>>>> query with catalog event log. I would also imagine catalog >>>>>>>>>>>>>>>>> event log is >>>>>>>>>>>>>>>>> capped at shorter retention (maybe a few months) compared to >>>>>>>>>>>>>>>>> data retention >>>>>>>>>>>>>>>>> (could be a few years). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at >>>>>>>>>>>>>>>>>> the row level. Timestamps in metadata that would be carried >>>>>>>>>>>>>>>>>> down to the row >>>>>>>>>>>>>>>>>> level already confuse people that expect them to be useful >>>>>>>>>>>>>>>>>> or reliable, >>>>>>>>>>>>>>>>>> rather than for debugging. I think extending this to the row >>>>>>>>>>>>>>>>>> level would >>>>>>>>>>>>>>>>>> only make the problem worse. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> You can already get this information by projecting the >>>>>>>>>>>>>>>>>> last updated sequence number, which is reliable, and joining >>>>>>>>>>>>>>>>>> to timestamps >>>>>>>>>>>>>>>>>> from the snapshots metadata table. Of course, the drawback >>>>>>>>>>>>>>>>>> there is losing >>>>>>>>>>>>>>>>>> the timestamp information when snapshots expire, but since >>>>>>>>>>>>>>>>>> it isn't >>>>>>>>>>>>>>>>>> reliable anyway I'd be fine with that. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance, are >>>>>>>>>>>>>>>>>> probably better served by embedding a timestamp in DML (like >>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit one >>>>>>>>>>>>>>>>>> from table >>>>>>>>>>>>>>>>>> metadata. And for those use cases, you could also keep a >>>>>>>>>>>>>>>>>> longer history of >>>>>>>>>>>>>>>>>> snapshot timestamps, like storing a catalog's event log for >>>>>>>>>>>>>>>>>> long-term >>>>>>>>>>>>>>>>>> access to timestamp info. I think that would be better than >>>>>>>>>>>>>>>>>> storing it at >>>>>>>>>>>>>>>>>> the row level. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> For V4 spec, I have a small proposal [1] to expose the >>>>>>>>>>>>>>>>>>> row timestamp concept that can help with many use cases >>>>>>>>>>>>>>>>>>> like temporal >>>>>>>>>>>>>>>>>>> queries, latency tracking, TTL, auditing and compliance. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms * metadata column >>>>>>>>>>>>>>>>>>> behaves very similarly to the >>>>>>>>>>>>>>>>>>> *_last_updated_sequence_number* for row lineage. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> - Initially, it inherits from the snapshot timestamp. >>>>>>>>>>>>>>>>>>> - During rewrite (like compaction), its values are >>>>>>>>>>>>>>>>>>> persisted in the data files. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Would love to hear what you think. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> Steven >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>
