For row timestamp inheritance to work, I would need to implement the plumbing. So I would imagine existing rows would have null values, because the inheritance plumbing was not there when they were written. This would be consistent with the upgrade behavior for V3 row lineage: https://iceberg.apache.org/spec/#row-lineage-for-upgraded-tables.
On Thu, Jan 22, 2026 at 4:09 PM Anton Okolnychyi <[email protected]> wrote:

Also, do we have a concrete plan for how to handle tables that would be upgraded to V4? What timestamp will we assign to existing rows?

On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <[email protected]> wrote:

If we ignore temporal queries that need strict snapshot boundaries and can't be solved completely using row timestamps in case of mutations, you mentioned other use cases where row timestamps may be helpful, like TTL and auditing. We can debate whether using CURRENT_TIMESTAMP() is enough for them, but I don't really see a point given that we already have row lineage in V3 and the storage overhead for one more field isn't likely to be noticeable. One of the problems with CURRENT_TIMESTAMP() is the required action by the user. Having a reliable row timestamp populated automatically is likely to be better, so +1.

On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:

Joining with snapshot history also has significant complexity. It requires retaining the entire snapshot history, probably with trimmed snapshot metadata. There are concerns about the size of the snapshot history for tables with frequent commits (like streaming ingestion). Do we maintain the unbounded trimmed snapshot history in the same table metadata, which could affect the metadata.json size? Or do we store it separately somewhere (like in the catalog), which would require the complexity of multi-entity transactions in the catalog?

On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <[email protected]> wrote:

I've gone back and forth on the inherited columns. The thing that keeps coming back to me is that I don't like that the only way to determine the timestamp associated with a row update/creation is to do a join back against table metadata. While that's doable, it feels user unfriendly.

On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]> wrote:

Anton, you are right that row-level deletes will be a problem for some of the mentioned use cases (like incremental processing). I have clarified the applicability of some use cases to "tables with inserts and updates only".

Right now, we are only tracking the modification/commit time (not the insertion time) in case of updates.

On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]> wrote:

I think there is clear consensus that making snapshot timestamps strictly increasing is a positive thing. I am also +1.

- How will row timestamps allow us to reliably implement incremental consumption independent of the snapshot retention, given that rows can be added AND removed in a particular time frame? How can we capture all changes by just looking at the latest snapshot?
- Some use cases in the doc need the insertion time and some need the last modification time. Do we plan to support both?
- What do we expect the behavior to be in UPDATE and MERGE operations?

To be clear: I am not opposed to this change, I just want to make sure I understand all the use cases that we aim to address and what would be required in engines.
On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <[email protected]> wrote:

+1 for improving how the commit timestamps are assigned monotonically, since this requirement has emerged over multiple discussions like notifications, multi-table transactions, time travel accuracy, and row timestamps. It would be good to have a single consistent way to represent and assign timestamps that could be leveraged across multiple features.

On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:

Yeah, to add my perspective on that discussion, I think my primary concern is that people expect timestamps to be monotonic, and if they aren't then a `_last_update_timestamp` field just makes the problem worse. But it is _nice_ to have row-level timestamps. So I would be okay if we revisit how we assign commit timestamps and improve it so that you get monotonic behavior.

On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:

We had an offline discussion with Ryan. I revised the proposal as follows.

1. V4 would require writers to generate *monotonic* snapshot timestamps. The proposal doc has a section that describes a recommended implementation using Lamport timestamps.
2. Expose a *last_update_timestamp* metadata column that inherits from the snapshot timestamp.

This is a relatively low-friction change that can fix the time travel problem and enable use cases like latency tracking, temporal queries, TTL, and auditing.

There is no accuracy requirement on the timestamp values. In practice, modern servers with NTP have pretty reliable wall clocks. E.g., the Java library implements a validation <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377> that protects against backward clock drift of up to one minute for snapshot timestamps. I don't think we have heard many complaints of commit failures due to that clock drift validation.

Would appreciate feedback on the revised proposal.

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0

Thanks,
Steven
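A minimal sketch of the Lamport-style assignment described above (an assumption about what the recommended implementation could look like, not text from the proposal doc): the committer takes the maximum of its wall clock and the previously assigned timestamp plus one, so snapshot timestamps never move backwards even when the clock does.

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative sketch only: monotonic snapshot timestamps via a Lamport-style rule.
    // Each new timestamp is max(wall clock, last assigned + 1), so assigned values never
    // go backwards even if the local clock drifts behind the previous commit.
    class MonotonicTimestampAssigner {
      private final AtomicLong lastAssignedNanos;

      MonotonicTimestampAssigner(long lastKnownSnapshotTimestampNanos) {
        this.lastAssignedNanos = new AtomicLong(lastKnownSnapshotTimestampNanos);
      }

      long nextSnapshotTimestampNanos() {
        long wallClockNanos = System.currentTimeMillis() * 1_000_000L; // millis -> nanos
        return lastAssignedNanos.updateAndGet(prev -> Math.max(wallClockNanos, prev + 1));
      }
    }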
On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]> wrote:

Steven, I was referring to the fact that CURRENT_TIMESTAMP() is usually evaluated quite early in engines, so we could theoretically have another expression closer to the commit time. You are right, though, that it won't be the actual commit time given that we have to write it into the files. Also, I don't think generating a timestamp for a row as it is being written is going to be beneficial. To sum up, expression-based defaults would allow us to capture the time the transaction or write starts, but not the actual commit time.

Russell, if the goal is to know what happened to the table in a given time frame, isn't the changelog scan the way to go? It would assign commit ordinals based on lineage and include row-level diffs. How would you be able to determine changes with row timestamps by just looking at the latest snapshot?

It does seem promising to make snapshot timestamps strictly increasing to avoid ambiguity during time travel.

On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:

> Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense.

I'm not sure I get it then. If we want monotonically increasing times, but they don't have to be real times, then how do you know what notion of "time" you care about for these filters? Or to put it another way, how do you know that your "before" and "after" times are reasonable? If the boundaries of these time queries can move around a bit, by how much?

It seems to me that row IDs can play an important role here because you have the order guarantee that we seem to want for this use case: if snapshot A was committed before snapshot B, then the rows from A have row IDs that are always less than the row IDs of B. The problem is that we don't know where those row IDs start and end once A and B are no longer tracked. Using a "timestamp" seems to work, but I still worry that without reliable timestamps that correspond with some guarantee to real timestamps, we are creating a feature that seems reliable but isn't.

I'm somewhat open to the idea of introducing a snapshot timestamp that the catalog guarantees is monotonically increasing. But if we did that, wouldn't we still need to know the association between these timestamps and snapshots after the snapshot metadata expires? My mental model is that this would be used to look for data that arrived, say, 3 weeks ago on Dec 24th. Since the snapshot metadata is no longer around, we could use the row timestamp to find those rows. But how do we know that the snapshot timestamps correspond to the actual timestamp range of Dec 24th? Is it just "close enough" as long as we don't have out-of-order timestamps? This is what I mean by needing to keep track of the association between timestamps and snapshots after the metadata expires. It seems like you either need to keep track of what the catalog's clock was for events you care about, or you don't really care about exact timestamps.

On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <[email protected]> wrote:

The key goal here is the ability to answer the question "what happened to the table in some time window (before < t < after)?" Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense. Downstream applications (from what I know) are mostly sensitive to getting discrete and well-defined answers to this question, like:

1 < t < 2 should be exclusive of
2 < t < 3 should be exclusive of
3 < t < 4

And the union of these should be the same as the query asking for 1 < t < 4.

Currently this is not possible because we have no guarantee of ordering in our timestamps:

Snapshots:        A  -> B  -> C
Sequence numbers: 50 -> 51 -> 52
Timestamps:       3  -> 1  -> 2

This makes time travel always a little wrong to start with.

The Java implementation only allows one minute of negative time on commit, so we actually kind of do have this as a "light monotonicity" requirement, but as noted above there is no spec requirement for this. While we do have the sequence number and row id, we still don't have a stable way of associating these with a consistent time in an engine-independent way.

Ideally we just want to have one consistent way of answering the question "what did the table look like at time t", which I think we get by adding a new field that is a timestamp, set by the catalog close to commit time, that always goes up.

I'm not sure we can really do this with an engine expression since engines won't know when the data is actually committed when writing files?
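As a rough paraphrase of the one-minute tolerance Russell and Steven mention (illustrative only; the actual check lives in TableMetadata in the Java library, linked earlier in the thread):

    // Paraphrase of the "one minute of negative time" validation discussed above.
    // A new snapshot timestamp may not be more than one minute older than the last one.
    class CommitTimestampValidation {
      private static final long ONE_MINUTE_MILLIS = 60_000L;

      static void validate(long lastSnapshotTimestampMillis, long newSnapshotTimestampMillis) {
        if (newSnapshotTimestampMillis < lastSnapshotTimestampMillis - ONE_MINUTE_MILLIS) {
          throw new IllegalStateException(
              "Snapshot timestamp moved backwards by more than one minute");
        }
      }
    }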
On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <[email protected]> wrote:

This seems like a lot of new complexity in the format. I would like us to explore whether we can build the considered use cases on top of expression-based defaults instead.

We already plan to support CURRENT_TIMESTAMP() and similar functions that are part of the SQL standard definition for default values. This would provide us a way to know the relative row order. True, this usually will represent the start of the operation. We may define COMMIT_TIMESTAMP() or a similar expression for the actual commit time, if there are use cases that need that. Plus, we may explore an approach similar to MySQL that allows users to reset the default value on update.

- Anton

On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <[email protected]> wrote:

I think this is the right step forward. Our current "timestamp" definition is too ambiguous to be useful, so establishing a well-defined and monotonic timestamp could be really great. I also like the ability for rows to know this value without having to rely on snapshot information, which can be expired.
On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]> wrote:

Hi all,

I have revised the row timestamp proposal with the following changes.
* a new commit_timestamp field in snapshot metadata that has nanosecond precision
* this optional field is only set by the REST catalog server
* it needs to be monotonic (e.g. implemented using a Lamport timestamp)

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh

Thanks,
Steven

On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:

Thanks for the clarification, Ryan.

For long-running streaming jobs that commit periodically, it is difficult to establish a constant value of current_timestamp across all writer tasks for each commit cycle. I guess streaming writers may just need to write the wall clock time when appending a row to a data file for the default value of current_timestamp.

On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:

I don't think that every row would have a different value. That would be up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP` into the plan and then replace it with a constant, resulting in a consistent value for all rows.

You're right that this would not necessarily be the commit time. But neither is the commit timestamp from Iceberg's snapshot. I'm not sure how we are going to define "good enough" for this purpose. I think at least `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it is handled in engines. And if you want the Iceberg timestamp, then use a periodic query of the snapshots table to keep track of them in a table you can join to. I don't think this rises to the need for a table feature unless we can guarantee that it is correct.
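To make the contrast concrete, here is a toy sketch (not how any particular engine implements defaults) of the two behaviors being discussed: folding CURRENT_TIMESTAMP to a single constant per write stamps every row with the same value, while per-row evaluation yields nearly distinct values.

    import java.time.Instant;
    import java.util.Collections;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    // Toy illustration of the two CURRENT_TIMESTAMP behaviors discussed above.
    class CurrentTimestampDefaultDemo {
      // Constant-folded: evaluated once, shared by all rows in the write.
      static List<Instant> constantFolded(int rowCount) {
        Instant writeTime = Instant.now();
        return Collections.nCopies(rowCount, writeTime);
      }

      // Per-row evaluation: each appended row reads the wall clock again.
      static List<Instant> perRow(int rowCount) {
        return IntStream.range(0, rowCount)
            .mapToObj(i -> Instant.now())
            .collect(Collectors.toList());
      }
    }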
On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:

> Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

Let me clarify my last comment.

created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP

Since Postgres current_timestamp captures the transaction start time, all rows added in the same insert transaction would have the same value as the transaction timestamp with the column definition above.

If we extend a similar semantic to Iceberg, should all rows added in the same Iceberg transaction/snapshot have the same timestamp?

Ryan, to make sure I understand your comment about using the current_timestamp expression as a column default value: you were thinking that the engine would set the column value to the wall clock time when appending a row to a data file, right? Then almost every row would have a different timestamp value.

On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:

The `current_timestamp` expression may not always carry the right semantics for the use cases. E.g., latency tracking is interested in when records are added / committed to the table, not when the record was appended to an uncommitted data file in the processing engine. Record creation and Iceberg commit can be minutes or even hours apart.

A row timestamp inherited from the snapshot timestamp has no overhead at the initial commit and has very minimal storage overhead during file rewrite. A per-row current_timestamp would have distinct values for every row and has more storage overhead.

OLTP databases deal with small row-level transactions. Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

[1] https://www.postgresql.org/docs/current/functions-datetime.html
[2] https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]> wrote:

> Micah, are 1 and 2 the same? 3 is covered by this proposal. To support the created_by timestamp, we would need to implement the following row lineage behavior:
> * Initially, it inherits from the snapshot timestamp.
> * During rewrite (like compaction), it should be persisted into data files.
> * During update, it needs to be carried over from the previous row. This is similar to the row_id carry-over for row updates.

Sorry for the shorthand. These are not the same:

1. Insertion time - the time the row was inserted.
2. Created by - the system that created the record.
3. Updated by - the system that last updated the record.

Depending on the exact use case these might or might not have utility. I'm just wondering if there will be more examples like this in the future.

> The created_by column would likely incur significantly higher storage overhead compared to the updated_by column. As rows are updated over time, the cardinality of this column in data files can be high. Hence, the created_by column may not compress well. This is a similar problem for the row_id column. One side effect of enabling row lineage by default for V3 tables is the storage overhead of the row_id column after compaction, especially for narrow tables with few columns.

I agree. I think this analysis also shows that some consumers of Iceberg might not necessarily want to have all these columns, so we might want to make them configurable rather than mandating them for all tables. Ryan's thought on default values seems like it would solve the issues I was raising.

Thanks,
Micah
On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:

> An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Since the use cases don't require an exact timestamp, this seems like the best solution to get what people want (an insertion timestamp) that has clear and well-defined behavior. Since `current_timestamp` is defined by the SQL spec, it makes sense to me that we could use it and have reasonable behavior.

I've talked with Anton about this before and maybe he'll jump in on this thread. I think that we may need to extend default values to include default value expressions, like `current_timestamp`, which is allowed by the SQL spec. That would solve the problem as well as some others (like `current_date` or `current_user`) and would not create a potentially misleading (and heavyweight) timestamp feature in the format.

> Also some environments may have a stronger clock service, like the Spanner TrueTime service.

Even in cases like this, commit retries can reorder commits and make timestamps out of order. I don't think that we should be making guarantees or even exposing metadata that people might mistake as having those guarantees.

On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:

Ryan, thanks a lot for the feedback!

Regarding the concern about reliable timestamps, we are not proposing using timestamps for ordering. With NTP in modern computers, they are generally reliable enough for the intended use cases. Also, some environments may have a stronger clock service, like the Spanner TrueTime service <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.

> joining to timestamps from the snapshots metadata table.

As you also mentioned, it depends on the snapshot history, which is often retained for only a few days due to performance reasons.

> embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata.

An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Also, a timestamp set during computation (like streaming ingestion or a relatively long batch computation) doesn't capture the time the rows/files are added to the Iceberg table in a batch fashion.

> And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info.

This is not really consumable by joining a regular table query with the catalog event log. I would also imagine the catalog event log is capped at a shorter retention (maybe a few months) compared to data retention (which could be a few years).
On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:

I don't think it is a good idea to expose timestamps at the row level. Timestamps in metadata that would be carried down to the row level already confuse people that expect them to be useful or reliable, rather than for debugging. I think extending this to the row level would only make the problem worse.

You can already get this information by projecting the last updated sequence number, which is reliable, and joining to timestamps from the snapshots metadata table. Of course, the drawback there is losing the timestamp information when snapshots expire, but since it isn't reliable anyway I'd be fine with that.

Some of the use cases, like auditing and compliance, are probably better served by embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata. And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info. I think that would be better than storing it at the row level.
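A rough sketch of the join Ryan describes (names here are assumptions for illustration: `db.events` is a hypothetical table, and `snapshot_seq` is a view assumed to map sequence numbers to commit timestamps, e.g. maintained from periodic queries of the snapshots metadata table; engine support for the `_last_updated_sequence_number` metadata column may vary):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Illustrative only: approximate a per-row "last modified" time by joining the
    // _last_updated_sequence_number metadata column to per-snapshot commit times.
    // snapshot_seq is an assumed view mapping sequence_number -> committed_at.
    class RowTimestampViaJoin {
      static Dataset<Row> lastModified(SparkSession spark) {
        return spark.sql(
            "SELECT t.*, s.committed_at AS approx_last_modified_at "
                + "FROM db.events t "
                + "JOIN snapshot_seq s "
                + "  ON t._last_updated_sequence_number = s.sequence_number");
      }
    }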
On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:

Hi,

For the V4 spec, I have a small proposal [1] to expose a row timestamp concept that can help with many use cases like temporal queries, latency tracking, TTL, auditing, and compliance.

This *_last_updated_timestamp_ms* metadata column behaves very similarly to the *_last_updated_sequence_number* column for row lineage.

- Initially, it inherits from the snapshot timestamp.
- During rewrite (like compaction), its values are persisted in the data files.

Would love to hear what you think.

Thanks,
Steven

[1] https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
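A minimal sketch of the inheritance rule the proposal describes (column and method names are illustrative): the value is absent in newly written data files and is resolved from the committing snapshot's timestamp at read time; once a file is rewritten, the resolved value is persisted and carried forward unchanged.

    // Minimal sketch of the proposed inheritance rule (illustrative; mirrors how
    // _last_updated_sequence_number inheritance works for row lineage):
    //  - value absent in the data file  -> inherit the committing snapshot's timestamp
    //  - value present (file rewritten) -> use the persisted value as-is
    class LastUpdatedTimestampInheritance {
      static long resolve(Long persistedTimestampMs, long snapshotTimestampMs) {
        return persistedTimestampMs != null ? persistedTimestampMs : snapshotTimestampMs;
      }
    }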
