Sounds good to me.

On Mon, Jan 26, 2026 at 11:59 AM Anton Okolnychyi <[email protected]> wrote:

Cool, sounds like a plan then? Thanks for answering all the questions, Steven!

On Thu, Jan 22, 2026 at 6:29 PM Steven Wu <[email protected]> wrote:

For row timestamp inheritance to work, I would need to implement the plumbing. So I would imagine existing rows would have null values because the inheritance plumbing was not there yet. This would be consistent with the upgrade behavior for V3 row lineage: https://iceberg.apache.org/spec/#row-lineage-for-upgraded-tables.

On Thu, Jan 22, 2026 at 4:09 PM Anton Okolnychyi <[email protected]> wrote:

Also, do we have a concrete plan for how to handle tables that would be upgraded to V4? What timestamp will we assign to existing rows?

On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <[email protected]> wrote:

If we ignore temporal queries that need strict snapshot boundaries and can't be solved completely using row timestamps in case of mutations, you mentioned other use cases where row timestamps may be helpful, like TTL and auditing. We can debate whether using CURRENT_TIMESTAMP() is enough for them, but I don't really see a point given that we already have row lineage in V3 and the storage overhead for one more field isn't likely to be noticeable. One of the problems with CURRENT_TIMESTAMP() is the required action by the user. Having a reliable row timestamp populated automatically is likely to be better, so +1.

On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:

Joining with snapshot history also has significant complexity. It requires retaining the entire snapshot history, probably with trimmed snapshot metadata. There are concerns about the size of the snapshot history for tables with frequent commits (like streaming ingestion). Do we maintain the unbounded trimmed snapshot history in the same table metadata, which could affect the metadata.json size? Or do we store it separately somewhere (like in the catalog), which would require the complexity of multi-entity transactions in the catalog?

On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <[email protected]> wrote:

I've gone back and forth on the inherited columns. I think the thing which keeps coming back to me is that I don't like that the only way to determine the timestamp associated with a row update/creation is to do a join back against table metadata. While that's doable, it feels user-unfriendly.

On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]> wrote:

Anton, you are right that row-level deletes will be a problem for some of the mentioned use cases (like incremental processing). I have clarified the applicability of some use cases to "tables with inserts and updates only".

Right now, we are only tracking the modification/commit time (not the insertion time) in case of updates.

On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]> wrote:

I think there is clear consensus that making snapshot timestamps strictly increasing is a positive thing. I am also +1.
- How will row timestamps allow us to reliably implement incremental consumption independent of the snapshot retention, given that rows can be added AND removed in a particular time frame? How can we capture all changes by just looking at the latest snapshot?
- Some use cases in the doc need the insertion time and some need the last modification time. Do we plan to support both?
- What do we expect the behavior to be in UPDATE and MERGE operations?

To be clear: I am not opposed to this change, I just want to make sure I understand all use cases that we aim to address and what would be required in engines.

On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <[email protected]> wrote:

+1 for improving how commit timestamps are assigned so that they are monotonic, since this requirement has emerged over multiple discussions like notifications, multi-table transactions, time travel accuracy, and row timestamps. It would be good to have a single consistent way to represent and assign timestamps that could be leveraged across multiple features.

On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:

Yeah, to add my perspective on that discussion, I think my primary concern is that people expect timestamps to be monotonic, and if they aren't then a `_last_update_timestamp` field just makes the problem worse. But it is _nice_ to have row-level timestamps. So I would be okay if we revisit how we assign commit timestamps and improve it so that you get monotonic behavior.

On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:

We had an offline discussion with Ryan. I revised the proposal as follows.

1. V4 would require writers to generate *monotonic* snapshot timestamps. The proposal doc has a section that describes a recommended implementation using Lamport timestamps.
2. Expose a *last_update_timestamp* metadata column that inherits from the snapshot timestamp.

This is a relatively low-friction change that can fix the time travel problem and enable use cases like latency tracking, temporal queries, TTL, and auditing.

There is no accuracy requirement on the timestamp values. In practice, modern servers with NTP have pretty reliable wall clocks. E.g., the Java library implements this validation <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377> that protects against backward clock drift of up to one minute for snapshot timestamps. I don't think we have heard many complaints about commit failures due to that clock drift validation.

Would appreciate feedback on the revised proposal.

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0

Thanks,
Steven
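A minimal sketch of the Lamport-style assignment described in the revised proposal above (illustrative only; the class name, method name, and time unit below are not taken from the proposal doc): each new snapshot timestamp is the larger of the wall clock and the previous timestamp plus one, so the sequence stays strictly increasing even if the clock steps backward.

import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a Lamport-style monotonic timestamp assigner.
// A real implementation would persist the last assigned value alongside the table
// metadata so the guarantee holds across writers, not just within one process.
class MonotonicCommitTimestamps {
  private final AtomicLong lastMicros = new AtomicLong();

  // Returns a strictly increasing timestamp in microseconds: the wall clock when it has
  // moved forward, otherwise the previous value + 1 (backward drift never reorders commits).
  long next() {
    long wallClockMicros = System.currentTimeMillis() * 1000L;
    return lastMicros.updateAndGet(prev -> Math.max(prev + 1, wallClockMicros));
  }
}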
On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]> wrote:

Steven, I was referring to the fact that CURRENT_TIMESTAMP() is usually evaluated quite early in engines, so we could theoretically have another expression closer to the commit time. You are right, though, it won't be the actual commit time given that we have to write it into the files. Also, I don't think generating a timestamp for a row as it is being written is going to be beneficial. To sum up, expression-based defaults would allow us to capture the time the transaction or write starts, but not the actual commit time.

Russell, if the goal is to know what happened to the table in a given time frame, isn't the changelog scan the way to go? It would assign commit ordinals based on lineage and include row-level diffs. How would you be able to determine changes with row timestamps by just looking at the latest snapshot?

It does seem promising to make snapshot timestamps strictly increasing to avoid ambiguity during time travel.

On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:

> Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense.

I'm not sure I get it then. If we want monotonically increasing times, but they don't have to be real times, then how do you know what notion of "time" you care about for these filters? Or to put it another way, how do you know that your "before" and "after" times are reasonable? If the boundaries of these time queries can move around a bit, by how much?

It seems to me that row IDs can play an important role here because you have the order guarantee that we seem to want for this use case: if snapshot A was committed before snapshot B, then the rows from A have row IDs that are always less than the row IDs of B. The problem is that we don't know where those row IDs start and end once A and B are no longer tracked. Using a "timestamp" seems to work, but I still worry that without reliable timestamps that correspond with some guarantee to real timestamps, we are creating a feature that seems reliable but isn't.

I'm somewhat open to the idea of introducing a snapshot timestamp that the catalog guarantees is monotonically increasing. But if we did that, wouldn't we still need to know the association between these timestamps and snapshots after the snapshot metadata expires?
My mental model is that this would be used to look for data that arrived, say, 3 weeks ago on Dec 24th. Since the snapshot metadata is no longer around, we could use the row timestamp to find those rows. But how do we know that the snapshot timestamps correspond to the actual timestamp range of Dec 24th? Is it just "close enough" as long as we don't have out-of-order timestamps? This is what I mean by needing to keep track of the association between timestamps and snapshots after the metadata expires. It seems like you either need to keep track of what the catalog's clock was for events you care about, or you don't really care about exact timestamps.

On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <[email protected]> wrote:

The key goal here is the ability to answer the question "what happened to the table in some time window (before < t < after)?" Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense. Downstream applications (from what I know) are mostly sensitive to getting discrete and well-defined answers to this question, like:

1 < t < 2 should be exclusive of
2 < t < 3 should be exclusive of
3 < t < 4

And the union of these should be the same as the query asking for 1 < t < 4.

Currently this is not possible because we have no guarantee of ordering in our timestamps:

Snapshots:        A  -> B  -> C
Sequence numbers: 50 -> 51 -> 52
Timestamps:       3  -> 1  -> 2

This makes time travel always a little wrong to start with.

The Java implementation only allows one minute of negative time on commit, so we actually kind of do have this as a "light monotonicity" requirement, but as noted above there is no spec requirement for this. While we do have the sequence number and row id, we still don't have a stable way of associating these with a consistent time in an engine-independent way.

Ideally we just want to have one consistent way of answering the question "what did the table look like at time t", which I think we get by adding in a new field that is a timestamp, set by the catalog close to commit time, that always goes up.

I'm not sure we can really do this with an engine expression since they won't know when the data is actually committed when writing files?
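To make the window semantics above concrete, here is a sketch of the disjoint-window property, assuming a monotonic row timestamp column were available to query (db.events and _last_updated_timestamp_ms are illustrative names; the column is only proposed in this thread, not something current Iceberg releases provide):

import org.apache.spark.sql.SparkSession;

public class DisjointTimeWindows {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("disjoint-time-windows").getOrCreate();

    // Half-open windows [t1, t2), [t2, t3), [t3, t4): adjacent windows never overlap and
    // their union is exactly the enclosing window [t1, t4). This only composes cleanly if
    // the timestamp column is monotonic with respect to commit order.
    String w1 = "SELECT * FROM db.events WHERE _last_updated_timestamp_ms >= 1000 AND _last_updated_timestamp_ms < 2000";
    String w2 = "SELECT * FROM db.events WHERE _last_updated_timestamp_ms >= 2000 AND _last_updated_timestamp_ms < 3000";
    String w3 = "SELECT * FROM db.events WHERE _last_updated_timestamp_ms >= 3000 AND _last_updated_timestamp_ms < 4000";
    String all = "SELECT * FROM db.events WHERE _last_updated_timestamp_ms >= 1000 AND _last_updated_timestamp_ms < 4000";

    // Absent concurrent commits between the scans, the three windows partition the range.
    long unionCount = spark.sql(w1).count() + spark.sql(w2).count() + spark.sql(w3).count();
    long enclosingCount = spark.sql(all).count();
    System.out.println(unionCount == enclosingCount);
  }
}

Half-open bounds are what make the windows compose; with strict inequalities on both ends, rows whose timestamp lands exactly on a boundary would be dropped from every window.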
On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <[email protected]> wrote:

This seems like a lot of new complexity in the format. I would like us to explore whether we can build the considered use cases on top of expression-based defaults instead.

We already plan to support CURRENT_TIMESTAMP() and similar functions that are part of the SQL standard definition for default values. This would provide us a way to know the relative row order. True, this usually will represent the start of the operation. We may define COMMIT_TIMESTAMP() or a similar expression for the actual commit time, if there are use cases that need that. Plus, we may explore an approach similar to MySQL that allows users to reset the default value on update.

- Anton

On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <[email protected]> wrote:

I think this is the right step forward. Our current "timestamp" definition is too ambiguous to be useful, so establishing a well-defined and monotonic timestamp could be really great. I also like the ability for rows to know this value without having to rely on snapshot information, which can be expired.

On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]> wrote:

Hi all,

I have revised the row timestamp proposal with the following changes.
* a new commit_timestamp field in snapshot metadata that has nanosecond precision
* this optional field is only set by the REST catalog server
* it needs to be monotonic (e.g. implemented using a Lamport timestamp)

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh

Thanks,
Steven

On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:

Thanks for the clarification, Ryan.

For long-running streaming jobs that commit periodically, it is difficult to establish a constant value of current_timestamp across all writer tasks for each commit cycle. I guess streaming writers may just need to write the wall clock time when appending a row to a data file as the default value of current_timestamp.

On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:

I don't think that every row would have a different value. That would be up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP` into the plan and then replace it with a constant, resulting in a consistent value for all rows.

You're right that this would not necessarily be the commit time. But neither is the commit timestamp from Iceberg's snapshot. I'm not sure how we are going to define "good enough" for this purpose. I think at least `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it is handled in engines. And if you want the Iceberg timestamp, then use a periodic query of the snapshots table to keep track of them in a table you can join to. I don't think this rises to the need for a table feature unless we can guarantee that it is correct.
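A sketch of how the expression-based default discussed above could look from the user side (hypothetical: default-value expressions are only planned for Iceberg, so the DDL below is illustrative syntax, and db.events/created_at are made-up names). The point is the expectation that the engine folds CURRENT_TIMESTAMP to one constant per statement, so every row written by a single INSERT carries the same value, roughly the statement start time rather than the Iceberg commit time.

import org.apache.spark.sql.SparkSession;

public class ExpressionDefaultSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("expression-default-sketch").getOrCreate();

    // Hypothetical DDL: expression defaults like CURRENT_TIMESTAMP are planned, not
    // something Iceberg tables support today.
    spark.sql(
        "CREATE TABLE db.events ("
            + "  id BIGINT,"
            + "  payload STRING,"
            + "  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP"
            + ") USING iceberg");

    // If the engine folds CURRENT_TIMESTAMP to a single constant per statement, both rows
    // below get the same created_at: roughly the statement start time, not the commit time.
    spark.sql("INSERT INTO db.events (id, payload) VALUES (1, 'a'), (2, 'b')");
  }
}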
On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:

> Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

Let me clarify my last comment.

created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP

Since Postgres current_timestamp captures the transaction start time, all rows added in the same insert transaction would have the same value as the transaction timestamp with the column definition above.

If we extend a similar semantic to Iceberg, should all rows added in the same Iceberg transaction/snapshot have the same timestamp?

Ryan, regarding your comment about using the current_timestamp expression as a column default value, you were thinking that the engine would set the column value to the wall clock time when appending a row to a data file, right? Almost every row would have a different timestamp value.

On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:

The `current_timestamp` expression may not always carry the right semantics for the use cases. E.g., latency tracking is interested in when records are added/committed to the table, not when the record was appended to an uncommitted data file in the processing engine.
Record creation and the Iceberg commit can be minutes or even hours apart.

A row timestamp inherited from the snapshot timestamp has no overhead at the initial commit and very minimal storage overhead during file rewrites. A per-row current_timestamp would have distinct values for every row and more storage overhead.

OLTP databases deal with small row-level transactions. Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

[1] https://www.postgresql.org/docs/current/functions-datetime.html
[2] https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp

On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]> wrote:

> Micah, are 1 and 2 the same? 3 is covered by this proposal.
> To support the created_by timestamp, we would need to implement the following row lineage behavior:
> * Initially, it inherits from the snapshot timestamp.
> * During rewrite (like compaction), it should be persisted into data files.
> * During update, it needs to be carried over from the previous row. This is similar to the row_id carry-over for row updates.

Sorry for the shorthand. These are not the same:

1. Insertion time - the time the row was inserted.
2. Created by - the system that created the record.
3. Updated by - the system that last updated the record.

Depending on the exact use case these might or might not have utility. I'm just wondering if there will be more examples like this in the future.

> The created_by column would likely incur significantly higher storage overhead compared to the updated_by column. As rows are updated over time, the cardinality for this column in data files can be high. Hence, the created_by column may not compress well. This is a similar problem for the row_id column. One side effect of enabling row lineage by default for V3 tables is the storage overhead of the row_id column after compaction, especially for narrow tables with few columns.
I agree. I think this analysis also shows that some consumers of Iceberg might not necessarily want to have all these columns, so we might want to make them configurable, rather than mandating them for all tables. Ryan's thought on default values seems like it would solve the issues I was raising.

Thanks,
Micah

On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:

> An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Since the use cases don't require an exact timestamp, this seems like the best solution to get what people want (an insertion timestamp) that has clear and well-defined behavior. Since `current_timestamp` is defined by the SQL spec, it makes sense to me that we could use it and have reasonable behavior.

I've talked with Anton about this before and maybe he'll jump in on this thread. I think that we may need to extend default values to include default value expressions, like `current_timestamp` that is allowed by the SQL spec. That would solve the problem as well as some others (like `current_date` or `current_user`) and would not create a potentially misleading (and heavyweight) timestamp feature in the format.

> Also some environments may have stronger clock service, like Spanner TrueTime service.

Even in cases like this, commit retries can reorder commits and make timestamps out of order. I don't think that we should be making guarantees or even exposing metadata that people might mistake as having those guarantees.

On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:

Ryan, thanks a lot for the feedback!

Regarding the concern for reliable timestamps, we are not proposing using timestamps for ordering.
With NTP in modern computers, they are generally reliable enough for the intended use cases. Also, some environments may have a stronger clock service, like the Spanner TrueTime service <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.

> joining to timestamps from the snapshots metadata table.

As you also mentioned, it depends on the snapshot history, which is often retained for a few days due to performance reasons.

> embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata.

An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Also, a timestamp set during computation (like streaming ingestion or a relatively long batch computation) doesn't capture the time the rows/files are added to the Iceberg table in a batch fashion.

> And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info.

This is not really consumable by joining a regular table query with the catalog event log. I would also imagine the catalog event log is capped at a shorter retention (maybe a few months) compared to data retention (which could be a few years).

On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:

I don't think it is a good idea to expose timestamps at the row level. Timestamps in metadata that would be carried down to the row level already confuse people that expect them to be useful or reliable, rather than for debugging. I think extending this to the row level would only make the problem worse.

You can already get this information by projecting the last updated sequence number, which is reliable, and joining to timestamps from the snapshots metadata table. Of course, the drawback there is losing the timestamp information when snapshots expire, but since it isn't reliable anyway I'd be fine with that.

Some of the use cases, like auditing and compliance, are probably better served by embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata. And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info. I think that would be better than storing it at the row level.
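A sketch of the join described above, assuming Spark with Iceberg (db.events is an illustrative table name, and whether the snapshots metadata table exposes a sequence_number column depends on the Iceberg version, so treat that join key as an assumption): project the _last_updated_sequence_number metadata column and map it to snapshot commit timestamps while those snapshots are still retained.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SequenceNumberJoinSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("seq-number-join-sketch").getOrCreate();

    // Approximate each row's last-update time by joining its last updated sequence number
    // to the snapshot that produced it. This only works while those snapshots are retained,
    // and it assumes the engine can project the _last_updated_sequence_number metadata
    // column and that the snapshots metadata table exposes a sequence_number column.
    Dataset<Row> rowsWithTimestamps = spark.sql(
        "SELECT t.*, s.committed_at AS approx_last_updated_at "
            + "FROM db.events t "
            + "JOIN db.events.snapshots s "
            + "  ON t._last_updated_sequence_number = s.sequence_number");

    rowsWithTimestamps.show();
  }
}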
On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:

Hi,

For the V4 spec, I have a small proposal [1] to expose the row timestamp concept that can help with many use cases like temporal queries, latency tracking, TTL, auditing and compliance.

This *_last_updated_timestamp_ms* metadata column behaves very similarly to the *_last_updated_sequence_number* for row lineage.

- Initially, it inherits from the snapshot timestamp.
- During rewrite (like compaction), its values are persisted in the data files.

Would love to hear what you think.

Thanks,
Steven

[1] https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
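For a sense of how the proposed column might be consumed, a hypothetical audit-style query (again, _last_updated_timestamp_ms is only proposed and db.events is an illustrative name): count rows that have not been modified in the last 90 days, for example as a precursor to a TTL delete.

import org.apache.spark.sql.SparkSession;

public class RowTimestampUsageSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("row-timestamp-usage").getOrCreate();

    // Hypothetical: _last_updated_timestamp_ms is the metadata column proposed in this
    // thread and does not exist in current Iceberg releases.
    long ninetyDaysMs = 90L * 24 * 60 * 60 * 1000;
    spark.sql(
            "SELECT count(*) AS stale_rows FROM db.events "
                + "WHERE _last_updated_timestamp_ms < unix_millis(current_timestamp()) - "
                + ninetyDaysMs)
        .show();
  }
}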
