Cool, sounds like a plan then? Thanks for answering all the questions, Steven!
On Thu, Jan 22, 2026 at 6:29 PM Steven Wu <[email protected]> wrote:

For row timestamp inheritance to work, I would need to implement the plumbing. So I would imagine existing rows would have null values because the inheritance plumbing was not there yet. This would be consistent with the upgrade behavior for V3 row lineage: https://iceberg.apache.org/spec/#row-lineage-for-upgraded-tables.

On Thu, Jan 22, 2026 at 4:09 PM Anton Okolnychyi <[email protected]> wrote:

Also, do we have a concrete plan for how to handle tables that would be upgraded to V4? What timestamp will we assign to existing rows?

On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <[email protected]> wrote:

If we ignore temporal queries that need strict snapshot boundaries and can't be solved completely using row timestamps in case of mutations, you mentioned other use cases where row timestamps may be helpful, like TTL and auditing. We can debate whether using CURRENT_TIMESTAMP() is enough for them, but I don't really see a point given that we already have row lineage in V3 and the storage overhead for one more field isn't likely to be noticeable. One of the problems with CURRENT_TIMESTAMP() is the required action by the user. Having a reliable row timestamp populated automatically is likely to be better, so +1.

On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:

Joining with snapshot history also has significant complexity. It requires retaining the entire snapshot history, probably with trimmed snapshot metadata. There are concerns about the size of the snapshot history for tables with frequent commits (like streaming ingestion). Do we maintain the unbounded trimmed snapshot history in the same table metadata, which could affect table metadata.json size? Or do we store it separately somewhere (like in the catalog), which would require the complexity of multi-entity transactions in the catalog?

On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <[email protected]> wrote:

I've gone back and forth on the inherited columns. I think the thing which keeps coming back to me is that I don't like that the only way to determine the timestamp associated with a row update/creation is to do a join back against table metadata. While that's doable, it feels user-unfriendly.

On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]> wrote:

Anton, you are right that the row-level deletes will be a problem for some of the mentioned use cases (like incremental processing). I have clarified the applicability of some use cases to "tables with inserts and updates only".

Right now, we are only tracking modification/commit time (not insertion time) in case of updates.

On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]> wrote:

I think there is clear consensus that making snapshot timestamps strictly increasing is a positive thing. I am also +1.

- How will row timestamps allow us to reliably implement incremental consumption independent of the snapshot retention, given that rows can be added AND removed in a particular time frame? How can we capture all changes by just looking at the latest snapshot?
- Some use cases in the doc need the insertion time and some need the last modification time. Do we plan to support both?
- What do we expect the behavior to be in UPDATE and MERGE operations?

To be clear: I am not opposed to this change, I just want to make sure I understand all use cases that we aim to address and what would be required in engines.

On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <[email protected]> wrote:

+1 for improving how the commit timestamps are assigned monotonically, since this requirement has emerged over multiple discussions like notifications, multi-table transactions, time travel accuracy, and row timestamps. It would be good to have a single consistent way to represent and assign timestamps that could be leveraged across multiple features.

On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:

Yeah, to add my perspective on that discussion, I think my primary concern is that people expect timestamps to be monotonic, and if they aren't, then a `_last_update_timestamp` field just makes the problem worse. But it is _nice_ to have row-level timestamps. So I would be okay if we revisit how we assign commit timestamps and improve it so that you get monotonic behavior.

On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:

We had an offline discussion with Ryan. I revised the proposal as follows.

1. V4 would require writers to generate *monotonic* snapshot timestamps. The proposal doc has a section that describes a recommended implementation using Lamport timestamps.
2. Expose a *last_update_timestamp* metadata column that inherits from the snapshot timestamp.

This is a relatively low-friction change that can fix the time travel problem and enable use cases like latency tracking, temporal queries, TTL, and auditing.

There is no accuracy requirement on the timestamp values. In practice, modern servers with NTP have pretty reliable wall clocks. E.g., the Java library implements this validation <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377> that protects against backward clock drift of up to one minute for snapshot timestamps. I don't think we have heard many complaints about commit failures due to that clock drift validation.

Would appreciate feedback on the revised proposal.

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0

Thanks,
Steven
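As a concrete illustration of the latency-tracking use case mentioned above: once a commit-derived row timestamp is exposed as a queryable column, latency questions can be answered directly. This is only a sketch; the column name `_last_updated_timestamp_ms` follows the original proposal in this thread, and `event_ts_ms` is an assumed user column holding the record's creation time in epoch milliseconds.

```sql
-- Sketch only: ingestion latency = commit-derived row timestamp minus the
-- record's own event time; both values are epoch milliseconds.
SELECT avg(_last_updated_timestamp_ms - event_ts_ms) AS avg_commit_latency_ms,
       max(_last_updated_timestamp_ms - event_ts_ms) AS max_commit_latency_ms
FROM db.events
WHERE _last_updated_timestamp_ms > 1767225600000;  -- restrict to commits after some cutoff
```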
On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]> wrote:

Steven, I was referring to the fact that CURRENT_TIMESTAMP() is usually evaluated quite early in engines, so we could theoretically have another expression closer to the commit time. You are right, though, it won't be the actual commit time given that we have to write it into the files. Also, I don't think generating a timestamp for a row as it is being written is going to be beneficial. To sum up, expression-based defaults would allow us to capture the time the transaction or write starts, but not the actual commit time.

Russell, if the goal is to know what happened to the table in a given time frame, isn't the changelog scan the way to go? It would assign commit ordinals based on lineage and include row-level diffs. How would you be able to determine changes with row timestamps by just looking at the latest snapshot?

It does seem promising to make snapshot timestamps strictly increasing to avoid ambiguity during time travel.

On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:

> Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense.

I'm not sure I get it then. If we want monotonically increasing times, but they don't have to be real times, then how do you know what notion of "time" you care about for these filters? Or to put it another way, how do you know that your "before" and "after" times are reasonable? If the boundaries of these time queries can move around a bit, by how much?

It seems to me that row IDs can play an important role here because you have the order guarantee that we seem to want for this use case: if snapshot A was committed before snapshot B, then the rows from A have row IDs that are always less than the row IDs of B. The problem is that we don't know where those row IDs start and end once A and B are no longer tracked. Using a "timestamp" seems to work, but I still worry that without reliable timestamps that correspond with some guarantee to real timestamps, we are creating a feature that seems reliable but isn't.

I'm somewhat open to the idea of introducing a snapshot timestamp that the catalog guarantees is monotonically increasing. But if we did that, wouldn't we still need to know the association between these timestamps and snapshots after the snapshot metadata expires? My mental model is that this would be used to look for data that arrived, say, 3 weeks ago on Dec 24th. Since the snapshot metadata is no longer around, we could use the row timestamp to find those rows. But how do we know that the snapshot timestamps correspond to the actual timestamp range of Dec 24th? Is it just "close enough" as long as we don't have out-of-order timestamps? This is what I mean by needing to keep track of the association between timestamps and snapshots after the metadata expires. Seems like you either need to keep track of what the catalog's clock was for events you care about, or you don't really care about exact timestamps.
On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <[email protected]> wrote:

The key goal here is the ability to answer the question "what happened to the table in some time window (before < t < after)?" Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense. Downstream applications (from what I know) are mostly sensitive to getting discrete and well-defined answers to this question, like:

1 < t < 2 should be exclusive of
2 < t < 3 should be exclusive of
3 < t < 4

And the union of these should be the same as the query asking for 1 < t < 4.

Currently this is not possible because we have no guarantee of ordering in our timestamps:

Snapshots: A -> B -> C
Sequence numbers: 50 -> 51 -> 52
Timestamps: 3 -> 1 -> 2

This makes time travel always a little wrong to start with.

The Java implementation only allows one minute of negative time on commit, so we actually kind of do have this as a "light monotonicity" requirement, but as noted above there is no spec requirement for this. While we do have the sequence number and row id, we still don't have a stable way of associating these with a consistent time in an engine-independent way.

Ideally we just want to have one consistent way of answering the question "what did the table look like at time t", which I think we get by adding a new field that is a timestamp, set by the Catalog close to commit time, that always goes up.

I'm not sure we can really do this with an engine expression since they won't know when the data is actually committed when writing files?
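A sketch of the window queries Russell describes, assuming the commit-derived timestamp is exposed as a queryable column (the name `_last_updated_timestamp_ms` is taken from the original proposal and is illustrative). Using half-open windows makes the exclusivity and union properties exact:

```sql
-- Adjacent half-open windows: disjoint by construction.
SELECT * FROM db.t WHERE _last_updated_timestamp_ms > 1000 AND _last_updated_timestamp_ms <= 2000;
SELECT * FROM db.t WHERE _last_updated_timestamp_ms > 2000 AND _last_updated_timestamp_ms <= 3000;

-- With monotonic commit timestamps, the union of the two results above should
-- match this single query over the combined range; without monotonicity it may not.
SELECT * FROM db.t WHERE _last_updated_timestamp_ms > 1000 AND _last_updated_timestamp_ms <= 3000;
```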
On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <[email protected]> wrote:

This seems like a lot of new complexity in the format. I would like us to explore whether we can build the considered use cases on top of expression-based defaults instead.

We already plan to support CURRENT_TIMESTAMP() and similar functions that are part of the SQL standard definition for default values. This would provide us a way to know the relative row order. True, this usually will represent the start of the operation. We may define COMMIT_TIMESTAMP() or a similar expression for the actual commit time, if there are use cases that need that. Plus, we may explore an approach similar to MySQL that allows users to reset the default value on update.

- Anton

On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <[email protected]> wrote:

I think this is the right step forward. Our current "timestamp" definition is too ambiguous to be useful, so establishing a well-defined and monotonic timestamp could be really great. I also like the ability for rows to know this value without having to rely on snapshot information, which can be expired.

On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]> wrote:

Hi all,

I have revised the row timestamp proposal with the following changes.
* a new commit_timestamp field in snapshot metadata that has nanosecond precision
* this optional field is only set by the REST catalog server
* it needs to be monotonic (e.g., implemented using a Lamport timestamp)

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh

Thanks,
Steven

On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:

Thanks for the clarification, Ryan.

For long-running streaming jobs that commit periodically, it is difficult to establish a constant value of current_timestamp across all writer tasks for each commit cycle. I guess streaming writers may just need to write the wall clock time when appending a row to a data file for the default value of current_timestamp.

On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:

I don't think that every row would have a different value. That would be up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP` into the plan and then replace it with a constant, resulting in a consistent value for all rows.

You're right that this would not necessarily be the commit time. But neither is the commit timestamp from Iceberg's snapshot. I'm not sure how we are going to define "good enough" for this purpose. I think at least `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it is handled in engines. And if you want the Iceberg timestamp, then use a periodic query of the snapshots table to keep track of them in a table you can join to. I don't think this rises to the need for a table feature unless we can guarantee that it is correct.
On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:

> Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

Let me clarify my last comment.

created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP

Since Postgres current_timestamp captures the transaction start time, all rows added in the same insert transaction would have the same value as the transaction timestamp with the column definition above.

If we extend a similar semantic to Iceberg, all rows added in the same Iceberg transaction/snapshot should have the same timestamp?

Ryan, regarding your comment about using the current_timestamp expression as a column default value: you were thinking that the engine would set the column value to the wall clock time when appending a row to a data file, right? Then almost every row would have a different timestamp value.

On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:

The `current_timestamp` expression may not always carry the right semantics for the use cases. E.g., latency tracking is interested in when records are added / committed to the table, not when the record was appended to an uncommitted data file in the processing engine. Record creation and Iceberg commit can be minutes or even hours apart.

A row timestamp inherited from the snapshot timestamp has no overhead at the initial commit and has very minimal storage overhead during file rewrite. A per-row current_timestamp would have distinct values for every row and has more storage overhead.

OLTP databases deal with small row-level transactions. Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

[1] https://www.postgresql.org/docs/current/functions-datetime.html
[2] https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
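For reference, the PostgreSQL pattern under discussion looks roughly like this (table and column names are illustrative). Because CURRENT_TIMESTAMP is fixed at transaction start, every row inserted by the same transaction receives the same value:

```sql
-- Illustrative only: a column default based on CURRENT_TIMESTAMP.
CREATE TABLE events (
  id         BIGINT,
  payload    TEXT,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Both rows get the same created_at value: CURRENT_TIMESTAMP is evaluated at
-- transaction start, not per row.
BEGIN;
INSERT INTO events (id, payload) VALUES (1, 'a');
INSERT INTO events (id, payload) VALUES (2, 'b');
COMMIT;
```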
On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]> wrote:

> Micah, are 1 and 2 the same? 3 is covered by this proposal. To support the created_by timestamp, we would need to implement the following row lineage behavior:
> * Initially, it inherits from the snapshot timestamp.
> * During rewrite (like compaction), it should be persisted into data files.
> * During update, it needs to be carried over from the previous row. This is similar to the row_id carry-over for row updates.

Sorry for the shorthand. These are not the same:

1. Insertion time - time the row was inserted.
2. Created by - The system that created the record.
3. Updated by - The system that last updated the record.

Depending on the exact use case these might or might not have utility. I'm just wondering if there will be more examples like this in the future.

> The created_by column would likely incur significantly higher storage overhead compared to the updated_by column. As rows are updated over time, the cardinality for this column in data files can be high. Hence, the created_by column may not compress well. This is a similar problem for the row_id column. One side effect of enabling row lineage by default for V3 tables is the storage overhead of the row_id column after compaction, especially for narrow tables with few columns.

I agree. I think this analysis also shows that some consumers of Iceberg might not necessarily want to have all these columns, so we might want to make them configurable, rather than mandating them for all tables. Ryan's thought on default values seems like it would solve the issues I was raising.

Thanks,
Micah
On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:

> An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Since the use cases don't require an exact timestamp, this seems like the best solution to get what people want (an insertion timestamp) that has clear and well-defined behavior. Since `current_timestamp` is defined by the SQL spec, it makes sense to me that we could use it and have reasonable behavior.

I've talked with Anton about this before and maybe he'll jump in on this thread. I think that we may need to extend default values to include default value expressions, like `current_timestamp`, which is allowed by the SQL spec. That would solve the problem as well as some others (like `current_date` or `current_user`) and would not create a potentially misleading (and heavyweight) timestamp feature in the format.

> Also some environments may have stronger clock service, like Spanner TrueTime service.

Even in cases like this, commit retries can reorder commits and make timestamps out of order. I don't think that we should be making guarantees or even exposing metadata that people might mistake as having those guarantees.

On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:

Ryan, thanks a lot for the feedback!

Regarding the concern about reliable timestamps, we are not proposing using timestamps for ordering. With NTP in modern computers, they are generally reliable enough for the intended use cases. Also, some environments may have a stronger clock service, like the Spanner TrueTime service <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.

> joining to timestamps from the snapshots metadata table.

As you also mentioned, it depends on the snapshot history, which is often retained for only a few days due to performance reasons.

> embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata.

An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Also, the timestamp set during computation (like streaming ingestion or relatively long batch computation) doesn't capture the time the rows/files are added to the Iceberg table in a batch fashion.

> And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info

This is not really consumable by joining a regular table query with the catalog event log. I would also imagine the catalog event log is capped at a shorter retention (maybe a few months) compared to data retention (which could be a few years).
On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:

I don't think it is a good idea to expose timestamps at the row level. Timestamps in metadata that would be carried down to the row level already confuse people who expect them to be useful or reliable, rather than for debugging. I think extending this to the row level would only make the problem worse.

You can already get this information by projecting the last updated sequence number, which is reliable, and joining to timestamps from the snapshots metadata table. Of course, the drawback there is losing the timestamp information when snapshots expire, but since it isn't reliable anyway I'd be fine with that.

Some of the use cases, like auditing and compliance, are probably better served by embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata. And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info. I think that would be better than storing it at the row level.
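A sketch of the workaround Ryan describes. The `_last_updated_sequence_number` metadata column is the one mentioned above; `snapshot_archive` is an assumed user-maintained table that is periodically refreshed from snapshot metadata before snapshots expire:

```sql
-- snapshot_archive(sequence_number, committed_at) is maintained by the user,
-- e.g. by a scheduled job that copies snapshot timestamps out of table metadata.

-- Recover an approximate per-row timestamp by joining the lineage column
-- against the archived snapshot timestamps.
SELECT t.*, a.committed_at AS last_updated_at
FROM db.t t
JOIN snapshot_archive a
  ON t._last_updated_sequence_number = a.sequence_number;
```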
On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:

Hi,

For the V4 spec, I have a small proposal [1] to expose the row timestamp concept that can help with many use cases like temporal queries, latency tracking, TTL, auditing and compliance.

This *_last_updated_timestamp_ms* metadata column behaves very similarly to the *_last_updated_sequence_number* for row lineage.

- Initially, it inherits from the snapshot timestamp.
- During rewrite (like compaction), its values are persisted in the data files.

Would love to hear what you think.

Thanks,
Steven

[1] https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
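With such a column, the TTL use case from the proposal could be expressed directly against the table, without joining back to snapshot history. A sketch, assuming the proposed `_last_updated_timestamp_ms` column; the literal is an epoch-millis cutoff (for example, "now minus 30 days") computed by the caller or by an engine expression such as current_timestamp:

```sql
-- Sketch: row-level TTL driven by the proposed metadata column.
DELETE FROM db.t
WHERE _last_updated_timestamp_ms < 1767225600000;  -- cutoff in epoch milliseconds
```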
