Also, do we have a concrete plan for how to handle tables that would be upgraded to V4? What timestamp will we assign to existing rows?

On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <[email protected]> wrote:

If we ignore temporal queries that need strict snapshot boundaries and can't be solved completely using row timestamps in case of mutations, you mentioned other use cases where row timestamps may be helpful, like TTL and auditing. We can debate whether using CURRENT_TIMESTAMP() is enough for them, but I don't really see a point given that we already have row lineage in V3 and the storage overhead for one more field isn't likely to be noticeable. One of the problems with CURRENT_TIMESTAMP() is the required action by the user. Having a reliable row timestamp populated automatically is likely to be better, so +1.

On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:

Joining with snapshot history also has significant complexity. It requires retaining the entire snapshot history, probably with trimmed snapshot metadata. There are concerns about the size of the snapshot history for tables with frequent commits (like streaming ingestion). Do we maintain the unbounded trimmed snapshot history in the same table metadata, which could affect the table metadata.json size? Or store it separately somewhere (like in the catalog), which would require the complexity of a multi-entity transaction in the catalog?

On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <[email protected]> wrote:

I've gone back and forth on the inherited columns. I think the thing which keeps coming back to me is that I don't like that the only way to determine the timestamp associated with a row update/creation is to do a join back against table metadata. While that's doable, it feels user unfriendly.

On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]> wrote:

Anton, you are right that row-level deletes will be a problem for some of the mentioned use cases (like incremental processing). I have clarified the applicability of some use cases to "tables with inserts and updates only".

Right now, we are only tracking modification/commit time (not insertion time) in case of updates.

On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]> wrote:

I think there is clear consensus that making snapshot timestamps strictly increasing is a positive thing. I am also +1.

- How will row timestamps allow us to reliably implement incremental consumption independent of the snapshot retention, given that rows can be added AND removed in a particular time frame? How can we capture all changes by just looking at the latest snapshot?
- Some use cases in the doc need the insertion time and some need the last modification time. Do we plan to support both?
- What do we expect the behavior to be in UPDATE and MERGE operations?

To be clear: I am not opposed to this change, I just want to make sure I understand all use cases that we aim to address and what would be required in engines.

On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <[email protected]> wrote:

+1 for improving how the commit timestamps are assigned monotonically, since this requirement has emerged over multiple discussions like notifications, multi-table transactions, time travel accuracy, and row timestamps. It would be good to have a single consistent way to represent and assign timestamps that could be leveraged across multiple features.
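
A minimal sketch of the Lamport-style assignment that the revised proposal recommends (see Steven's Jan 15 message below): take the wall clock when it moves forward, otherwise bump past the last assigned value so snapshot timestamps never go backwards. Class and method names are hypothetical; a real catalog (e.g. the REST catalog server the proposal mentions) would have to persist the last assigned value per table so the guarantee survives restarts.

```java
import java.time.Clock;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical catalog-side sketch of Lamport-style commit timestamps: monotonic even
 * when the wall clock drifts backwards. Persistence of the last assigned value is
 * deliberately glossed over here.
 */
public class MonotonicCommitTimestamps {
  private final Clock clock;
  private final AtomicLong lastAssignedNanos = new AtomicLong(Long.MIN_VALUE);

  public MonotonicCommitTimestamps(Clock clock) {
    this.clock = clock;
  }

  /** Returns a strictly increasing commit timestamp in nanoseconds since the epoch. */
  public long nextCommitTimestampNanos() {
    Instant now = clock.instant();
    long nowNanos = now.getEpochSecond() * 1_000_000_000L + now.getNano();
    // max(wall clock, last + 1): uses real time when possible, stays monotonic otherwise
    return lastAssignedNanos.updateAndGet(last -> Math.max(nowNanos, last + 1));
  }
}
```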

On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:

Yeah, to add my perspective on that discussion, I think my primary concern is that people expect timestamps to be monotonic, and if they aren't then a `_last_update_timestamp` field just makes the problem worse. But it is _nice_ to have row-level timestamps. So I would be okay if we revisit how we assign commit timestamps and improve it so that you get monotonic behavior.

On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:

We had an offline discussion with Ryan. I revised the proposal as follows.

1. V4 would require writers to generate *monotonic* snapshot timestamps. The proposal doc has a section that describes a recommended implementation using Lamport timestamps.
2. Expose a *last_update_timestamp* metadata column that inherits from the snapshot timestamp.

This is a relatively low-friction change that can fix the time travel problem and enable use cases like latency tracking, temporal queries, TTL, and auditing.

There is no accuracy requirement on the timestamp values. In practice, modern servers with NTP have pretty reliable wall clocks. E.g., the Java library implements a validation <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377> that protects against backward clock drift of up to one minute for snapshot timestamps. I don't think we have heard many complaints of commit failures due to that clock drift validation.

Would appreciate feedback on the revised proposal.

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0

Thanks,
Steven

On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]> wrote:

Steven, I was referring to the fact that CURRENT_TIMESTAMP() is usually evaluated quite early in engines, so we could theoretically have another expression closer to the commit time. You are right, though, it won't be the actual commit time given that we have to write it into the files. Also, I don't think generating a timestamp for a row as it is being written is going to be beneficial. To sum up, expression-based defaults would allow us to capture the time the transaction or write starts, but not the actual commit time.

Russell, if the goal is to know what happened to the table in a given time frame, isn't the changelog scan the way to go? It would assign commit ordinals based on lineage and include row-level diffs. How would you be able to determine changes with row timestamps by just looking at the latest snapshot?

It does seem promising to make snapshot timestamps strictly increasing to avoid ambiguity during time travel.

On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:

> Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense.

I'm not sure I get it then. If we want monotonically increasing times, but they don't have to be real times, then how do you know what notion of "time" you care about for these filters? Or to put it another way, how do you know that your "before" and "after" times are reasonable? If the boundaries of these time queries can move around a bit, by how much?

It seems to me that row IDs can play an important role here because you have the order guarantee that we seem to want for this use case: if snapshot A was committed before snapshot B, then the rows from A have row IDs that are always less than the row IDs of B. The problem is that we don't know where those row IDs start and end once A and B are no longer tracked. Using a "timestamp" seems to work, but I still worry that without reliable timestamps that correspond with some guarantee to real timestamps, we are creating a feature that seems reliable but isn't.

I'm somewhat open to the idea of introducing a snapshot timestamp that the catalog guarantees is monotonically increasing. But if we did that, wouldn't we still need to know the association between these timestamps and snapshots after the snapshot metadata expires? My mental model is that this would be used to look for data that arrived, say, 3 weeks ago on Dec 24th. Since the snapshot metadata is no longer around, we could use the row timestamp to find those rows. But how do we know that the snapshot timestamps correspond to the actual timestamp range of Dec 24th? Is it just "close enough" as long as we don't have out-of-order timestamps? This is what I mean by needing to keep track of the association between timestamps and snapshots after the metadata expires. It seems like you either need to keep track of what the catalog's clock was for events you care about, or you don't really care about exact timestamps.

On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <[email protected]> wrote:

The key goal here is the ability to answer the question "what happened to the table in some time window (before < t < after)?" Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense. Downstream applications (from what I know) are mostly sensitive to getting discrete and well-defined answers to this question, like:

1 < t < 2 should be exclusive of
2 < t < 3 should be exclusive of
3 < t < 4

And the union of these should be the same as the query asking for 1 < t < 4.

Currently this is not possible because we have no guarantee of ordering in our timestamps:

Snapshots: A -> B -> C
Sequence numbers: 50 -> 51 -> 52
Timestamps: 3 -> 1 -> 2

This makes time travel always a little wrong to start with.

The Java implementation only allows one minute of negative time on commit, so we actually kind of do have this as a "light monotonicity" requirement, but as noted above there is no spec requirement for this. While we do have the sequence number and row id, we still don't have a stable way of associating these with a consistent time in an engine-independent way.

Ideally we just want to have one consistent way of answering the question "what did the table look like at time t", which I think we get by adding a new field that is a timestamp, set by the catalog close to commit time, that always goes up.

I'm not sure we can really do this with an engine expression since they won't know when the data is actually committed when writing files?
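
A sketch of the window-query behavior Russell describes above, written against the `_last_update_timestamp` metadata column from the proposal (the column is not in the spec yet, and `db.tbl` is a placeholder). Adjacent windows share a boundary but not rows, and their union matches the single wider query only if snapshot timestamps are monotonic, which is exactly the guarantee being proposed; as Anton notes, rows deleted inside a window would still not show up this way.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimeWindowQueries {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("time-windows").getOrCreate();

    // Rows whose last update falls in (t1, t2]; the next window starts where this one ends.
    Dataset<Row> w1 = spark.sql(
        "SELECT * FROM db.tbl "
            + "WHERE _last_update_timestamp > TIMESTAMP '2026-01-01 00:00:00' "
            + "AND _last_update_timestamp <= TIMESTAMP '2026-01-02 00:00:00'");
    Dataset<Row> w2 = spark.sql(
        "SELECT * FROM db.tbl "
            + "WHERE _last_update_timestamp > TIMESTAMP '2026-01-02 00:00:00' "
            + "AND _last_update_timestamp <= TIMESTAMP '2026-01-03 00:00:00'");
    Dataset<Row> whole = spark.sql(
        "SELECT * FROM db.tbl "
            + "WHERE _last_update_timestamp > TIMESTAMP '2026-01-01 00:00:00' "
            + "AND _last_update_timestamp <= TIMESTAMP '2026-01-03 00:00:00'");

    // With monotonic snapshot timestamps the adjacent windows are disjoint and their union
    // matches the single wider query; without that guarantee, a commit's rows can land
    // outside the window its timestamp suggests, which is the bug discussed above.
    long unionCount = w1.union(w2).count();
    long wholeCount = whole.count();
    System.out.println(unionCount == wholeCount);
  }
}
```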

On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <[email protected]> wrote:

This seems like a lot of new complexity in the format. I would like us to explore whether we can build the considered use cases on top of expression-based defaults instead.

We already plan to support CURRENT_TIMESTAMP() and similar functions that are part of the SQL standard definition for default values. This would provide us with a way to know the relative row order. True, this usually will represent the start of the operation. We may define COMMIT_TIMESTAMP() or a similar expression for the actual commit time, if there are use cases that need that. Plus, we may explore an approach similar to MySQL that allows users to reset the default value on update.

- Anton

On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <[email protected]> wrote:

I think this is the right step forward. Our current "timestamp" definition is too ambiguous to be useful, so establishing a well-defined and monotonic timestamp could be really great. I also like the ability for rows to know this value without having to rely on snapshot information, which can be expired.

On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]> wrote:

Hi all,

I have revised the row timestamp proposal with the following changes.
* a new commit_timestamp field in snapshot metadata that has nanosecond precision
* this optional field is only set by the REST catalog server
* it needs to be monotonic (e.g. implemented using a Lamport timestamp)

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh

Thanks,
Steven

On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:

Thanks for the clarification, Ryan.

For long-running streaming jobs that commit periodically, it is difficult to establish the constant value of current_timestamp across all writer tasks for each commit cycle. I guess streaming writers may just need to write the wall clock time when appending a row to a data file for the default value of current_timestamp.

On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:

I don't think that every row would have a different value. That would be up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP` into the plan and then replace it with a constant, resulting in a consistent value for all rows.

You're right that this would not necessarily be the commit time. But neither is the commit timestamp from Iceberg's snapshot. I'm not sure how we are going to define "good enough" for this purpose. I think at least `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it is handled in engines. And if you want the Iceberg timestamp, then use a periodic query of the snapshots table to keep track of them in a table you can join to. I don't think this rises to the need for a table feature unless we can guarantee that it is correct.

On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:

> Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

Let me clarify my last comment.

created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP

Since Postgres current_timestamp captures the transaction start time, all rows added in the same insert transaction would have the same value as the transaction timestamp with the column definition above.

If we extend a similar semantic to Iceberg, should all rows added in the same Iceberg transaction/snapshot have the same timestamp?

Ryan, I understand your comment about using the current_timestamp expression as a column default value: you were thinking that the engine would set the column value to the wall clock time when appending a row to a data file, right? Almost every row would have a different timestamp value.
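
For reference, a small sketch of the Postgres semantics Steven refers to: within one transaction, CURRENT_TIMESTAMP is fixed at transaction start, so every row inserted by that transaction receives the same default. The connection details, table, and column names below are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TransactionStartTimestamp {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost/demo", "demo", "demo")) {
      conn.setAutoCommit(false);
      try (Statement stmt = conn.createStatement()) {
        stmt.execute("CREATE TABLE events ("
            + "id BIGINT, "
            + "created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)");
        // Both inserts run in one transaction, so both rows get the same created_at:
        // the transaction start time, not the per-statement or per-row wall clock.
        stmt.execute("INSERT INTO events (id) VALUES (1)");
        stmt.execute("INSERT INTO events (id) VALUES (2)");
      }
      conn.commit();
    }
  }
}
```

The question raised in the thread is whether Iceberg should mirror this behavior: one timestamp per snapshot rather than one per row.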

On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:

The `current_timestamp` expression may not always carry the right semantics for the use cases. E.g., latency tracking is interested in when records are added / committed to the table, not when the record was appended to an uncommitted data file in the processing engine. Record creation and the Iceberg commit can be minutes or even hours apart.

A row timestamp inherited from the snapshot timestamp has no overhead at the initial commit and very minimal storage overhead during file rewrite. A per-row current_timestamp would have distinct values for every row and has more storage overhead.

OLTP databases deal with small row-level transactions. Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

[1] https://www.postgresql.org/docs/current/functions-datetime.html
[2] https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp

On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]> wrote:

> Micah, are 1 and 2 the same? 3 is covered by this proposal. To support the created_by timestamp, we would need to implement the following row lineage behavior:
> * Initially, it inherits from the snapshot timestamp
> * during rewrite (like compaction), it should be persisted into data files
> * during update, it needs to be carried over from the previous row. This is similar to the row_id carry-over for row updates.

Sorry for the shorthand. These are not the same:

1. Insertion time - the time the row was inserted.
2. Created by - the system that created the record.
3. Updated by - the system that last updated the record.

Depending on the exact use case these might or might not have utility. I'm just wondering if there will be more examples like this in the future.

> A created_by column would likely incur significantly higher storage overhead compared to the updated_by column. As rows are updated over time, the cardinality of this column in data files can be high. Hence, the created_by column may not compress well. This is a similar problem for the row_id column. One side effect of enabling row lineage by default for V3 tables is the storage overhead of the row_id column after compaction, especially for narrow tables with few columns.

I agree. I think this analysis also shows that some consumers of Iceberg might not necessarily want to have all these columns, so we might want to make them configurable, rather than mandating them for all tables. Ryan's thought on default values seems like it would solve the issues I was raising.

Thanks,
Micah

On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:

> An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Since the use cases don't require an exact timestamp, this seems like the best solution to get what people want (an insertion timestamp) that has clear and well-defined behavior. Since `current_timestamp` is defined by the SQL spec, it makes sense to me that we could use it and have reasonable behavior.

I've talked with Anton about this before and maybe he'll jump in on this thread. I think that we may need to extend default values to include default value expressions, like `current_timestamp`, which is allowed by the SQL spec. That would solve the problem as well as some others (like `current_date` or `current_user`) and would not create a potentially misleading (and heavyweight) timestamp feature in the format.

> Also some environments may have a stronger clock service, like the Spanner TrueTime service.

Even in cases like this, commit retries can reorder commits and make timestamps out of order. I don't think that we should be making guarantees or even exposing metadata that people might mistake as having those guarantees.
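
A sketch of the default-value-expression route Ryan outlines above. Whether Iceberg tables and a given engine accept an expression (rather than a literal) as a column default today is exactly the open question in the thread, so the DDL below is aspirational and the table and column names are invented.

```java
import org.apache.spark.sql.SparkSession;

public class DefaultExpressionColumns {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("default-expressions").getOrCreate();

    // Aspirational DDL: column defaults that are expressions, not literals. Engine and
    // format support for this is what the thread is debating, so treat as illustrative.
    spark.sql("CREATE TABLE db.audited_events ("
        + "id BIGINT, "
        + "payload STRING, "
        + "inserted_at TIMESTAMP DEFAULT current_timestamp(), "
        + "inserted_by STRING DEFAULT current_user()"
        + ") USING iceberg");

    // Writers that omit the audit columns get the defaults, typically evaluated by the
    // engine once per statement rather than once per row.
    spark.sql("INSERT INTO db.audited_events (id, payload) VALUES (1, 'hello')");
  }
}
```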

On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:

Ryan, thanks a lot for the feedback!

Regarding the concern about reliable timestamps, we are not proposing using timestamps for ordering. With NTP in modern computers, they are generally reliable enough for the intended use cases. Also, some environments may have a stronger clock service, like the Spanner TrueTime service <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.

> joining to timestamps from the snapshots metadata table.

As you also mentioned, it depends on the snapshot history, which is often retained for a few days due to performance reasons.

> embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata.

An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Also, a timestamp set during computation (like streaming ingestion or a relatively long batch computation) doesn't capture the time the rows/files are added to the Iceberg table in a batch fashion.

> And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info

This is not really consumable by joining a regular table query with the catalog event log. I would also imagine the catalog event log is capped at a shorter retention (maybe a few months) compared to data retention (which could be a few years).

On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:

I don't think it is a good idea to expose timestamps at the row level. Timestamps in metadata that would be carried down to the row level already confuse people that expect them to be useful or reliable, rather than for debugging. I think extending this to the row level would only make the problem worse.

You can already get this information by projecting the last updated sequence number, which is reliable, and joining to timestamps from the snapshots metadata table. Of course, the drawback there is losing the timestamp information when snapshots expire, but since it isn't reliable anyway I'd be fine with that.

Some of the use cases, like auditing and compliance, are probably better served by embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata. And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info. I think that would be better than storing it at the row level.

On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:

Hi,

For the V4 spec, I have a small proposal [1] to expose a row timestamp concept that can help with many use cases like temporal queries, latency tracking, TTL, auditing, and compliance.

This *_last_updated_timestamp_ms* metadata column behaves very similarly to the *_last_updated_sequence_number* column for row lineage.

- Initially, it inherits from the snapshot timestamp.
- During rewrite (like compaction), its values are persisted in the data files.

Would love to hear what you think.

Thanks,
Steven

[1] https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
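
For completeness, a sketch of the join-back workaround Ryan mentions above: project the v3 row-lineage column `_last_updated_sequence_number` and join it to snapshot commit times. This assumes the snapshots metadata table exposes a `sequence_number` column next to `committed_at`; if your engine or version does not, the snapshot-to-sequence-number mapping has to be maintained separately, and the timestamp is lost once the snapshot expires, which is the limitation driving this thread. `db.tbl` is a placeholder.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RowTimestampViaJoin {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("row-timestamp-join").getOrCreate();

    // Attach a commit timestamp to each row by joining the row-lineage sequence number
    // to the snapshots metadata table; only works while the snapshot is still retained.
    Dataset<Row> rowsWithTimestamps = spark.sql(
        "SELECT t.*, s.committed_at AS last_updated_at "
            + "FROM db.tbl t "
            + "JOIN db.tbl.snapshots s "
            + "ON t._last_updated_sequence_number = s.sequence_number");

    rowsWithTimestamps.show();
  }
}
```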
