Sounds good to me

On Mon, Jan 26, 2026 at 11:59 AM Anton Okolnychyi <[email protected]>
wrote:

> Cool, sounds like a plan then? Thanks for answering all the questions,
> Steven!
>
> On Thu, Jan 22, 2026 at 6:29 PM Steven Wu <[email protected]> wrote:
>
>> For row timestamp inheritance to work, I would need to implement the
>> plumbing. So I would imagine existing rows would have null values because
>> the inheritance plumbing was not there yet. This would be consistent with
>> upgrade behavior for the V3 row lineage:
>> https://iceberg.apache.org/spec/#row-lineage-for-upgraded-tables.
>>
>> On Thu, Jan 22, 2026 at 4:09 PM Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> Also, do we have a concrete plan for how to handle tables that would be
>>> upgraded to V4? What timestamp will we assign to existing rows?
>>>
>>> On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <[email protected]>
>>> wrote:
>>>
>>>> If we ignore temporal queries that need strict snapshot boundaries and
>>>> can't be solved completely using row timestamps in case of mutations, you
>>>> mentioned other use cases where row timestamps may be helpful, like TTL and
>>>> auditing. We can debate whether using CURRENT_TIMESTAMP() is enough for
>>>> them, but I don't really see a point given that we already have row lineage
>>>> in V3 and the storage overhead for one more field isn't likely to be
>>>> noticeable. One of the problems with CURRENT_TIMESTAMP() is that it requires
>>>> action by the user. Having a reliable row timestamp populated automatically
>>>> is likely to be better, so +1.
>>>>
>>>> On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> Joining with snapshot history also has significant complexity. It
>>>>> requires retaining the entire snapshot history, probably with trimmed
>>>>> snapshot metadata. There are concerns about the size of the snapshot history
>>>>> for tables with frequent commits (like streaming ingestion). Do we maintain
>>>>> the unbounded trimmed snapshot history in the same table metadata, which
>>>>> could affect the metadata.json size? Or do we store it separately somewhere
>>>>> (like in the catalog), which would require the complexity of multi-entity
>>>>> transactions in the catalog?
>>>>>
>>>>>
>>>>> On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I've gone back and forth on the inherited columns. I think the thing
>>>>>> that keeps coming back to me is that I don't
>>>>>> like that the only way to determine the timestamp associated with a
>>>>>> row update/creation is to do a join back
>>>>>> against table metadata. While that's doable, it feels user-unfriendly.
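>>>>>>
>>>>>> For illustration, a rough sketch of what that join-back looks like with the
>>>>>> Java API today (just a sketch, assuming the relevant snapshots are still
>>>>>> retained; the class and method names here are made up for the example):
>>>>>>
>>>>>>     import java.util.Map;
>>>>>>     import java.util.TreeMap;
>>>>>>     import org.apache.iceberg.Snapshot;
>>>>>>     import org.apache.iceberg.Table;
>>>>>>
>>>>>>     public class RowTimestampLookup {
>>>>>>       // Map sequence number -> commit timestamp from the snapshots that are
>>>>>>       // still retained in table metadata. Once a snapshot expires, the
>>>>>>       // mapping (and therefore the row's timestamp) is lost.
>>>>>>       static Map<Long, Long> commitTimesBySequenceNumber(Table table) {
>>>>>>         Map<Long, Long> commitTimes = new TreeMap<>();
>>>>>>         for (Snapshot snapshot : table.snapshots()) {
>>>>>>           commitTimes.put(snapshot.sequenceNumber(), snapshot.timestampMillis());
>>>>>>         }
>>>>>>         return commitTimes;
>>>>>>       }
>>>>>>
>>>>>>       // Resolve a row's projected _last_updated_sequence_number to an
>>>>>>       // approximate commit timestamp, or null if the snapshot has expired.
>>>>>>       static Long lastUpdatedMillis(Map<Long, Long> commitTimes, long lastUpdatedSeq) {
>>>>>>         return commitTimes.get(lastUpdatedSeq);
>>>>>>       }
>>>>>>     }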
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Anton, you are right that the row-level deletes will be a problem
>>>>>>> for some of the mentioned use cases (like incremental processing). I 
>>>>>>> have
>>>>>>> clarified the applicability of some use cases to "tables with inserts 
>>>>>>> and
>>>>>>> updates only".
>>>>>>>
>>>>>>> Right now, we are only tracking modification/commit time (not
>>>>>>> insertion time) in case of updates.
>>>>>>>
>>>>>>> On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I think there is clear consensus that making snapshot timestamps
>>>>>>>> strictly increasing is a positive thing. I am also +1.
>>>>>>>>
>>>>>>>> - How will row timestamps allow us to reliably implement
>>>>>>>> incremental consumption independent of the snapshot retention given 
>>>>>>>> that
>>>>>>>> rows can be added AND removed in a particular time frame? How can we
>>>>>>>> capture all changes by just looking at the latest snapshot?
>>>>>>>> - Some use cases in the doc need the insertion time and some need
>>>>>>>> the last modification time. Do we plan to support both?
>>>>>>>> - What do we expect the behavior to be in UPDATE and MERGE
>>>>>>>> operations?
>>>>>>>>
>>>>>>>> To be clear: I am not opposed to this change, just want to make
>>>>>>>> sure I understand all use cases that we aim to address and what would 
>>>>>>>> be
>>>>>>>> required in engines.
>>>>>>>>
>>>>>>>> On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> +1 for improving how commit timestamps are assigned so that they are
>>>>>>>>> monotonic, since this requirement has emerged over multiple
>>>>>>>>> discussions like notifications, multi-table transactions, time travel
>>>>>>>>> accuracy and row timestamps. It would be good to have a single consistent
>>>>>>>>> way to represent and assign timestamps that could be leveraged across
>>>>>>>>> multiple features.
>>>>>>>>>
>>>>>>>>> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Yeah, to add my perspective on that discussion, I think my
>>>>>>>>>> primary concern is that people expect timestamps to be monotonic and 
>>>>>>>>>> if
>>>>>>>>>> they aren't then a `_last_update_timestamp` field just makes the 
>>>>>>>>>> problem
>>>>>>>>>> worse. But it is _nice_ to have row-level timestamps. So I would be 
>>>>>>>>>> okay if
>>>>>>>>>> we revisit how we assign commit timestamps and improve it so that 
>>>>>>>>>> you get
>>>>>>>>>> monotonic behavior.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> We had an offline discussion with Ryan. I revised the proposal
>>>>>>>>>>> as follows.
>>>>>>>>>>>
>>>>>>>>>>> 1. V4 would require writers to generate *monotonic* snapshot
>>>>>>>>>>> timestamps. The proposal doc has a section that describes a recommended
>>>>>>>>>>> implementation using Lamport timestamps.
>>>>>>>>>>> 2. Expose *last_update_timestamp* metadata column that inherits
>>>>>>>>>>> from snapshot timestamp
>>>>>>>>>>>
>>>>>>>>>>> This is a relatively low-friction change that can fix the time
>>>>>>>>>>> travel problem and enable use cases like latency tracking, temporal 
>>>>>>>>>>> query,
>>>>>>>>>>> TTL, auditing.
>>>>>>>>>>>
>>>>>>>>>>> There is no accuracy requirement on the timestamp values. In
>>>>>>>>>>> practice, modern servers with NTP have pretty reliable wall clocks. E.g.,
>>>>>>>>>>> the Java library implements this validation
>>>>>>>>>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
>>>>>>>>>>> that protects against backward clock drift of up to one minute for snapshot
>>>>>>>>>>> timestamps. I don't think we have heard many complaints about commit
>>>>>>>>>>> failures due to that clock drift validation.
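>>>>>>>>>>>
>>>>>>>>>>> To make the recommendation concrete, here is a minimal sketch of what a
>>>>>>>>>>> Lamport-style monotonic timestamp assignment could look like on the commit
>>>>>>>>>>> path (hypothetical code, not from the proposal doc; the one-minute tolerance
>>>>>>>>>>> mirrors the validation linked above):
>>>>>>>>>>>
>>>>>>>>>>>     import java.util.concurrent.atomic.AtomicLong;
>>>>>>>>>>>
>>>>>>>>>>>     public class MonotonicCommitTimestamps {
>>>>>>>>>>>       private static final long MAX_BACKWARD_DRIFT_MS = 60_000L;
>>>>>>>>>>>
>>>>>>>>>>>       private final AtomicLong lastAssignedMs = new AtomicLong(Long.MIN_VALUE);
>>>>>>>>>>>
>>>>>>>>>>>       public long nextCommitTimestampMs() {
>>>>>>>>>>>         long wallClockMs = System.currentTimeMillis();
>>>>>>>>>>>         return lastAssignedMs.updateAndGet(last -> {
>>>>>>>>>>>           if (last != Long.MIN_VALUE && wallClockMs < last - MAX_BACKWARD_DRIFT_MS) {
>>>>>>>>>>>             // The clock moved backwards by more than the tolerated drift;
>>>>>>>>>>>             // fail the commit instead of assigning a misleading timestamp.
>>>>>>>>>>>             throw new IllegalStateException("Clock moved backwards by more than 1 minute");
>>>>>>>>>>>           }
>>>>>>>>>>>           // Lamport-style bump: use the wall clock, but never go backwards.
>>>>>>>>>>>           return Math.max(wallClockMs, last + 1);
>>>>>>>>>>>         });
>>>>>>>>>>>       }
>>>>>>>>>>>     }
>>>>>>>>>>>
>>>>>>>>>>> The same idea works at nanosecond precision; the important property is only
>>>>>>>>>>> that assigned values never decrease across commits.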
>>>>>>>>>>>
>>>>>>>>>>> Would appreciate feedback on the revised proposal.
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Steven
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is
>>>>>>>>>>>> usually evaluated quite early in engines so we could theoretically 
>>>>>>>>>>>> have
>>>>>>>>>>>> another expression closer to the commit time. You are right, 
>>>>>>>>>>>> though, it
>>>>>>>>>>>> won't be the actual commit time given that we have to write it 
>>>>>>>>>>>> into the
>>>>>>>>>>>> files. Also, I don't think generating a timestamp for a row as it 
>>>>>>>>>>>> is being
>>>>>>>>>>>> written is going to be beneficial. To sum up, expression-based 
>>>>>>>>>>>> defaults
>>>>>>>>>>>> would allow us to capture the time the transaction or write 
>>>>>>>>>>>> starts, but not
>>>>>>>>>>>> the actual commit time.
>>>>>>>>>>>>
>>>>>>>>>>>> Russell, if the goal is to know what happened to the table in a
>>>>>>>>>>>> given time frame, isn't the changelog scan the way to go? It would 
>>>>>>>>>>>> assign
>>>>>>>>>>>> commit ordinals based on lineage and include row-level diffs. How 
>>>>>>>>>>>> would you
>>>>>>>>>>>> be able to determine changes with row timestamps by just looking 
>>>>>>>>>>>> at the
>>>>>>>>>>>> latest snapshot?
>>>>>>>>>>>>
>>>>>>>>>>>> It does seem promising to make snapshot timestamps strictly
>>>>>>>>>>>> increasing to avoid ambiguity during time travel.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > Whether or not "t" is an atomic clock time is not as
>>>>>>>>>>>>> important as the query between time bounds making sense.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure I get it then. If we want monotonically
>>>>>>>>>>>>> increasing times, but they don't have to be real times then how 
>>>>>>>>>>>>> do you know
>>>>>>>>>>>>> what notion of "time" you care about for these filters? Or to put 
>>>>>>>>>>>>> it
>>>>>>>>>>>>> another way, how do you know that your "before" and "after" times 
>>>>>>>>>>>>> are
>>>>>>>>>>>>> reasonable? If the boundaries of these time queries can move 
>>>>>>>>>>>>> around a bit,
>>>>>>>>>>>>> by how much?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It seems to me that row IDs can play an important role here
>>>>>>>>>>>>> because you have the order guarantee that we seem to want for 
>>>>>>>>>>>>> this use
>>>>>>>>>>>>> case: if snapshot A was committed before snapshot B, then the 
>>>>>>>>>>>>> rows from A
>>>>>>>>>>>>> have row IDs that are always less than the row IDs of B. The 
>>>>>>>>>>>>> problem is
>>>>>>>>>>>>> that we don't know where those row IDs start and end once A and B 
>>>>>>>>>>>>> are no
>>>>>>>>>>>>> longer tracked. Using a "timestamp" seems to work, but I still 
>>>>>>>>>>>>> worry that
>>>>>>>>>>>>> without reliable timestamps that correspond with some guarantee 
>>>>>>>>>>>>> to real
>>>>>>>>>>>>> timestamps, we are creating a feature that seems reliable but 
>>>>>>>>>>>>> isn't.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm somewhat open to the idea of introducing a snapshot
>>>>>>>>>>>>> timestamp that the catalog guarantees is monotonically 
>>>>>>>>>>>>> increasing. But if
>>>>>>>>>>>>> we did that, wouldn't we still need to know the association 
>>>>>>>>>>>>> between these
>>>>>>>>>>>>> timestamps and snapshots after the snapshot metadata expires? My 
>>>>>>>>>>>>> mental
>>>>>>>>>>>>> model is that this would be used to look for data that arrived, 
>>>>>>>>>>>>> say, 3
>>>>>>>>>>>>> weeks ago on Dec 24th. Since the snapshots metadata is no longer 
>>>>>>>>>>>>> around we
>>>>>>>>>>>>> could use the row timestamp to find those rows. But how do we 
>>>>>>>>>>>>> know that the
>>>>>>>>>>>>> snapshot timestamps correspond to the actual timestamp range of 
>>>>>>>>>>>>> Dec 24th?
>>>>>>>>>>>>> Is it just "close enough" as long as we don't have out of order 
>>>>>>>>>>>>> timestamps?
>>>>>>>>>>>>> This is what I mean by needing to keep track of the association 
>>>>>>>>>>>>> between
>>>>>>>>>>>>> timestamps and snapshots after the metadata expires. Seems like 
>>>>>>>>>>>>> you either
>>>>>>>>>>>>> need to keep track of what the catalog's clock was for events you 
>>>>>>>>>>>>> care
>>>>>>>>>>>>> about, or you don't really care about exact timestamps.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The key goal here is the ability to answer the question "what
>>>>>>>>>>>>>> happened to the table in some time window (before < t < after)?"
>>>>>>>>>>>>>> Whether or not "t" is an atomic clock time is not as
>>>>>>>>>>>>>> important as the query between time bounds making sense.
>>>>>>>>>>>>>> Downstream applications (from what I know) are mostly
>>>>>>>>>>>>>> sensitive to getting discrete and well defined answers to
>>>>>>>>>>>>>> this question like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1 < t < 2 should be exclusive of
>>>>>>>>>>>>>> 2 < t < 3 should be exclusive of
>>>>>>>>>>>>>> 3 < t < 4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And the union of these should be the same as the query asking
>>>>>>>>>>>>>> for 1 < t < 4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently this is not possible because we have no
>>>>>>>>>>>>>> guarantee of ordering in our timestamps
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Snapshots
>>>>>>>>>>>>>> A -> B -> C
>>>>>>>>>>>>>> Sequence numbers
>>>>>>>>>>>>>> 50 -> 51 ->  52
>>>>>>>>>>>>>> Timestamp
>>>>>>>>>>>>>> 3 -> 1 -> 2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This makes time travel always a little wrong to start with.
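>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To make that concrete, here is a small self-contained sketch (hypothetical
>>>>>>>>>>>>>> resolution logic, roughly what timestamp-based time travel does, not the
>>>>>>>>>>>>>> actual Iceberg implementation) showing how "as of t" becomes ambiguous when
>>>>>>>>>>>>>> timestamps are not monotonic:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     import java.util.List;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     public class TimeTravelAmbiguity {
>>>>>>>>>>>>>>       record Snap(String name, long timestampMs) {}
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       // Pick the latest snapshot in commit order whose timestamp is <= t.
>>>>>>>>>>>>>>       static Snap asOf(List<Snap> commitOrder, long t) {
>>>>>>>>>>>>>>         Snap result = null;
>>>>>>>>>>>>>>         for (Snap s : commitOrder) {
>>>>>>>>>>>>>>           if (s.timestampMs() <= t) {
>>>>>>>>>>>>>>             result = s;
>>>>>>>>>>>>>>           }
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>         return result;
>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       public static void main(String[] args) {
>>>>>>>>>>>>>>         // Commit order A -> B -> C with timestamps 3 -> 1 -> 2, as above.
>>>>>>>>>>>>>>         List<Snap> commits = List.of(new Snap("A", 3), new Snap("B", 1), new Snap("C", 2));
>>>>>>>>>>>>>>         System.out.println(asOf(commits, 1)); // B, yet B already contains A's data stamped "3"
>>>>>>>>>>>>>>         System.out.println(asOf(commits, 2)); // C, so 1 < t < 2 and 2 < t < 3 don't partition cleanly
>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>     }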
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The Java implementation only allows one minute of negative
>>>>>>>>>>>>>> time on commit, so we actually kind of do have this as a
>>>>>>>>>>>>>> "light monotonicity" requirement, but as noted above there is
>>>>>>>>>>>>>> no spec requirement for this. While we do have sequence
>>>>>>>>>>>>>> number and row id, we still don't have a stable way of
>>>>>>>>>>>>>> associating these with a consistent time in an engine-independent way.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ideally we just want to have one consistent way of answering
>>>>>>>>>>>>>> the question "what did the table look like at time t"
>>>>>>>>>>>>>> which I think we get by adding in a new field that is a
>>>>>>>>>>>>>> timestamp, set by the Catalog close to commit time,
>>>>>>>>>>>>>> that always goes up.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure we can really do this with an engine expression
>>>>>>>>>>>>>> since they won't know when the data is actually committed
>>>>>>>>>>>>>> when writing files?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This seems like a lot of new complexity in the format. I
>>>>>>>>>>>>>>> would like us to explore whether we can build the considered 
>>>>>>>>>>>>>>> use cases on
>>>>>>>>>>>>>>> top of expression-based defaults instead.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar
>>>>>>>>>>>>>>> functions that are part of the SQL standard definition for 
>>>>>>>>>>>>>>> default values.
>>>>>>>>>>>>>>> This would provide us a way to know the relative row order. 
>>>>>>>>>>>>>>> True, this
>>>>>>>>>>>>>>> usually will represent the start of the operation. We may define
>>>>>>>>>>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual 
>>>>>>>>>>>>>>> commit time, if
>>>>>>>>>>>>>>> there are use cases that need that. Plus, we may explore an 
>>>>>>>>>>>>>>> approach
>>>>>>>>>>>>>>> similar to MySQL that allows users to reset the default value 
>>>>>>>>>>>>>>> on update.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think this is the right step forward. Our current
>>>>>>>>>>>>>>>> "timestamp" definition is too ambiguous to be useful, so establishing
>>>>>>>>>>>>>>>> a well-defined and monotonic timestamp could be really
>>>>>>>>>>>>>>>> great. I also like the ability for rows to know this value without
>>>>>>>>>>>>>>>> having to rely on snapshot information, which can be
>>>>>>>>>>>>>>>> expired.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have revised the row timestamp proposal with the
>>>>>>>>>>>>>>>>> following changes.
>>>>>>>>>>>>>>>>> * a new commit_timestamp field in snapshot metadata that
>>>>>>>>>>>>>>>>> has nanosecond precision.
>>>>>>>>>>>>>>>>> * this optional field is only set by the REST catalog
>>>>>>>>>>>>>>>>> server
>>>>>>>>>>>>>>>>> * it needs to be monotonic (e.g., implemented using a Lamport
>>>>>>>>>>>>>>>>> timestamp)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For long-running streaming jobs that commit periodically,
>>>>>>>>>>>>>>>>>> it is difficult to establish the constant value of 
>>>>>>>>>>>>>>>>>> current_timestamp across
>>>>>>>>>>>>>>>>>> all writer tasks for each commit cycle. I guess streaming 
>>>>>>>>>>>>>>>>>> writers may just
>>>>>>>>>>>>>>>>>> need to write the wall clock time when appending a row to a 
>>>>>>>>>>>>>>>>>> data file for
>>>>>>>>>>>>>>>>>> the default value of current_timestamp.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't think that every row would have a different
>>>>>>>>>>>>>>>>>>> value. That would be up to the engine, but I would expect 
>>>>>>>>>>>>>>>>>>> engines to insert
>>>>>>>>>>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with 
>>>>>>>>>>>>>>>>>>> a constant,
>>>>>>>>>>>>>>>>>>> resulting in a consistent value for all rows.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> You're right that this would not necessarily be the
>>>>>>>>>>>>>>>>>>> commit time. But neither is the commit timestamp from 
>>>>>>>>>>>>>>>>>>> Iceberg's snapshot.
>>>>>>>>>>>>>>>>>>> I'm not sure how we are going to define "good enough" for 
>>>>>>>>>>>>>>>>>>> this purpose. I
>>>>>>>>>>>>>>>>>>> think at least `CURRENT_TIMESTAMP` has reliable and known 
>>>>>>>>>>>>>>>>>>> behavior when you
>>>>>>>>>>>>>>>>>>> look at how it is handled in engines. And if you want the 
>>>>>>>>>>>>>>>>>>> Iceberg
>>>>>>>>>>>>>>>>>>> timestamp, then use a periodic query of the snapshots table 
>>>>>>>>>>>>>>>>>>> to keep track
>>>>>>>>>>>>>>>>>>> of them in a table you can join to. I don't think this 
>>>>>>>>>>>>>>>>>>> rises to the need
>>>>>>>>>>>>>>>>>>> for a table feature unless we can guarantee that it is 
>>>>>>>>>>>>>>>>>>> correct.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > Postgres `current_timestamp` captures the
>>>>>>>>>>>>>>>>>>>> transaction start time [1, 2]. Should we extend the same 
>>>>>>>>>>>>>>>>>>>> semantic to
>>>>>>>>>>>>>>>>>>>> Iceberg: all rows added in the same snapshot should have 
>>>>>>>>>>>>>>>>>>>> the same timestamp
>>>>>>>>>>>>>>>>>>>> value?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT
>>>>>>>>>>>>>>>>>>>> CURRENT_TIMESTAMP)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Since Postgres current_timestamp captures the
>>>>>>>>>>>>>>>>>>>> transaction start time, all rows added in the same insert 
>>>>>>>>>>>>>>>>>>>> transaction would
>>>>>>>>>>>>>>>>>>>> have the same value as the transaction timestamp with the 
>>>>>>>>>>>>>>>>>>>> column
>>>>>>>>>>>>>>>>>>>> definition above.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If we extend a similar semantic to Iceberg, all rows
>>>>>>>>>>>>>>>>>>>> added in the same Iceberg transaction/snapshot should have 
>>>>>>>>>>>>>>>>>>>> the same
>>>>>>>>>>>>>>>>>>>> timestamp?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Ryan, regarding your comment about using the
>>>>>>>>>>>>>>>>>>>> current_timestamp expression as the column default value: you were thinking
>>>>>>>>>>>>>>>>>>>> that the engine would set the column value to the wall clock time when
>>>>>>>>>>>>>>>>>>>> appending a row to a data file, right? In that case, almost every row would
>>>>>>>>>>>>>>>>>>>> have a different timestamp value.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> `current_timestamp` expression may not always carry
>>>>>>>>>>>>>>>>>>>>> the right semantic for the use cases. E.g., latency 
>>>>>>>>>>>>>>>>>>>>> tracking is interested
>>>>>>>>>>>>>>>>>>>>> in when records are added / committed to the table, not 
>>>>>>>>>>>>>>>>>>>>> when the record was
>>>>>>>>>>>>>>>>>>>>> appended to an uncommitted data file in the processing 
>>>>>>>>>>>>>>>>>>>>> engine.
>>>>>>>>>>>>>>>>>>>>> Record creation and Iceberg commit can be minutes or even 
>>>>>>>>>>>>>>>>>>>>> hours apart.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Row timestamp inherited from snapshot timestamp has no
>>>>>>>>>>>>>>>>>>>>> overhead with the initial commit and has very minimal 
>>>>>>>>>>>>>>>>>>>>> storage overhead
>>>>>>>>>>>>>>>>>>>>> during file rewrite. Per-row current_timestamp would have 
>>>>>>>>>>>>>>>>>>>>> distinct values
>>>>>>>>>>>>>>>>>>>>> for every row and has more storage overhead.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> OLTP databases deal with small row-level transactions.
>>>>>>>>>>>>>>>>>>>>> Postgres `current_timestamp` captures the transaction 
>>>>>>>>>>>>>>>>>>>>> start time [1, 2].
>>>>>>>>>>>>>>>>>>>>> Should we extend the same semantic to Iceberg: all rows 
>>>>>>>>>>>>>>>>>>>>> added in the same
>>>>>>>>>>>>>>>>>>>>> snapshot should have the same timestamp value?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this
>>>>>>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>> To support the created_by timestamp, we would need
>>>>>>>>>>>>>>>>>>>>>>> to implement the following row lineage behavior
>>>>>>>>>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp
>>>>>>>>>>>>>>>>>>>>>>> * during rewrite (like compaction), it should be
>>>>>>>>>>>>>>>>>>>>>>> persisted into data files.
>>>>>>>>>>>>>>>>>>>>>>> * during update, it needs to be carried over from
>>>>>>>>>>>>>>>>>>>>>>> the previous row. This is similar to the row_id carry 
>>>>>>>>>>>>>>>>>>>>>>> over for row updates.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Sorry for the shorthand.  These are not the same:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1.  Insertion time - time the row was inserted.
>>>>>>>>>>>>>>>>>>>>>> 2.  Created by - The system that created the record.
>>>>>>>>>>>>>>>>>>>>>> 3.  Updated by - The system that last updated the
>>>>>>>>>>>>>>>>>>>>>> record.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Depending on the exact use-case these might or might
>>>>>>>>>>>>>>>>>>>>>> not have utility.  I'm just wondering if there will be 
>>>>>>>>>>>>>>>>>>>>>> more examples like
>>>>>>>>>>>>>>>>>>>>>> this in the future.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> created_by column would likely incur significantly
>>>>>>>>>>>>>>>>>>>>>>> higher storage overhead compared to the updated_by
>>>>>>>>>>>>>>>>>>>>>>> column. As rows are
>>>>>>>>>>>>>>>>>>>>>>> updated over time, the cardinality for this column in
>>>>>>>>>>>>>>>>>>>>>>> data files can be
>>>>>>>>>>>>>>>>>>>>>>> high. Hence, the created_by column may not compress
>>>>>>>>>>>>>>>>>>>>>>> well. This is a similar
>>>>>>>>>>>>>>>>>>>>>>> problem for the row_id column. One side effect of
>>>>>>>>>>>>>>>>>>>>>>> enabling row lineage by
>>>>>>>>>>>>>>>>>>>>>>> default for V3 tables is the storage overhead of the row_id
>>>>>>>>>>>>>>>>>>>>>>> column after
>>>>>>>>>>>>>>>>>>>>>>> compaction, especially for narrow tables with few
>>>>>>>>>>>>>>>>>>>>>>> columns.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I agree.  I think this analysis also shows that some
>>>>>>>>>>>>>>>>>>>>>> consumers of Iceberg might not necessarily want to have 
>>>>>>>>>>>>>>>>>>>>>> all these columns,
>>>>>>>>>>>>>>>>>>>>>> so we might want to make them configurable, rather than 
>>>>>>>>>>>>>>>>>>>>>> mandating them for
>>>>>>>>>>>>>>>>>>>>>> all tables. Ryan's thought on default values seems like 
>>>>>>>>>>>>>>>>>>>>>> it would solve the
>>>>>>>>>>>>>>>>>>>>>> issues I was raising.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto 
>>>>>>>>>>>>>>>>>>>>>>> set the column value.
>>>>>>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to 
>>>>>>>>>>>>>>>>>>>>>>> the trigger timestamp.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Since the use cases don't require an exact
>>>>>>>>>>>>>>>>>>>>>>> timestamp, this seems like the best solution to get 
>>>>>>>>>>>>>>>>>>>>>>> what people want (an
>>>>>>>>>>>>>>>>>>>>>>> insertion timestamp) that has clear and well-defined 
>>>>>>>>>>>>>>>>>>>>>>> behavior. Since
>>>>>>>>>>>>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it 
>>>>>>>>>>>>>>>>>>>>>>> makes sense to me that
>>>>>>>>>>>>>>>>>>>>>>> we could use it and have reasonable behavior.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I've talked with Anton about this before and maybe
>>>>>>>>>>>>>>>>>>>>>>> he'll jump in on this thread. I think that we may need 
>>>>>>>>>>>>>>>>>>>>>>> to extend default
>>>>>>>>>>>>>>>>>>>>>>> values to include default value expressions, like 
>>>>>>>>>>>>>>>>>>>>>>> `current_timestamp` that
>>>>>>>>>>>>>>>>>>>>>>> is allowed by the SQL spec. That would solve the 
>>>>>>>>>>>>>>>>>>>>>>> problem as well as some
>>>>>>>>>>>>>>>>>>>>>>> others (like `current_date` or `current_user`) and 
>>>>>>>>>>>>>>>>>>>>>>> would not create a
>>>>>>>>>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp 
>>>>>>>>>>>>>>>>>>>>>>> feature in the format.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> > Also some environments may have stronger clock
>>>>>>>>>>>>>>>>>>>>>>> service, like Spanner TrueTime service.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder
>>>>>>>>>>>>>>>>>>>>>>> commits and make timestamps out of order. I don't think 
>>>>>>>>>>>>>>>>>>>>>>> that we should be
>>>>>>>>>>>>>>>>>>>>>>> making guarantees or even exposing metadata that people 
>>>>>>>>>>>>>>>>>>>>>>> might mistake as
>>>>>>>>>>>>>>>>>>>>>>> having those guarantees.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regarding the concern for reliable timestamps, we
>>>>>>>>>>>>>>>>>>>>>>>> are not proposing using timestamps for ordering. With 
>>>>>>>>>>>>>>>>>>>>>>>> NTP in modern
>>>>>>>>>>>>>>>>>>>>>>>> computers, they are generally reliable enough for the 
>>>>>>>>>>>>>>>>>>>>>>>> intended use cases.
>>>>>>>>>>>>>>>>>>>>>>>> Also some environments may have stronger clock 
>>>>>>>>>>>>>>>>>>>>>>>> service, like Spanner
>>>>>>>>>>>>>>>>>>>>>>>> TrueTime service
>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>
>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >  joining to timestamps from the snapshots
>>>>>>>>>>>>>>>>>>>>>>>> metadata table.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot
>>>>>>>>>>>>>>>>>>>>>>>> history, which is often retained for a few days due to 
>>>>>>>>>>>>>>>>>>>>>>>> performance reasons.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an 
>>>>>>>>>>>>>>>>>>>>>>>> implicit one from table
>>>>>>>>>>>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>>>>>>>> application developers. While some databases require 
>>>>>>>>>>>>>>>>>>>>>>>> an explicit column in
>>>>>>>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto 
>>>>>>>>>>>>>>>>>>>>>>>> set the column value.
>>>>>>>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to 
>>>>>>>>>>>>>>>>>>>>>>>> the trigger timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Also, the timestamp set during computation (like
>>>>>>>>>>>>>>>>>>>>>>>> streaming ingestion or relatively long batch 
>>>>>>>>>>>>>>>>>>>>>>>> computation) doesn't capture the
>>>>>>>>>>>>>>>>>>>>>>>> time the rows/files are added to the Iceberg table in 
>>>>>>>>>>>>>>>>>>>>>>>> a batch fashion.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a
>>>>>>>>>>>>>>>>>>>>>>>> longer history of snapshot timestamps, like storing a 
>>>>>>>>>>>>>>>>>>>>>>>> catalog's event log
>>>>>>>>>>>>>>>>>>>>>>>> for long-term access to timestamp info
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> This is not really consumable by joining a
>>>>>>>>>>>>>>>>>>>>>>>> regular table query with the catalog event log. I would
>>>>>>>>>>>>>>>>>>>>>>>> also imagine the catalog
>>>>>>>>>>>>>>>>>>>>>>>> event log is capped at a shorter retention (maybe a few
>>>>>>>>>>>>>>>>>>>>>>>> months) compared to
>>>>>>>>>>>>>>>>>>>>>>>> data retention (could be a few years).
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose
>>>>>>>>>>>>>>>>>>>>>>>>> timestamps at the row level. Timestamps in metadata 
>>>>>>>>>>>>>>>>>>>>>>>>> that would be carried
>>>>>>>>>>>>>>>>>>>>>>>>> down to the row level already confuse people that 
>>>>>>>>>>>>>>>>>>>>>>>>> expect them to be useful
>>>>>>>>>>>>>>>>>>>>>>>>> or reliable, rather than for debugging. I think 
>>>>>>>>>>>>>>>>>>>>>>>>> extending this to the row
>>>>>>>>>>>>>>>>>>>>>>>>> level would only make the problem worse.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> You can already get this information by projecting
>>>>>>>>>>>>>>>>>>>>>>>>> the last updated sequence number, which is reliable, 
>>>>>>>>>>>>>>>>>>>>>>>>> and joining to
>>>>>>>>>>>>>>>>>>>>>>>>> timestamps from the snapshots metadata table. Of 
>>>>>>>>>>>>>>>>>>>>>>>>> course, the drawback there
>>>>>>>>>>>>>>>>>>>>>>>>> is losing the timestamp information when snapshots 
>>>>>>>>>>>>>>>>>>>>>>>>> expire, but since it
>>>>>>>>>>>>>>>>>>>>>>>>> isn't reliable anyway I'd be fine with that.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and
>>>>>>>>>>>>>>>>>>>>>>>>> compliance, are probably better served by embedding a 
>>>>>>>>>>>>>>>>>>>>>>>>> timestamp in DML
>>>>>>>>>>>>>>>>>>>>>>>>> (like `current_timestamp`) rather than relying on an 
>>>>>>>>>>>>>>>>>>>>>>>>> implicit one from
>>>>>>>>>>>>>>>>>>>>>>>>> table metadata. And for those use cases, you could 
>>>>>>>>>>>>>>>>>>>>>>>>> also keep a longer
>>>>>>>>>>>>>>>>>>>>>>>>> history of snapshot timestamps, like storing a 
>>>>>>>>>>>>>>>>>>>>>>>>> catalog's event log for
>>>>>>>>>>>>>>>>>>>>>>>>> long-term access to timestamp info. I think that 
>>>>>>>>>>>>>>>>>>>>>>>>> would be better than
>>>>>>>>>>>>>>>>>>>>>>>>> storing it at the row level.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> For V4 spec, I have a small proposal [1] to
>>>>>>>>>>>>>>>>>>>>>>>>>> expose the row timestamp concept that can help with 
>>>>>>>>>>>>>>>>>>>>>>>>>> many use cases like
>>>>>>>>>>>>>>>>>>>>>>>>>> temporal queries, latency tracking, TTL, auditing 
>>>>>>>>>>>>>>>>>>>>>>>>>> and compliance.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata
>>>>>>>>>>>>>>>>>>>>>>>>>> column behaves very similarly to the
>>>>>>>>>>>>>>>>>>>>>>>>>> *_last_updated_sequence_number* for row lineage.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot
>>>>>>>>>>>>>>>>>>>>>>>>>>    timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its
>>>>>>>>>>>>>>>>>>>>>>>>>>    values are persisted in the data files.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
