For row timestamp inheritance to work, I would need to implement the plumbing. So I would imagine existing rows would have null values, because the inheritance plumbing was not there when they were written. This would be consistent with the upgrade behavior for V3 row lineage: https://iceberg.apache.org/spec/#row-lineage-for-upgraded-tables.
On Thu, Jan 22, 2026 at 4:09 PM Anton Okolnychyi <[email protected]> wrote:

Also, do we have a concrete plan for how to handle tables that would be upgraded to V4? What timestamp will we assign to existing rows?

On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <[email protected]> wrote:

If we ignore temporal queries that need strict snapshot boundaries and can't be solved completely using row timestamps in case of mutations, you mentioned other use cases where row timestamps may be helpful, like TTL and auditing. We can debate whether using CURRENT_TIMESTAMP() is enough for them, but I don't really see a point given that we already have row lineage in V3 and the storage overhead for one more field isn't likely to be noticeable. One of the problems with CURRENT_TIMESTAMP() is the required action by the user. Having a reliable row timestamp populated automatically is likely to be better, so +1.

On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:

Joining with snapshot history also has significant complexity. It requires retaining the entire snapshot history, probably with trimmed snapshot metadata. There are concerns about the size of the snapshot history for tables with frequent commits (like streaming ingestion). Do we maintain the unbounded trimmed snapshot history in the same table metadata, which could affect the metadata.json size? Or do we store it separately somewhere (like in the catalog), which would require the complexity of multi-entity transactions in the catalog?

On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <[email protected]> wrote:

I've gone back and forth on the inherited columns. The thing that keeps coming back to me is that I don't like that the only way to determine the timestamp associated with a row update/creation is to do a join back against table metadata. While that's doable, it feels user unfriendly.

On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]> wrote:

Anton, you are right that row-level deletes will be a problem for some of the mentioned use cases (like incremental processing). I have clarified the applicability of some use cases to "tables with inserts and updates only".

Right now, we are only tracking the modification/commit time (not the insertion time) in case of updates.

On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]> wrote:

I think there is clear consensus that making snapshot timestamps strictly increasing is a positive thing. I am also +1.

- How will row timestamps allow us to reliably implement incremental consumption independent of the snapshot retention, given that rows can be added AND removed in a particular time frame? How can we capture all changes by just looking at the latest snapshot?
- Some use cases in the doc need the insertion time and some need the last modification time. Do we plan to support both?
- What do we expect the behavior to be in UPDATE and MERGE operations?

To be clear: I am not opposed to this change, I just want to make sure I understand all the use cases that we aim to address and what would be required in engines.
On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <[email protected]> wrote:

+1 for improving how the commit timestamps are assigned monotonically, since this requirement has emerged over multiple discussions like notifications, multi-table transactions, time travel accuracy, and row timestamps. It would be good to have a single consistent way to represent and assign timestamps that could be leveraged across multiple features.

On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:

Yeah, to add my perspective on that discussion, I think my primary concern is that people expect timestamps to be monotonic, and if they aren't then a `_last_update_timestamp` field just makes the problem worse. But it is _nice_ to have row-level timestamps. So I would be okay if we revisit how we assign commit timestamps and improve it so that you get monotonic behavior.

On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:

We had an offline discussion with Ryan. I revised the proposal as follows.

1. V4 would require writers to generate *monotonic* snapshot timestamps. The proposal doc has a section that describes a recommended implementation using Lamport timestamps.
2. Expose a *last_update_timestamp* metadata column that inherits from the snapshot timestamp.

This is a relatively low-friction change that can fix the time travel problem and enable use cases like latency tracking, temporal queries, TTL, and auditing.

There is no accuracy requirement on the timestamp values. In practice, modern servers with NTP have pretty reliable wall clocks. E.g., the Java library implements a validation <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377> that protects against backward clock drift of up to one minute for snapshot timestamps. I don't think we have heard many complaints of commit failures due to that clock drift validation.

Would appreciate feedback on the revised proposal.

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0

Thanks,
Steven
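A minimal sketch of the Lamport-style assignment described above (an assumption about what the recommended implementation could look like, not text from the proposal doc): the committer takes the maximum of its wall clock and the previously assigned timestamp plus one, so snapshot timestamps never move backwards even when the clock does.

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative sketch only: monotonic snapshot timestamps via a Lamport-style rule.
    // Each new timestamp is max(wall clock, last assigned + 1), so assigned values never
    // go backwards even if the local clock drifts behind the previous commit.
    class MonotonicTimestampAssigner {
      private final AtomicLong lastAssignedNanos;

      MonotonicTimestampAssigner(long lastKnownSnapshotTimestampNanos) {
        this.lastAssignedNanos = new AtomicLong(lastKnownSnapshotTimestampNanos);
      }

      long nextSnapshotTimestampNanos() {
        long wallClockNanos = System.currentTimeMillis() * 1_000_000L; // millis -> nanos
        return lastAssignedNanos.updateAndGet(prev -> Math.max(wallClockNanos, prev + 1));
      }
    }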
On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]> wrote:

Steven, I was referring to the fact that CURRENT_TIMESTAMP() is usually evaluated quite early in engines, so we could theoretically have another expression closer to the commit time. You are right, though, that it won't be the actual commit time given that we have to write it into the files. Also, I don't think generating a timestamp for a row as it is being written is going to be beneficial. To sum up, expression-based defaults would allow us to capture the time the transaction or write starts, but not the actual commit time.

Russell, if the goal is to know what happened to the table in a given time frame, isn't the changelog scan the way to go? It would assign commit ordinals based on lineage and include row-level diffs. How would you be able to determine changes with row timestamps by just looking at the latest snapshot?

It does seem promising to make snapshot timestamps strictly increasing to avoid ambiguity during time travel.

On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:

> Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense.

I'm not sure I get it then. If we want monotonically increasing times, but they don't have to be real times, then how do you know what notion of "time" you care about for these filters? Or to put it another way, how do you know that your "before" and "after" times are reasonable? If the boundaries of these time queries can move around a bit, by how much?

It seems to me that row IDs can play an important role here because you have the order guarantee that we seem to want for this use case: if snapshot A was committed before snapshot B, then the rows from A have row IDs that are always less than the row IDs of B. The problem is that we don't know where those row IDs start and end once A and B are no longer tracked. Using a "timestamp" seems to work, but I still worry that without reliable timestamps that correspond with some guarantee to real timestamps, we are creating a feature that seems reliable but isn't.

I'm somewhat open to the idea of introducing a snapshot timestamp that the catalog guarantees is monotonically increasing. But if we did that, wouldn't we still need to know the association between these timestamps and snapshots after the snapshot metadata expires? My mental model is that this would be used to look for data that arrived, say, 3 weeks ago on Dec 24th. Since the snapshot metadata is no longer around, we could use the row timestamp to find those rows. But how do we know that the snapshot timestamps correspond to the actual timestamp range of Dec 24th? Is it just "close enough" as long as we don't have out-of-order timestamps? This is what I mean by needing to keep track of the association between timestamps and snapshots after the metadata expires. It seems like you either need to keep track of what the catalog's clock was for events you care about, or you don't really care about exact timestamps.

On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <[email protected]> wrote:

The key goal here is the ability to answer the question "what happened to the table in some time window (before < t < after)?" Whether or not "t" is an atomic clock time is not as important as the query between time bounds making sense. Downstream applications (from what I know) are mostly sensitive to getting discrete and well-defined answers to this question, like:

1 < t < 2 should be exclusive of
2 < t < 3 should be exclusive of
3 < t < 4

And the union of these should be the same as the query asking for 1 < t < 4.

Currently this is not possible because we have no guarantee of ordering in our timestamps:

Snapshots:        A  -> B  -> C
Sequence numbers: 50 -> 51 -> 52
Timestamps:       3  -> 1  -> 2

This makes time travel always a little wrong to start with.

The Java implementation only allows one minute of negative time on commit, so we actually kind of do have this as a "light monotonicity" requirement, but as noted above there is no spec requirement for this. While we do have the sequence number and row id, we still don't have a stable way of associating these with a consistent time in an engine-independent way.

Ideally we just want to have one consistent way of answering the question "what did the table look like at time t", which I think we get by adding a new field that is a timestamp, set by the catalog close to commit time, that always goes up.

I'm not sure we can really do this with an engine expression since engines won't know when the data is actually committed when writing files?
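As a rough paraphrase of the one-minute tolerance Russell and Steven mention (illustrative only; the actual check lives in TableMetadata in the Java library, linked earlier in the thread):

    // Paraphrase of the "one minute of negative time" validation discussed above.
    // A new snapshot timestamp may not be more than one minute older than the last one.
    class CommitTimestampValidation {
      private static final long ONE_MINUTE_MILLIS = 60_000L;

      static void validate(long lastSnapshotTimestampMillis, long newSnapshotTimestampMillis) {
        if (newSnapshotTimestampMillis < lastSnapshotTimestampMillis - ONE_MINUTE_MILLIS) {
          throw new IllegalStateException(
              "Snapshot timestamp moved backwards by more than one minute");
        }
      }
    }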
On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <[email protected]> wrote:

This seems like a lot of new complexity in the format. I would like us to explore whether we can build the considered use cases on top of expression-based defaults instead.

We already plan to support CURRENT_TIMESTAMP() and similar functions that are part of the SQL standard definition for default values. This would provide us a way to know the relative row order. True, this usually will represent the start of the operation. We may define COMMIT_TIMESTAMP() or a similar expression for the actual commit time, if there are use cases that need that. Plus, we may explore an approach similar to MySQL that allows users to reset the default value on update.

- Anton

On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <[email protected]> wrote:

I think this is the right step forward. Our current "timestamp" definition is too ambiguous to be useful, so establishing a well-defined and monotonic timestamp could be really great. I also like the ability for rows to know this value without having to rely on snapshot information, which can be expired.
On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]> wrote:

Hi all,

I have revised the row timestamp proposal with the following changes.
* a new commit_timestamp field in snapshot metadata that has nanosecond precision
* this optional field is only set by the REST catalog server
* it needs to be monotonic (e.g. implemented using a Lamport timestamp)

https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh

Thanks,
Steven

On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:

Thanks for the clarification, Ryan.

For long-running streaming jobs that commit periodically, it is difficult to establish a constant value of current_timestamp across all writer tasks for each commit cycle. I guess streaming writers may just need to write the wall clock time when appending a row to a data file for the default value of current_timestamp.

On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:

I don't think that every row would have a different value. That would be up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP` into the plan and then replace it with a constant, resulting in a consistent value for all rows.

You're right that this would not necessarily be the commit time. But neither is the commit timestamp from Iceberg's snapshot. I'm not sure how we are going to define "good enough" for this purpose. I think at least `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it is handled in engines. And if you want the Iceberg timestamp, then use a periodic query of the snapshots table to keep track of them in a table you can join to. I don't think this rises to the need for a table feature unless we can guarantee that it is correct.
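To make the contrast concrete, here is a toy sketch (not how any particular engine implements defaults) of the two behaviors being discussed: folding CURRENT_TIMESTAMP to a single constant per write stamps every row with the same value, while per-row evaluation yields nearly distinct values.

    import java.time.Instant;
    import java.util.Collections;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    // Toy illustration of the two CURRENT_TIMESTAMP behaviors discussed above.
    class CurrentTimestampDefaultDemo {
      // Constant-folded: evaluated once, shared by all rows in the write.
      static List<Instant> constantFolded(int rowCount) {
        Instant writeTime = Instant.now();
        return Collections.nCopies(rowCount, writeTime);
      }

      // Per-row evaluation: each appended row reads the wall clock again.
      static List<Instant> perRow(int rowCount) {
        return IntStream.range(0, rowCount)
            .mapToObj(i -> Instant.now())
            .collect(Collectors.toList());
      }
    }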
On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:

> Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

Let me clarify my last comment.

created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP

Since Postgres current_timestamp captures the transaction start time, all rows added in the same insert transaction would have the same value as the transaction timestamp with the column definition above.

If we extend a similar semantic to Iceberg, should all rows added in the same Iceberg transaction/snapshot have the same timestamp?

Ryan, to make sure I understand your comment about using the current_timestamp expression as a column default value: you were thinking that the engine would set the column value to the wall clock time when appending a row to a data file, right? Then almost every row would have a different timestamp value.

On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:

The `current_timestamp` expression may not always carry the right semantics for the use cases. E.g., latency tracking is interested in when records are added / committed to the table, not when the record was appended to an uncommitted data file in the processing engine. Record creation and Iceberg commit can be minutes or even hours apart.

A row timestamp inherited from the snapshot timestamp has no overhead at the initial commit and has very minimal storage overhead during file rewrite. A per-row current_timestamp would have distinct values for every row and has more storage overhead.

OLTP databases deal with small row-level transactions. Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?

[1] https://www.postgresql.org/docs/current/functions-datetime.html
[2] https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]> wrote:

> Micah, are 1 and 2 the same? 3 is covered by this proposal. To support the created_by timestamp, we would need to implement the following row lineage behavior:
> * Initially, it inherits from the snapshot timestamp.
> * During rewrite (like compaction), it should be persisted into data files.
> * During update, it needs to be carried over from the previous row. This is similar to the row_id carry-over for row updates.

Sorry for the shorthand. These are not the same:

1. Insertion time - the time the row was inserted.
2. Created by - the system that created the record.
3. Updated by - the system that last updated the record.

Depending on the exact use case these might or might not have utility. I'm just wondering if there will be more examples like this in the future.

> The created_by column would likely incur significantly higher storage overhead compared to the updated_by column. As rows are updated over time, the cardinality of this column in data files can be high. Hence, the created_by column may not compress well. This is a similar problem for the row_id column. One side effect of enabling row lineage by default for V3 tables is the storage overhead of the row_id column after compaction, especially for narrow tables with few columns.

I agree. I think this analysis also shows that some consumers of Iceberg might not necessarily want to have all these columns, so we might want to make them configurable rather than mandating them for all tables. Ryan's thought on default values seems like it would solve the issues I was raising.

Thanks,
Micah
On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:

> An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Since the use cases don't require an exact timestamp, this seems like the best solution to get what people want (an insertion timestamp) that has clear and well-defined behavior. Since `current_timestamp` is defined by the SQL spec, it makes sense to me that we could use it and have reasonable behavior.

I've talked with Anton about this before and maybe he'll jump in on this thread. I think that we may need to extend default values to include default value expressions, like `current_timestamp`, which is allowed by the SQL spec. That would solve the problem as well as some others (like `current_date` or `current_user`) and would not create a potentially misleading (and heavyweight) timestamp feature in the format.

> Also some environments may have a stronger clock service, like the Spanner TrueTime service.

Even in cases like this, commit retries can reorder commits and make timestamps out of order. I don't think that we should be making guarantees or even exposing metadata that people might mistake as having those guarantees.

On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:

Ryan, thanks a lot for the feedback!

Regarding the concern about reliable timestamps, we are not proposing using timestamps for ordering. With NTP in modern computers, they are generally reliable enough for the intended use cases. Also, some environments may have a stronger clock service, like the Spanner TrueTime service <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.

> joining to timestamps from the snapshots metadata table.

As you also mentioned, it depends on the snapshot history, which is often retained for only a few days due to performance reasons.

> embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata.

An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.

Also, a timestamp set during computation (like streaming ingestion or a relatively long batch computation) doesn't capture the time the rows/files are added to the Iceberg table in a batch fashion.

> And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info.

This is not really consumable by joining a regular table query with the catalog event log. I would also imagine the catalog event log is capped at a shorter retention (maybe a few months) compared to data retention (which could be a few years).
On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:

I don't think it is a good idea to expose timestamps at the row level. Timestamps in metadata that would be carried down to the row level already confuse people that expect them to be useful or reliable, rather than for debugging. I think extending this to the row level would only make the problem worse.

You can already get this information by projecting the last updated sequence number, which is reliable, and joining to timestamps from the snapshots metadata table. Of course, the drawback there is losing the timestamp information when snapshots expire, but since it isn't reliable anyway I'd be fine with that.

Some of the use cases, like auditing and compliance, are probably better served by embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata. And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info. I think that would be better than storing it at the row level.
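A rough sketch of the join Ryan describes (names here are assumptions for illustration: `db.events` is a hypothetical table, and `snapshot_seq` is a view assumed to map sequence numbers to commit timestamps, e.g. maintained from periodic queries of the snapshots metadata table; engine support for the `_last_updated_sequence_number` metadata column may vary):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Illustrative only: approximate a per-row "last modified" time by joining the
    // _last_updated_sequence_number metadata column to per-snapshot commit times.
    // snapshot_seq is an assumed view mapping sequence_number -> committed_at.
    class RowTimestampViaJoin {
      static Dataset<Row> lastModified(SparkSession spark) {
        return spark.sql(
            "SELECT t.*, s.committed_at AS approx_last_modified_at "
                + "FROM db.events t "
                + "JOIN snapshot_seq s "
                + "  ON t._last_updated_sequence_number = s.sequence_number");
      }
    }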
On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:

Hi,

For the V4 spec, I have a small proposal [1] to expose a row timestamp concept that can help with many use cases like temporal queries, latency tracking, TTL, auditing, and compliance.

This *_last_updated_timestamp_ms* metadata column behaves very similarly to the *_last_updated_sequence_number* column for row lineage.

- Initially, it inherits from the snapshot timestamp.
- During rewrite (like compaction), its values are persisted in the data files.

Would love to hear what you think.

Thanks,
Steven

[1] https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
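A minimal sketch of the inheritance rule the proposal describes (column and method names are illustrative): the value is absent in newly written data files and is resolved from the committing snapshot's timestamp at read time; once a file is rewritten, the resolved value is persisted and carried forward unchanged.

    // Minimal sketch of the proposed inheritance rule (illustrative; mirrors how
    // _last_updated_sequence_number inheritance works for row lineage):
    //  - value absent in the data file  -> inherit the committing snapshot's timestamp
    //  - value present (file rewritten) -> use the persisted value as-is
    class LastUpdatedTimestampInheritance {
      static long resolve(Long persistedTimestampMs, long snapshotTimestampMs) {
        return persistedTimestampMs != null ? persistedTimestampMs : snapshotTimestampMs;
      }
    }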
