Cool, sounds like a plan then? Thanks for answering all the questions,
Steven!

On Thu, Jan 22, 2026 at 6:29 PM Steven Wu <[email protected]> wrote:

> For row timestamp inheritance to work, I would need to implement the
> plumbing. So I would expect existing rows to have null values, because the
> inheritance plumbing was not there when they were written. This would be
> consistent with the upgrade behavior for V3 row lineage:
> https://iceberg.apache.org/spec/#row-lineage-for-upgraded-tables.
>
> On Thu, Jan 22, 2026 at 4:09 PM Anton Okolnychyi <[email protected]>
> wrote:
>
>> Also, do we have a concrete plan for how to handle tables that would be
>> upgraded to V4? What timestamp will we assign to existing rows?
>>
>> On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <[email protected]>
>> wrote:
>>
>>> If we ignore temporal queries that need strict snapshot boundaries and
>>> can't be solved completely using row timestamps in the presence of
>>> mutations, you mentioned other use cases where row timestamps may be
>>> helpful, like TTL and auditing. We can debate whether using
>>> CURRENT_TIMESTAMP() is enough for them, but I don't really see the point
>>> given that we already have row lineage in V3 and the storage overhead for
>>> one more field isn't likely to be noticeable. One of the problems with
>>> CURRENT_TIMESTAMP() is the required action by the user. Having a reliable
>>> row timestamp populated automatically is likely to be better, so +1.
>>>
>>> On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:
>>>
>>>> Joining with snapshot history also has significant complexity. It
>>>> requires retaining the entire snapshot history, probably with trimmed
>>>> snapshot metadata. There are concerns about the size of the snapshot history
>>>> for tables with frequent commits (like streaming ingestion). Do we maintain
>>>> the unbounded trimmed snapshot history in the same table metadata, which
>>>> could affect the metadata.json size? Or do we store it separately somewhere
>>>> (like in the catalog), which would require the complexity of multi-entity
>>>> transactions in the catalog?
>>>>
>>>>
>>>> On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <
>>>> [email protected]> wrote:
>>>>
>>>>> I've gone back and forth on the inherited columns. I think the thing
>>>>> that keeps coming back to me is that I don't
>>>>> like that the only way to determine the timestamp associated with a
>>>>> row update/creation is to do a join back
>>>>> against table metadata. While that's doable, it feels user-unfriendly.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Anton, you are right that row-level deletes will be a problem for
>>>>>> some of the mentioned use cases (like incremental processing). I have
>>>>>> clarified that the applicability of some use cases is limited to "tables
>>>>>> with inserts and updates only".
>>>>>>
>>>>>> Right now, we are only tracking modification/commit time (not
>>>>>> insertion time) in case of updates.
>>>>>>
>>>>>> On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think there is clear consensus that making snapshot timestamps
>>>>>>> strictly increasing is a positive thing. I am also +1.
>>>>>>>
>>>>>>> - How will row timestamps allow us to reliably implement incremental
>>>>>>> consumption independent of the snapshot retention given that rows can be
>>>>>>> added AND removed in a particular time frame? How can we capture all
>>>>>>> changes by just looking at the latest snapshot?
>>>>>>> - Some use cases in the doc need the insertion time and some need
>>>>>>> the last modification time. Do we plan to support both?
>>>>>>> - What do we expect the behavior to be in UPDATE and MERGE
>>>>>>> operations?
>>>>>>>
>>>>>>> To be clear: I am not opposed to this change, just want to make sure
>>>>>>> I understand all use cases that we aim to address and what would be
>>>>>>> required in engines.
>>>>>>>
>>>>>>> On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> +1 for improving how commit timestamps are assigned so that they are
>>>>>>>> monotonic, since this requirement has emerged across multiple
>>>>>>>> discussions like notifications, multi-table transactions, time travel
>>>>>>>> accuracy, and row timestamps. It would be good to have a single
>>>>>>>> consistent way to represent and assign timestamps that could be
>>>>>>>> leveraged across multiple features.
>>>>>>>>
>>>>>>>> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yeah, to add my perspective on that discussion, I think my primary
>>>>>>>>> concern is that people expect timestamps to be monotonic and if they 
>>>>>>>>> aren't
>>>>>>>>> then a `_last_update_timestamp` field just makes the problem worse. 
>>>>>>>>> But it
>>>>>>>>> is _nice_ to have row-level timestamps. So I would be okay if we 
>>>>>>>>> revisit
>>>>>>>>> how we assign commit timestamps and improve it so that you get 
>>>>>>>>> monotonic
>>>>>>>>> behavior.
>>>>>>>>>
>>>>>>>>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> We had an offline discussion with Ryan. I revised the proposal as
>>>>>>>>>> follows.
>>>>>>>>>>
>>>>>>>>>> 1. V4 would require writers to generate *monotonic* snapshot
>>>>>>>>>> timestamps. The proposal doc has a section that describes a recommended
>>>>>>>>>> implementation using Lamport timestamps.
>>>>>>>>>> 2. Expose a *last_update_timestamp* metadata column that inherits
>>>>>>>>>> from the snapshot timestamp.
>>>>>>>>>>
>>>>>>>>>> This is a relatively low-friction change that can fix the time
>>>>>>>>>> travel problem and enable use cases like latency tracking, temporal
>>>>>>>>>> queries, TTL, and auditing.
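>>>>>>>>>>
>>>>>>>>>> For item 1, a minimal sketch of what a Lamport-style assignment could
>>>>>>>>>> look like on the writer/catalog side (the names here are hypothetical,
>>>>>>>>>> not from the proposal doc):
>>>>>>>>>>
>>>>>>>>>> // Sketch: take the wall-clock time, but never go backwards relative to
>>>>>>>>>> // the previous snapshot, so snapshot timestamps stay monotonic even if
>>>>>>>>>> // the local clock drifts.
>>>>>>>>>> long assignCommitTimestampMs(long lastSnapshotTimestampMs) {
>>>>>>>>>>   long wallClockMs = System.currentTimeMillis();
>>>>>>>>>>   return Math.max(wallClockMs, lastSnapshotTimestampMs + 1);
>>>>>>>>>> }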
>>>>>>>>>>
>>>>>>>>>> There is no accuracy requirement on the timestamp values. In
>>>>>>>>>> practice, modern servers with NTP have pretty reliable wall clocks. E.g.,
>>>>>>>>>> the Java library implements this validation
>>>>>>>>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
>>>>>>>>>> that protects against backward clock drift of up to one minute for snapshot
>>>>>>>>>> timestamps. I don't think we have heard many complaints of commit failures
>>>>>>>>>> due to that clock drift validation.
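>>>>>>>>>>
>>>>>>>>>> Paraphrasing, the linked check is roughly of this shape (a simplified
>>>>>>>>>> sketch, not the exact code; see the link above for the real version):
>>>>>>>>>>
>>>>>>>>>> // Simplified sketch of the existing snapshot-timestamp sanity check:
>>>>>>>>>> // a new snapshot may not be more than one minute older than the last one.
>>>>>>>>>> long ONE_MINUTE_MS = 60_000L;
>>>>>>>>>> if (newSnapshotTimestampMs < lastSnapshotTimestampMs - ONE_MINUTE_MS) {
>>>>>>>>>>   throw new IllegalStateException(
>>>>>>>>>>       "Invalid snapshot timestamp: older than the last snapshot by more than one minute");
>>>>>>>>>> }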
>>>>>>>>>>
>>>>>>>>>> Would appreciate feedback on the revised proposal.
>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is
>>>>>>>>>>> usually evaluated quite early in engines so we could theoretically 
>>>>>>>>>>> have
>>>>>>>>>>> another expression closer to the commit time. You are right, 
>>>>>>>>>>> though, it
>>>>>>>>>>> won't be the actual commit time given that we have to write it into 
>>>>>>>>>>> the
>>>>>>>>>>> files. Also, I don't think generating a timestamp for a row as it 
>>>>>>>>>>> is being
>>>>>>>>>>> written is going to be beneficial. To sum up, expression-based 
>>>>>>>>>>> defaults
>>>>>>>>>>> would allow us to capture the time the transaction or write starts, 
>>>>>>>>>>> but not
>>>>>>>>>>> the actual commit time.
>>>>>>>>>>>
>>>>>>>>>>> Russell, if the goal is to know what happened to the table in a
>>>>>>>>>>> given time frame, isn't the changelog scan the way to go? It would 
>>>>>>>>>>> assign
>>>>>>>>>>> commit ordinals based on lineage and include row-level diffs. How 
>>>>>>>>>>> would you
>>>>>>>>>>> be able to determine changes with row timestamps by just looking at 
>>>>>>>>>>> the
>>>>>>>>>>> latest snapshot?
>>>>>>>>>>>
>>>>>>>>>>> It does seem promising to make snapshot timestamps strictly
>>>>>>>>>>> increasing to avoid ambiguity during time travel.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > Whether or not "t" is an atomic clock time is not as
>>>>>>>>>>>> important as the query between time bounds making sense.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure I get it then. If we want monotonically increasing
>>>>>>>>>>>> times, but they don't have to be real times then how do you know 
>>>>>>>>>>>> what
>>>>>>>>>>>> notion of "time" you care about for these filters? Or to put it 
>>>>>>>>>>>> another
>>>>>>>>>>>> way, how do you know that your "before" and "after" times are 
>>>>>>>>>>>> reasonable?
>>>>>>>>>>>> If the boundaries of these time queries can move around a bit, by 
>>>>>>>>>>>> how much?
>>>>>>>>>>>>
>>>>>>>>>>>> It seems to me that row IDs can play an important role here
>>>>>>>>>>>> because you have the order guarantee that we seem to want for this 
>>>>>>>>>>>> use
>>>>>>>>>>>> case: if snapshot A was committed before snapshot B, then the rows 
>>>>>>>>>>>> from A
>>>>>>>>>>>> have row IDs that are always less than the row IDs of B. The
>>>>>>>>>>>> problem is
>>>>>>>>>>>> that we don't know where those row IDs start and end once A and B 
>>>>>>>>>>>> are no
>>>>>>>>>>>> longer tracked. Using a "timestamp" seems to work, but I still 
>>>>>>>>>>>> worry that
>>>>>>>>>>>> without reliable timestamps that correspond with some guarantee to 
>>>>>>>>>>>> real
>>>>>>>>>>>> timestamps, we are creating a feature that seems reliable but 
>>>>>>>>>>>> isn't.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm somewhat open to the idea of introducing a snapshot
>>>>>>>>>>>> timestamp that the catalog guarantees is monotonically increasing. 
>>>>>>>>>>>> But if
>>>>>>>>>>>> we did that, wouldn't we still need to know the association 
>>>>>>>>>>>> between these
>>>>>>>>>>>> timestamps and snapshots after the snapshot metadata expires? My 
>>>>>>>>>>>> mental
>>>>>>>>>>>> model is that this would be used to look for data that arrived, 
>>>>>>>>>>>> say, 3
>>>>>>>>>>>> weeks ago on Dec 24th. Since the snapshots metadata is no longer 
>>>>>>>>>>>> around we
>>>>>>>>>>>> could use the row timestamp to find those rows. But how do we know 
>>>>>>>>>>>> that the
>>>>>>>>>>>> snapshot timestamps correspond to the actual timestamp range of 
>>>>>>>>>>>> Dec 24th?
>>>>>>>>>>>> Is it just "close enough" as long as we don't have out of order 
>>>>>>>>>>>> timestamps?
>>>>>>>>>>>> This is what I mean by needing to keep track of the association 
>>>>>>>>>>>> between
>>>>>>>>>>>> timestamps and snapshots after the metadata expires. Seems like 
>>>>>>>>>>>> you either
>>>>>>>>>>>> need to keep track of what the catalog's clock was for events you 
>>>>>>>>>>>> care
>>>>>>>>>>>> about, or you don't really care about exact timestamps.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The key goal here is the ability to answer the question "what
>>>>>>>>>>>>> happened to the table in some time window (before < t < after)?"
>>>>>>>>>>>>> Whether or not "t" is an atomic clock time is not as important
>>>>>>>>>>>>> as the query between time bounds making sense.
>>>>>>>>>>>>> Downstream applications (from what I know) are mostly
>>>>>>>>>>>>> sensitive to getting discrete and well-defined answers to
>>>>>>>>>>>>> this question, like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The results for 1 < t < 2 should be exclusive of
>>>>>>>>>>>>> those for 2 < t < 3, which should be exclusive of
>>>>>>>>>>>>> those for 3 < t < 4.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And the union of these should be the same as the query asking
>>>>>>>>>>>>> for 1 < t < 4.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently this is not possible because we have no guarantee of
>>>>>>>>>>>>> ordering in our timestamps:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Snapshots
>>>>>>>>>>>>> A -> B -> C
>>>>>>>>>>>>> Sequence numbers
>>>>>>>>>>>>> 50 -> 51 ->  52
>>>>>>>>>>>>> Timestamps
>>>>>>>>>>>>> 3 -> 1 -> 2
>>>>>>>>>>>>>
>>>>>>>>>>>>> This makes time travel always a little wrong to start with.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The Java implementation only allows one minute of negative
>>>>>>>>>>>>> time on commit, so we actually kind of do have this as a
>>>>>>>>>>>>> "light monotonicity" requirement, but as noted above there is
>>>>>>>>>>>>> no spec requirement for this. While we do have sequence
>>>>>>>>>>>>> numbers and row IDs, we still don't have a stable way of
>>>>>>>>>>>>> associating these with a consistent time in an engine-independent way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ideally we just want to have one consistent way of answering
>>>>>>>>>>>>> the question "what did the table look like at time t"
>>>>>>>>>>>>> which I think we get by adding in a new field that is a
>>>>>>>>>>>>> timestamp, set by the Catalog close to commit time,
>>>>>>>>>>>>> that always goes up.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure we can really do this with an engine expression
>>>>>>>>>>>>> since they won't know when the data is actually committed
>>>>>>>>>>>>> when writing files?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> This seems like a lot of new complexity in the format. I
>>>>>>>>>>>>>> would like us to explore whether we can build the considered use 
>>>>>>>>>>>>>> cases on
>>>>>>>>>>>>>> top of expression-based defaults instead.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar
>>>>>>>>>>>>>> functions that are part of the SQL standard definition for 
>>>>>>>>>>>>>> default values.
>>>>>>>>>>>>>> This would provide us a way to know the relative row order. 
>>>>>>>>>>>>>> True, this
>>>>>>>>>>>>>> usually will represent the start of the operation. We may define
>>>>>>>>>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual commit 
>>>>>>>>>>>>>> time, if
>>>>>>>>>>>>>> there are use cases that need that. Plus, we may explore an 
>>>>>>>>>>>>>> approach
>>>>>>>>>>>>>> similar to MySQL that allows users to reset the default value on 
>>>>>>>>>>>>>> update.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this is the right step forward. Our current
>>>>>>>>>>>>>>> "timestamp" definition is too ambiguous to be useful, so establishing
>>>>>>>>>>>>>>> a well-defined and monotonic timestamp could be really
>>>>>>>>>>>>>>> great. I also like the ability for rows to know this value without
>>>>>>>>>>>>>>> having to rely on snapshot information, which can be expired.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have revised the row timestamp proposal with the
>>>>>>>>>>>>>>>> following changes.
>>>>>>>>>>>>>>>> * a new commit_timestamp field in snapshot metadata that
>>>>>>>>>>>>>>>> has nanosecond precision
>>>>>>>>>>>>>>>> * this optional field is only set by the REST catalog server
>>>>>>>>>>>>>>>> * it needs to be monotonic (e.g. implemented using Lamport
>>>>>>>>>>>>>>>> timestamps)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For long-running streaming jobs that commit periodically,
>>>>>>>>>>>>>>>>> it is difficult to establish a constant value of current_timestamp across
>>>>>>>>>>>>>>>>> all writer tasks for each commit cycle. I guess streaming writers may just
>>>>>>>>>>>>>>>>> need to write the wall-clock time at the moment a row is appended to a
>>>>>>>>>>>>>>>>> data file as the default value of current_timestamp.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think that every row would have a different
>>>>>>>>>>>>>>>>>> value. That would be up to the engine, but I would expect 
>>>>>>>>>>>>>>>>>> engines to insert
>>>>>>>>>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a 
>>>>>>>>>>>>>>>>>> constant,
>>>>>>>>>>>>>>>>>> resulting in a consistent value for all rows.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You're right that this would not necessarily be the
>>>>>>>>>>>>>>>>>> commit time. But neither is the commit timestamp from 
>>>>>>>>>>>>>>>>>> Iceberg's snapshot.
>>>>>>>>>>>>>>>>>> I'm not sure how we are going to define "good enough" for 
>>>>>>>>>>>>>>>>>> this purpose. I
>>>>>>>>>>>>>>>>>> think at least `CURRENT_TIMESTAMP` has reliable and known 
>>>>>>>>>>>>>>>>>> behavior when you
>>>>>>>>>>>>>>>>>> look at how it is handled in engines. And if you want the 
>>>>>>>>>>>>>>>>>> Iceberg
>>>>>>>>>>>>>>>>>> timestamp, then use a periodic query of the snapshots table
>>>>>>>>>>>>>>>>>> to keep track
>>>>>>>>>>>>>>>>>> of them in a table you can join to. I don't think this rises 
>>>>>>>>>>>>>>>>>> to the need
>>>>>>>>>>>>>>>>>> for a table feature unless we can guarantee that it is 
>>>>>>>>>>>>>>>>>> correct.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > Postgres `current_timestamp` captures the
>>>>>>>>>>>>>>>>>>> transaction start time [1, 2]. Should we extend the same 
>>>>>>>>>>>>>>>>>>> semantic to
>>>>>>>>>>>>>>>>>>> Iceberg: all rows added in the same snapshot should have 
>>>>>>>>>>>>>>>>>>> the same timestamp
>>>>>>>>>>>>>>>>>>> value?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT
>>>>>>>>>>>>>>>>>>> CURRENT_TIMESTAMP)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Since Postgres current_timestamp captures the
>>>>>>>>>>>>>>>>>>> transaction start time, all rows added in the same insert transaction would
>>>>>>>>>>>>>>>>>>> have the same value (the transaction timestamp) with the column
>>>>>>>>>>>>>>>>>>> definition above.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If we extend a similar semantic to Iceberg, all rows
>>>>>>>>>>>>>>>>>>> added in the same Iceberg transaction/snapshot should have 
>>>>>>>>>>>>>>>>>>> the same
>>>>>>>>>>>>>>>>>>> timestamp?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ryan, regarding your comment about using the
>>>>>>>>>>>>>>>>>>> current_timestamp expression as a column default value: you were thinking
>>>>>>>>>>>>>>>>>>> that the engine would set the column value to the wall-clock time when
>>>>>>>>>>>>>>>>>>> appending a row to a data file, right? Almost every row would have a
>>>>>>>>>>>>>>>>>>> different timestamp value.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The `current_timestamp` expression may not always carry the
>>>>>>>>>>>>>>>>>>>> right semantics for the use cases. E.g., latency tracking is interested in
>>>>>>>>>>>>>>>>>>>> when records are added/committed to the table, not when the record was
>>>>>>>>>>>>>>>>>>>> appended to an uncommitted data file in the processing engine.
>>>>>>>>>>>>>>>>>>>> Record creation and the Iceberg commit can be minutes or even hours apart.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> A row timestamp inherited from the snapshot timestamp has no
>>>>>>>>>>>>>>>>>>>> overhead for the initial commit and very minimal storage overhead
>>>>>>>>>>>>>>>>>>>> during file rewrites. A per-row current_timestamp would have distinct values
>>>>>>>>>>>>>>>>>>>> for every row and therefore more storage overhead.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> OLTP databases deal with small row-level transactions.
>>>>>>>>>>>>>>>>>>>> Postgres `current_timestamp` captures the transaction 
>>>>>>>>>>>>>>>>>>>> start time [1, 2].
>>>>>>>>>>>>>>>>>>>> Should we extend the same semantic to Iceberg: all rows 
>>>>>>>>>>>>>>>>>>>> added in the same
>>>>>>>>>>>>>>>>>>>> snapshot should have the same timestamp value?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this
>>>>>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>> To support the created_by timestamp, we would need to
>>>>>>>>>>>>>>>>>>>>>> implement the following row lineage behavior
>>>>>>>>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp
>>>>>>>>>>>>>>>>>>>>>> * during rewrite (like compaction), it should be
>>>>>>>>>>>>>>>>>>>>>> persisted into data files.
>>>>>>>>>>>>>>>>>>>>>> * during update, it needs to be carried over from the
>>>>>>>>>>>>>>>>>>>>>> previous row. This is similar to the row_id carry over 
>>>>>>>>>>>>>>>>>>>>>> for row updates.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Sorry for the shorthand. These are not the same:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1.  Insertion time - time the row was inserted.
>>>>>>>>>>>>>>>>>>>>> 2.  Created by - The system that created the record.
>>>>>>>>>>>>>>>>>>>>> 3.  Updated by - The system that last updated the
>>>>>>>>>>>>>>>>>>>>> record.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Depending on the exact use-case these might or might
>>>>>>>>>>>>>>>>>>>>> not have utility.  I'm just wondering if there will be 
>>>>>>>>>>>>>>>>>>>>> more example like
>>>>>>>>>>>>>>>>>>>>> this in the future.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The created_by column would likely incur significantly
>>>>>>>>>>>>>>>>>>>>>> higher storage overhead compared to the updated_by column. As rows are
>>>>>>>>>>>>>>>>>>>>>> updated over time, the cardinality for this column in
>>>>>>>>>>>>>>>>>>>>>> data files can be
>>>>>>>>>>>>>>>>>>>>>> high. Hence, the created_by column may not compress 
>>>>>>>>>>>>>>>>>>>>>> well. This is a similar
>>>>>>>>>>>>>>>>>>>>>> problem for the row_id column. One side effect of 
>>>>>>>>>>>>>>>>>>>>>> enabling row lineage by
>>>>>>>>>>>>>>>>>>>>>> default for V3 tables is the storage overhead of row_id 
>>>>>>>>>>>>>>>>>>>>>> column after
>>>>>>>>>>>>>>>>>>>>>> compaction especially for narrow tables with few columns.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I agree.  I think this analysis also shows that some
>>>>>>>>>>>>>>>>>>>>> consumers of Iceberg might not necessarily want to have 
>>>>>>>>>>>>>>>>>>>>> all these columns,
>>>>>>>>>>>>>>>>>>>>> so we might want to make them configurable, rather than 
>>>>>>>>>>>>>>>>>>>>> mandating them for
>>>>>>>>>>>>>>>>>>>>> all tables. Ryan's thought on default values seems like 
>>>>>>>>>>>>>>>>>>>>> it would solve the
>>>>>>>>>>>>>>>>>>>>> issues I was raising.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set 
>>>>>>>>>>>>>>>>>>>>>> the column value.
>>>>>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to 
>>>>>>>>>>>>>>>>>>>>>> the trigger timestamp.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Since the use cases don't require an exact timestamp,
>>>>>>>>>>>>>>>>>>>>>> this seems like the best solution to get what people 
>>>>>>>>>>>>>>>>>>>>>> want (an insertion
>>>>>>>>>>>>>>>>>>>>>> timestamp) that has clear and well-defined behavior. 
>>>>>>>>>>>>>>>>>>>>>> Since
>>>>>>>>>>>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it makes 
>>>>>>>>>>>>>>>>>>>>>> sense to me that
>>>>>>>>>>>>>>>>>>>>>> we could use it and have reasonable behavior.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I've talked with Anton about this before and maybe
>>>>>>>>>>>>>>>>>>>>>> he'll jump in on this thread. I think that we may need 
>>>>>>>>>>>>>>>>>>>>>> to extend default
>>>>>>>>>>>>>>>>>>>>>> values to include default value expressions, like 
>>>>>>>>>>>>>>>>>>>>>> `current_timestamp` that
>>>>>>>>>>>>>>>>>>>>>> is allowed by the SQL spec. That would solve the problem 
>>>>>>>>>>>>>>>>>>>>>> as well as some
>>>>>>>>>>>>>>>>>>>>>> others (like `current_date` or `current_user`) and would 
>>>>>>>>>>>>>>>>>>>>>> not create a
>>>>>>>>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp 
>>>>>>>>>>>>>>>>>>>>>> feature in the format.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> > Also some environments may have stronger clock
>>>>>>>>>>>>>>>>>>>>>> service, like Spanner TrueTime service.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder
>>>>>>>>>>>>>>>>>>>>>> commits and make timestamps out of order. I don't think 
>>>>>>>>>>>>>>>>>>>>>> that we should be
>>>>>>>>>>>>>>>>>>>>>> making guarantees or even exposing metadata that people 
>>>>>>>>>>>>>>>>>>>>>> might mistake as
>>>>>>>>>>>>>>>>>>>>>> having those guarantees.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Regarding the concern about reliable timestamps, we
>>>>>>>>>>>>>>>>>>>>>>> are not proposing using timestamps for ordering. With NTP in modern
>>>>>>>>>>>>>>>>>>>>>>> computers, they are generally reliable enough for the intended use cases.
>>>>>>>>>>>>>>>>>>>>>>> Also, some environments may have a stronger clock service, like the Spanner
>>>>>>>>>>>>>>>>>>>>>>> TrueTime service
>>>>>>>>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> >  joining to timestamps from the snapshots metadata
>>>>>>>>>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot
>>>>>>>>>>>>>>>>>>>>>>> history, which is often retained for a few days due to 
>>>>>>>>>>>>>>>>>>>>>>> performance reasons.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit 
>>>>>>>>>>>>>>>>>>>>>>> one from table
>>>>>>>>>>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto 
>>>>>>>>>>>>>>>>>>>>>>> set the column value.
>>>>>>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to 
>>>>>>>>>>>>>>>>>>>>>>> the trigger timestamp.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Also, a timestamp set during computation (like
>>>>>>>>>>>>>>>>>>>>>>> streaming ingestion or a relatively long batch computation) doesn't
>>>>>>>>>>>>>>>>>>>>>>> capture the time the rows/files are actually added to the Iceberg table
>>>>>>>>>>>>>>>>>>>>>>> in a batch fashion.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a
>>>>>>>>>>>>>>>>>>>>>>> longer history of snapshot timestamps, like storing a 
>>>>>>>>>>>>>>>>>>>>>>> catalog's event log
>>>>>>>>>>>>>>>>>>>>>>> for long-term access to timestamp info
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> This is not really consumable by joining a regular
>>>>>>>>>>>>>>>>>>>>>>> table query with the catalog event log. I would also imagine the catalog
>>>>>>>>>>>>>>>>>>>>>>> event log is capped at a shorter retention (maybe a few months) compared
>>>>>>>>>>>>>>>>>>>>>>> to data retention (which could be a few years).
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose
>>>>>>>>>>>>>>>>>>>>>>>> timestamps at the row level. Timestamps in metadata 
>>>>>>>>>>>>>>>>>>>>>>>> that would be carried
>>>>>>>>>>>>>>>>>>>>>>>> down to the row level already confuse people that 
>>>>>>>>>>>>>>>>>>>>>>>> expect them to be useful
>>>>>>>>>>>>>>>>>>>>>>>> or reliable, rather than for debugging. I think 
>>>>>>>>>>>>>>>>>>>>>>>> extending this to the row
>>>>>>>>>>>>>>>>>>>>>>>> level would only make the problem worse.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> You can already get this information by projecting
>>>>>>>>>>>>>>>>>>>>>>>> the last updated sequence number, which is reliable, 
>>>>>>>>>>>>>>>>>>>>>>>> and joining to
>>>>>>>>>>>>>>>>>>>>>>>> timestamps from the snapshots metadata table. Of 
>>>>>>>>>>>>>>>>>>>>>>>> course, the drawback there
>>>>>>>>>>>>>>>>>>>>>>>> is losing the timestamp information when snapshots 
>>>>>>>>>>>>>>>>>>>>>>>> expire, but since it
>>>>>>>>>>>>>>>>>>>>>>>> isn't reliable anyway I'd be fine with that.
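>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> For example, that mapping could be built roughly like this with the Java
>>>>>>>>>>>>>>>>>>>>>>>> API (just a sketch; the variable names are illustrative and how the join
>>>>>>>>>>>>>>>>>>>>>>>> is then expressed depends on the engine):
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> // Sketch: map each live snapshot's sequence number to its commit time so
>>>>>>>>>>>>>>>>>>>>>>>> // that _last_updated_sequence_number can be translated to a timestamp.
>>>>>>>>>>>>>>>>>>>>>>>> // Assumes a loaded org.apache.iceberg.Table named "table".
>>>>>>>>>>>>>>>>>>>>>>>> Map<Long, Long> seqToCommitMs = new HashMap<>();
>>>>>>>>>>>>>>>>>>>>>>>> for (Snapshot snap : table.snapshots()) {
>>>>>>>>>>>>>>>>>>>>>>>>   seqToCommitMs.put(snap.sequenceNumber(), snap.timestampMillis());
>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>> // Rows whose sequence number is no longer in the map were last updated by
>>>>>>>>>>>>>>>>>>>>>>>> // an expired snapshot, so their timestamp information is lost.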
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and
>>>>>>>>>>>>>>>>>>>>>>>> compliance, are probably better served by embedding a 
>>>>>>>>>>>>>>>>>>>>>>>> timestamp in DML
>>>>>>>>>>>>>>>>>>>>>>>> (like `current_timestamp`) rather than relying on an 
>>>>>>>>>>>>>>>>>>>>>>>> implicit one from
>>>>>>>>>>>>>>>>>>>>>>>> table metadata. And for those use cases, you could 
>>>>>>>>>>>>>>>>>>>>>>>> also keep a longer
>>>>>>>>>>>>>>>>>>>>>>>> history of snapshot timestamps, like storing a 
>>>>>>>>>>>>>>>>>>>>>>>> catalog's event log for
>>>>>>>>>>>>>>>>>>>>>>>> long-term access to timestamp info. I think that would 
>>>>>>>>>>>>>>>>>>>>>>>> be better than
>>>>>>>>>>>>>>>>>>>>>>>> storing it at the row level.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose
>>>>>>>>>>>>>>>>>>>>>>>>> a row timestamp concept that can help with many use cases like temporal
>>>>>>>>>>>>>>>>>>>>>>>>> queries, latency tracking, TTL, auditing, and compliance.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata
>>>>>>>>>>>>>>>>>>>>>>>>> column behaves very similarly to the
>>>>>>>>>>>>>>>>>>>>>>>>> *_last_updated_sequence_number* column for row lineage.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot
>>>>>>>>>>>>>>>>>>>>>>>>>    timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its values
>>>>>>>>>>>>>>>>>>>>>>>>>    are persisted in the data files.
>>>>>>>>>>>>>>>>>>>>>>>>>
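>>>>>>>>>>>>>>>>>>>>>>>>> For a feel of the inheritance behavior, here is a rough reader-side
>>>>>>>>>>>>>>>>>>>>>>>>> sketch (the method and parameter names are made up for illustration):
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> // Sketch: if the column was not materialized in the data file, inherit
>>>>>>>>>>>>>>>>>>>>>>>>> // the value from the snapshot that added the file; after a rewrite the
>>>>>>>>>>>>>>>>>>>>>>>>> // value is already written out explicitly and is used as-is.
>>>>>>>>>>>>>>>>>>>>>>>>> static long lastUpdatedTimestampMs(Long storedValue, long addedSnapshotTimestampMs) {
>>>>>>>>>>>>>>>>>>>>>>>>>   return storedValue != null ? storedValue : addedSnapshotTimestampMs;
>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>>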
>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
