I think there is clear consensus that making snapshot timestamps strictly
increasing is a positive thing. I am also +1.

- How will row timestamps allow us to reliably implement incremental
consumption independent of the snapshot retention given that rows can be
added AND removed in a particular time frame? How can we capture all
changes by just looking at the latest snapshot?
- Some use cases in the doc need the insertion time and some need the last
modification time. Do we plan to support both?
- What do we expect the behavior to be in UPDATE and MERGE operations?

To be clear: I am not opposed to this change; I just want to make sure I
understand all the use cases that we aim to address and what would be required
in engines.

On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <[email protected]> wrote:

> +1 for improving how the commit timestamps are assigned monotonically
> since this requirement has emerged over multiple discussions like
> notifications, multi-table transactions, time travel accuracy and row
> timestamps. It would be good to have a single consistent way to represent
> and assign timestamps that could be leveraged across multiple features.
>
> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:
>
>> Yeah, to add my perspective on that discussion, I think my primary
>> concern is that people expect timestamps to be monotonic and if they aren't
>> then a `_last_update_timestamp` field just makes the problem worse. But it
>> is _nice_ to have row-level timestamps. So I would be okay if we revisit
>> how we assign commit timestamps and improve it so that you get monotonic
>> behavior.
>>
>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]> wrote:
>>
>>> We had an offline discussion with Ryan. I revised the proposal as
>>> follows.
>>>
>>> 1. V4 would require writers to generate *monotonic* snapshot
>>> timestamps. The proposal doc has a section that describes a recommended
>>> implementation using Lamport timestamps (see the sketch after this list).
>>> 2. Expose a *last_update_timestamp* metadata column that inherits from
>>> the snapshot timestamp.
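>>>
>>> To make (1) concrete, here is a minimal sketch of the Lamport-style
>>> assignment the doc has in mind (illustrative names only, not the exact
>>> wording from the proposal): take the wall clock, but never go backwards
>>> relative to the parent snapshot's timestamp.
>>>
>>>   // Illustrative sketch: derive the next snapshot timestamp from the
>>>   // wall clock, clamped so it is strictly greater than the parent's
>>>   // timestamp even if the local clock drifted backwards.
>>>   long nextSnapshotTimestampMillis(long parentSnapshotTimestampMillis) {
>>>     long wallClock = System.currentTimeMillis();
>>>     return Math.max(wallClock, parentSnapshotTimestampMillis + 1);
>>>   }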
>>>
>>> This is a relatively low-friction change that can fix the time travel
>>> problem and enable use cases like latency tracking, temporal queries, TTL,
>>> and auditing.
>>>
>>> There is no accuracy requirement on the timestamp values. In practice,
>>> modern servers with NTP have pretty reliable wall clocks. For example, the
>>> Java library implements this validation
>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
>>> that protects snapshot timestamps against backward clock drift of up to one
>>> minute. I don't think we have heard many complaints about commit failures
>>> caused by that clock drift validation.
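>>>
>>> Paraphrasing (not quoting) that check, it is roughly of this shape:
>>>
>>>   // Rough paraphrase of the linked validation, not the exact code:
>>>   // reject a new snapshot whose timestamp is more than one minute
>>>   // older than the table's last-updated timestamp.
>>>   static final long ONE_MINUTE_MS = 60_000L;
>>>
>>>   void validateSnapshotTimestamp(long lastUpdatedMillis, long newSnapshotMillis) {
>>>     if (newSnapshotMillis < lastUpdatedMillis - ONE_MINUTE_MS) {
>>>       throw new IllegalArgumentException(
>>>           "Snapshot timestamp is older than the last update by more than a minute");
>>>     }
>>>   }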
>>>
>>> Would appreciate feedback on the revised proposal.
>>>
>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>>
>>> Thanks,
>>> Steven
>>>
>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <[email protected]>
>>> wrote:
>>>
>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is usually
>>>> evaluated quite early in engines so we could theoretically have another
>>>> expression closer to the commit time. You are right, though, it won't be
>>>> the actual commit time given that we have to write it into the files. Also,
>>>> I don't think generating a timestamp for a row as it is being written is
>>>> going to be beneficial. To sum up, expression-based defaults would allow us
>>>> to capture the time the transaction or write starts, but not the actual
>>>> commit time.
>>>>
>>>> Russell, if the goal is to know what happened to the table in a given
>>>> time frame, isn't the changelog scan the way to go? It would assign commit
>>>> ordinals based on lineage and include row-level diffs. How would you be
>>>> able to determine changes with row timestamps by just looking at the latest
>>>> snapshot?
>>>>
>>>> It does seem promising to make snapshot timestamps strictly increasing
>>>> to avoid ambiguity during time travel.
>>>>
>>>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> > Whether or not "t" is an atomic clock time is not as important as
>>>>> the query between time bounds making sense.
>>>>>
>>>>> I'm not sure I get it then. If we want monotonically increasing times,
>>>>> but they don't have to be real times then how do you know what notion of
>>>>> "time" you care about for these filters? Or to put it another way, how do
>>>>> you know that your "before" and "after" times are reasonable? If the
>>>>> boundaries of these time queries can move around a bit, by how much?
>>>>>
>>>>> It seems to me that row IDs can play an important role here because
>>>>> you have the order guarantee that we seem to want for this use case: if
>>>>> snapshot A was committed before snapshot B, then the rows from A have row
>>>>> IDs that are always less than the row IDs of B. The problem is that we
>>>>> don't know where those row IDs start and end once A and B are no longer
>>>>> tracked. Using a "timestamp" seems to work, but I still worry that without
>>>>> reliable timestamps that correspond with some guarantee to real timestamps,
>>>>> we are creating a feature that seems reliable but isn't.
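>>>>>
>>>>> (To make that ordering guarantee concrete with illustrative numbers: if
>>>>> snapshot A adds 100 rows starting at first-row-id 0, they get row IDs
>>>>> 0..99; a later snapshot B starts at the table's next-row-id, 100, so
>>>>> every row ID from A sorts below every row ID from B.)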
>>>>>
>>>>> I'm somewhat open to the idea of introducing a snapshot timestamp that
>>>>> the catalog guarantees is monotonically increasing. But if we did that,
>>>>> wouldn't we still need to know the association between these timestamps and
>>>>> snapshots after the snapshot metadata expires? My mental model is that this
>>>>> would be used to look for data that arrived, say, 3 weeks ago on Dec 24th.
>>>>> Since the snapshot metadata is no longer around, we could use the row
>>>>> timestamp to find those rows. But how do we know that the snapshot
>>>>> timestamps correspond to the actual timestamp range of Dec 24th? Is it just
>>>>> "close enough" as long as we don't have out-of-order timestamps? This is
>>>>> what I mean by needing to keep track of the association between timestamps
>>>>> and snapshots after the metadata expires. Seems like you either need to
>>>>> keep track of what the catalog's clock was for events you care about, or
>>>>> you don't really care about exact timestamps.
>>>>>
>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> The key goal here is the ability to answer the question "what
>>>>>> happened to the table in some time window (before < t < after)?"
>>>>>> Whether or not "t" is an atomic clock time is not as important as the
>>>>>> query between time bounds making sense.
>>>>>> Downstream applications (from what I know) are mostly sensitive to
>>>>>> getting discrete and well-defined answers to this question, like:
>>>>>>
>>>>>> 1 < t < 2 should be exclusive of
>>>>>> 2 < t < 3 should be exclusive of
>>>>>> 3 < t < 4
>>>>>>
>>>>>> And the union of these should be the same as the query asking for 1 <
>>>>>> t < 4
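>>>>>>
>>>>>> For illustration only (the column name below is the proposed metadata
>>>>>> column, not something that exists today): if each window is treated as
>>>>>> half-open, say (t1, t2], adjacent windows neither overlap nor drop a
>>>>>> commit that lands exactly on a boundary, and an engine-side filter
>>>>>> could be expressed with Iceberg's expression API along these lines:
>>>>>>
>>>>>>   import org.apache.iceberg.expressions.Expression;
>>>>>>   import org.apache.iceberg.expressions.Expressions;
>>>>>>
>>>>>>   // Sketch: rows last updated in the half-open window (t1, t2].
>>>>>>   Expression window(long t1, long t2) {
>>>>>>     return Expressions.and(
>>>>>>         Expressions.greaterThan("_last_updated_timestamp_ms", t1),
>>>>>>         Expressions.lessThanOrEqual("_last_updated_timestamp_ms", t2));
>>>>>>   }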
>>>>>>
>>>>>> Currently this is not possible because we have no guarantee of
>>>>>> ordering in our timestamps:
>>>>>>
>>>>>> Snapshots
>>>>>> A -> B -> C
>>>>>> Sequence numbers
>>>>>> 50 -> 51 -> 52
>>>>>> Timestamps
>>>>>> 3 -> 1 -> 2
>>>>>>
>>>>>> This makes time travel always a little wrong to start with.
>>>>>>
>>>>>> The Java implementation only allows one minute of negative time on
>>>>>> commit, so we actually kind of do have this as a "light monotonicity"
>>>>>> requirement, but as noted above there is no spec requirement for it.
>>>>>> While we do have sequence numbers and row IDs, we still don't have a
>>>>>> stable, engine-independent way of associating these with a consistent
>>>>>> time.
>>>>>>
>>>>>> Ideally we just want to have one consistent way of answering the
>>>>>> question "what did the table look like at time t?", which I think we get
>>>>>> by adding a new field that is a timestamp, set by the catalog close to
>>>>>> commit time, that always goes up.
>>>>>>
>>>>>> I'm not sure we can really do this with an engine expression, since
>>>>>> engines won't know when the data is actually committed at the time they
>>>>>> write the files.
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> This seems like a lot of new complexity in the format. I would like
>>>>>>> us to explore whether we can build the considered use cases on top of
>>>>>>> expression-based defaults instead.
>>>>>>>
>>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar functions
>>>>>>> that are part of the SQL standard definition for default values. This
>>>>>>> would provide us with a way to know the relative row order. True, this
>>>>>>> will usually represent the start of the operation. We may define
>>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual commit time, if
>>>>>>> there are use cases that need that. Plus, we may explore an approach
>>>>>>> similar to MySQL that allows users to reset the default value on update.
>>>>>>>
>>>>>>> - Anton
>>>>>>>
>>>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I think this is the right step forward. Our current "timestamp"
>>>>>>>> definition is too ambiguous to be useful, so establishing a well-defined
>>>>>>>> and monotonic timestamp could be really great. I also like the ability
>>>>>>>> for rows to know this value without having to rely on snapshot
>>>>>>>> information, which can be expired.
>>>>>>>>
>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I have revised the row timestamp proposal with the following
>>>>>>>>> changes:
>>>>>>>>> * a new commit_timestamp field in snapshot metadata that has
>>>>>>>>> nanosecond precision
>>>>>>>>> * this optional field is only set by the REST catalog server
>>>>>>>>> * it needs to be monotonic (e.g., implemented using a Lamport
>>>>>>>>> timestamp)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>>
>>>>>>>>>> For long-running streaming jobs that commit periodically, it is
>>>>>>>>>> difficult to establish a constant value of current_timestamp across all
>>>>>>>>>> writer tasks for each commit cycle. I guess streaming writers may just
>>>>>>>>>> need to write the wall clock time when appending a row to a data file
>>>>>>>>>> for the default value of current_timestamp.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I don't think that every row would have a different value. That
>>>>>>>>>>> would be up to the engine, but I would expect engines to insert
>>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a constant,
>>>>>>>>>>> resulting in a consistent value for all rows.
>>>>>>>>>>>
>>>>>>>>>>> You're right that this would not necessarily be the commit time.
>>>>>>>>>>> But neither is the commit timestamp from Iceberg's snapshot. I'm not
>>>>>>>>>>> sure how we are going to define "good enough" for this purpose. I think
>>>>>>>>>>> at least `CURRENT_TIMESTAMP` has reliable and known behavior when you
>>>>>>>>>>> look at how it is handled in engines. And if you want the Iceberg
>>>>>>>>>>> timestamp, then use a periodic query of the snapshots table to keep
>>>>>>>>>>> track of them in a table you can join to. I don't think this rises to
>>>>>>>>>>> the need for a table feature unless we can guarantee that it is correct.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > Postgres `current_timestamp` captures the transaction start
>>>>>>>>>>>> time [1, 2]. Should we extend the same semantic to Iceberg: all rows
>>>>>>>>>>>> added in the same snapshot should have the same timestamp value?
>>>>>>>>>>>>
>>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>>
>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)
>>>>>>>>>>>>
>>>>>>>>>>>> Since Postgres current_timestamp captures the transaction start
>>>>>>>>>>>> time, all rows added in the same insert transaction would have the same
>>>>>>>>>>>> value as the transaction timestamp with the column definition above.
>>>>>>>>>>>>
>>>>>>>>>>>> If we extend a similar semantic to Iceberg, should all rows added in
>>>>>>>>>>>> the same Iceberg transaction/snapshot have the same timestamp?
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan, regarding your comment about using the current_timestamp
>>>>>>>>>>>> expression as a column default value, you were thinking that the engine
>>>>>>>>>>>> would set the column value to the wall clock time when appending a row
>>>>>>>>>>>> to a data file, right? Almost every row would have a different
>>>>>>>>>>>> timestamp value.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The `current_timestamp` expression may not always carry the right
>>>>>>>>>>>>> semantic for the use cases. E.g., latency tracking is interested in
>>>>>>>>>>>>> when records are added / committed to the table, not when the record
>>>>>>>>>>>>> was appended to an uncommitted data file in the processing engine.
>>>>>>>>>>>>> Record creation and Iceberg commit can be minutes or even hours apart.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A row timestamp inherited from the snapshot timestamp adds no overhead
>>>>>>>>>>>>> at the initial commit and has very minimal storage overhead during
>>>>>>>>>>>>> file rewrite. A per-row current_timestamp would have distinct values
>>>>>>>>>>>>> for every row and thus more storage overhead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> OLTP databases deal with small row-level transactions.
>>>>>>>>>>>>> Postgres `current_timestamp` captures the transaction start time [1, 2].
>>>>>>>>>>>>> Should we extend the same semantic to Iceberg: all rows added in the
>>>>>>>>>>>>> same snapshot should have the same timestamp value?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>>>>>>>>>>>> To support the created_by timestamp, we would need to
>>>>>>>>>>>>>>> implement the following row lineage behavior:
>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>>> * During rewrite (like compaction), it should be persisted into
>>>>>>>>>>>>>>> data files.
>>>>>>>>>>>>>>> * During update, it needs to be carried over from the previous
>>>>>>>>>>>>>>> row. This is similar to the row_id carry-over for row updates.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for the shorthand.  These are not the same:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1.  Insertion time - the time the row was inserted.
>>>>>>>>>>>>>> 2.  Created by - the system that created the record.
>>>>>>>>>>>>>> 3.  Updated by - the system that last updated the record.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Depending on the exact use case, these might or might not have
>>>>>>>>>>>>>> utility.  I'm just wondering if there will be more examples like this
>>>>>>>>>>>>>> in the future.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The created_by column would likely incur significantly higher
>>>>>>>>>>>>>>> storage overhead compared to the updated_by column. As rows are
>>>>>>>>>>>>>>> updated over time, the cardinality for this column in data files can
>>>>>>>>>>>>>>> be high. Hence, the created_by column may not compress well. This is
>>>>>>>>>>>>>>> a similar problem for the row_id column. One side effect of enabling
>>>>>>>>>>>>>>> row lineage by default for V3 tables is the storage overhead of the
>>>>>>>>>>>>>>> row_id column after compaction, especially for narrow tables with few
>>>>>>>>>>>>>>> columns.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree.  I think this analysis also shows that some consumers of
>>>>>>>>>>>>>> Iceberg might not necessarily want to have all these columns, so we
>>>>>>>>>>>>>> might want to make them configurable, rather than mandating them for
>>>>>>>>>>>>>> all tables. Ryan's thought on default values seems like it would solve
>>>>>>>>>>>>>> the issues I was raising.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>> application developers. While some databases require an explicit
>>>>>>>>>>>>>>> column in the schema, those databases provide triggers to auto set
>>>>>>>>>>>>>>> the column value. For Iceberg, the snapshot timestamp is the closest
>>>>>>>>>>>>>>> to the trigger timestamp.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since the use cases don't require an exact timestamp, this seems
>>>>>>>>>>>>>>> like the best solution to get what people want (an insertion
>>>>>>>>>>>>>>> timestamp) that has clear and well-defined behavior. Since
>>>>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it makes sense to me
>>>>>>>>>>>>>>> that we could use it and have reasonable behavior.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've talked with Anton about this before and maybe he'll jump in on
>>>>>>>>>>>>>>> this thread. I think that we may need to extend default values to
>>>>>>>>>>>>>>> include default value expressions, like `current_timestamp`, which is
>>>>>>>>>>>>>>> allowed by the SQL spec. That would solve the problem as well as some
>>>>>>>>>>>>>>> others (like `current_date` or `current_user`) and would not create a
>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp feature in the
>>>>>>>>>>>>>>> format.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > Also some environments may have stronger clock service,
>>>>>>>>>>>>>>> like Spanner TrueTime service.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder commits
>>>>>>>>>>>>>>> and make timestamps out of order. I don't think that we should be
>>>>>>>>>>>>>>> making guarantees or even exposing metadata that people might mistake
>>>>>>>>>>>>>>> as having those guarantees.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regarding the concern about reliable timestamps, we are not
>>>>>>>>>>>>>>>> proposing using timestamps for ordering. With NTP in modern
>>>>>>>>>>>>>>>> computers, they are generally reliable enough for the intended use
>>>>>>>>>>>>>>>> cases. Also, some environments may have a stronger clock service,
>>>>>>>>>>>>>>>> like the Spanner TrueTime service
>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> >  joining to timestamps from the snapshots metadata table.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot history, which
>>>>>>>>>>>>>>>> is often retained for a few days due to performance reasons.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like `current_timestamp`)
>>>>>>>>>>>>>>>> rather than relying on an implicit one from table metadata.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to application
>>>>>>>>>>>>>>>> developers. While some databases require an explicit column in the
>>>>>>>>>>>>>>>> schema, those databases provide triggers to auto set the column
>>>>>>>>>>>>>>>> value. For Iceberg, the snapshot timestamp is the closest to the
>>>>>>>>>>>>>>>> trigger timestamp.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, the timestamp set during computation (like streaming
>>>>>>>>>>>>>>>> ingestion or a relatively long batch computation) doesn't capture
>>>>>>>>>>>>>>>> the time the rows/files are added to the Iceberg table in a batch
>>>>>>>>>>>>>>>> fashion.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a longer
>>>>>>>>>>>>>>>> history of snapshot timestamps, like storing a catalog's event log
>>>>>>>>>>>>>>>> for long-term access to timestamp info
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is not really consumable by joining a regular table query with
>>>>>>>>>>>>>>>> the catalog event log. I would also imagine the catalog event log is
>>>>>>>>>>>>>>>> capped at a shorter retention (maybe a few months) compared to data
>>>>>>>>>>>>>>>> retention (which could be a few years).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at the row
>>>>>>>>>>>>>>>>> level. Timestamps in metadata that would be carried down to the row
>>>>>>>>>>>>>>>>> level already confuse people that expect them to be useful or
>>>>>>>>>>>>>>>>> reliable, rather than for debugging. I think extending this to the
>>>>>>>>>>>>>>>>> row level would only make the problem worse.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You can already get this information by projecting the last updated
>>>>>>>>>>>>>>>>> sequence number, which is reliable, and joining to timestamps from
>>>>>>>>>>>>>>>>> the snapshots metadata table. Of course, the drawback there is
>>>>>>>>>>>>>>>>> losing the timestamp information when snapshots expire, but since
>>>>>>>>>>>>>>>>> it isn't reliable anyway I'd be fine with that.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance, are probably
>>>>>>>>>>>>>>>>> better served by embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit one from
>>>>>>>>>>>>>>>>> table metadata. And for those use cases, you could also keep a
>>>>>>>>>>>>>>>>> longer history of snapshot timestamps, like storing a catalog's
>>>>>>>>>>>>>>>>> event log for long-term access to timestamp info. I think that
>>>>>>>>>>>>>>>>> would be better than storing it at the row level.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose the row
>>>>>>>>>>>>>>>>>> timestamp concept that can help with many use cases like temporal
>>>>>>>>>>>>>>>>>> queries, latency tracking, TTL, auditing, and compliance.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves very
>>>>>>>>>>>>>>>>>> similarly to the *_last_updated_sequence_number* column for row
>>>>>>>>>>>>>>>>>> lineage.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its values are
>>>>>>>>>>>>>>>>>>    persisted in the data files.
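>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> A minimal sketch of the intended inheritance rule (hypothetical
>>>>>>>>>>>>>>>>>> names, not an existing API): a reader materializes the column from
>>>>>>>>>>>>>>>>>> the snapshot timestamp when no value is persisted in the data file,
>>>>>>>>>>>>>>>>>> and keeps the persisted value after a rewrite, mirroring how
>>>>>>>>>>>>>>>>>> _last_updated_sequence_number works.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>   // Hypothetical helper; names are illustrative only.
>>>>>>>>>>>>>>>>>>   long lastUpdatedTimestampMs(Long persistedMs, long snapshotTimestampMs) {
>>>>>>>>>>>>>>>>>>     // null => the row was written by this snapshot and inherits its
>>>>>>>>>>>>>>>>>>     // timestamp; non-null => the value was persisted by a rewrite.
>>>>>>>>>>>>>>>>>>     return persistedMs != null ? persistedMs : snapshotTimestampMs;
>>>>>>>>>>>>>>>>>>   }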
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
