> Whether or not "t" is an atomic clock time is not as important as the
> query between time bounds making sense.

I'm not sure I get it then. If we want monotonically increasing times, but
they don't have to be real times, then how do you know what notion of "time"
you care about for these filters? Or to put it another way, how do you know
that your "before" and "after" times are reasonable? If the boundaries of
these time queries can move around a bit, by how much?

It seems to me that row IDs can play an important role here because you
have the order guarantee that we seem to want for this use case: if
snapshot A was committed before snapshot B, then the rows from A have row
IDs that are always less than the row IDs of B. The problem is that we
don't know where those row IDs start and end once A and B are no longer
tracked. Using a "timestamp" seems to work, but I still worry that without
reliable timestamps that correspond with some guarantee to real timestamps,
we are creating a feature that seems reliable but isn't.
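
To make that concrete, here is the kind of filter I have in mind, sketched as
SQL (the boundary values are hypothetical first-row-ids that we would need some
way to recover once A and B are no longer tracked):

-- rows added by snapshot A fall in [first-row-id(A), first-row-id(B))
-- when B directly follows A
SELECT *
FROM db.tbl
WHERE _row_id >= 1000   -- hypothetical first-row-id of snapshot A
  AND _row_id <  2500;  -- hypothetical first-row-id of snapshot B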

I'm somewhat open to the idea of introducing a snapshot timestamp that the
catalog guarantees is monotonically increasing. But if we did that,
wouldn't we still need to know the association between these timestamps and
snapshots after the snapshot metadata expires? My mental model is that this
would be used to look for data that arrived, say, 3 weeks ago on Dec 24th.
Since the snapshot metadata is no longer around, we could use the row
timestamp to find those rows. But how do we know that the snapshot
timestamps correspond to the actual timestamp range of Dec 24th? Is it just
"close enough" as long as we don't have out of order timestamps? This is
what I mean by needing to keep track of the association between timestamps
and snapshots after the metadata expires. Seems like you either need to
keep track of what the catalog's clock was for events you care about, or
you don't really care about exact timestamps.
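
Concretely, the query I am picturing would be something like this sketch (using
the _last_updated_timestamp_ms column name from the proposal; depending on how
that column is typed, the bounds might instead need to be epoch millis):

-- find rows committed on Dec 24th, long after those snapshots have expired;
-- only meaningful if the catalog-assigned timestamps track real time closely
SELECT *
FROM db.tbl
WHERE _last_updated_timestamp_ms >= TIMESTAMP '2025-12-24 00:00:00'
  AND _last_updated_timestamp_ms <  TIMESTAMP '2025-12-25 00:00:00';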

On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <[email protected]>
wrote:

> The key goal here is the ability to answer the question "what happened to
> the table in some time window (before < t < after)?"
> Whether or not "t" is an atomic clock time is not as important as the
> query between time bounds making sense.
> Downstream applications (from what I know) are mostly sensitive to getting
> discrete and well defined answers to
> this question like:
>
> 1 < t < 2 should be exclusive of
> 2 < t < 3 should be exclusive of
> 3 < t < 4
>
> And the union of these should be the same as the query asking for 1 < t < 4
>
> Currently this is not possible because we have no guarantee of ordering in
> our timestamps
>
> Snapshots
> A -> B -> C
> Sequence numbers
> 50 -> 51 ->  52
> Timestamp
> 3 -> 1 -> 2
>
> This makes time travel always a little wrong to start with.
>
> The Java implementation only allows one minute of negative time on commit
> so we actually kind of do have this as a
> "light monotonicity" requirement but as noted above there is no spec
> requirement for this.  While we do have sequence
> number and row id, we still don't have a stable way of associating these
> with a consistent time in an engine independent way.
>
> Ideally we just want to have one consistent way of answering the question
> "what did the table look like at time t"
> which I think we get by adding in a new field that is a timestamp, set by
> the Catalog close to commit time,
> that always goes up.
>
> I'm not sure we can really do this with an engine expression, since the
> engine won't know when the data will actually be committed
> at the time it writes the files?
>
>
> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <[email protected]>
> wrote:
>
>> This seems like a lot of new complexity in the format. I would like us to
>> explore whether we can build the considered use cases on top of
>> expression-based defaults instead.
>>
>> We already plan to support CURRENT_TIMESTAMP() and similar functions that
>> are part of the SQL standard definition for default values. This would
>> provide us a way to know the relative row order. True, this usually will
>> represent the start of the operation. We may define COMMIT_TIMESTAMP() or a
>> similar expression for the actual commit time, if there are use cases that
>> need that. Plus, we may explore an approach similar to MySQL that allows
>> users to reset the default value on update.
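>>
>> As a rough sketch of what that could look like (the exact syntax for
>> expression defaults is still to be designed, and the table and column names
>> here are just an example):
>>
>> CREATE TABLE db.events (
>>   id BIGINT,
>>   payload STRING,
>>   -- the engine evaluates the default when the row is written
>>   created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
>> );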
>>
>> - Anton
>>
>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <[email protected]>
>> wrote:
>>
>>> I think this is the right step forward. Our current "timestamp"
>>> definition is too ambiguous to be useful, so establishing
>>> a well-defined and monotonic timestamp could be really great. I also
>>> like the ability for rows to know this value without
>>> having to rely on snapshot information, which can be expired.
>>>
>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have revised the row timestamp proposal with the following changes.
>>>> * a new commit_timestamp field in snapshot metadata that has
>>>> nanosecond precision.
>>>> * this optional field is only set by the REST catalog server
>>>> * it needs to be monotonic (e.g., implemented using a Lamport timestamp)
>>>>
>>>>
>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>>
>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> Thanks for the clarification, Ryan.
>>>>>
>>>>> For long-running streaming jobs that commit periodically, it is
>>>>> difficult to establish a constant value of current_timestamp across all
>>>>> writer tasks for each commit cycle. I guess streaming writers may just need
>>>>> to write the wall clock time when appending a row to a data file as the
>>>>> default value of current_timestamp.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> I don't think that every row would have a different value. That
>>>>>> would be up to the engine, but I would expect engines to insert
>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a constant,
>>>>>> resulting in a consistent value for all rows.
>>>>>>
>>>>>> You're right that this would not necessarily be the commit time. But
>>>>>> neither is the commit timestamp from Iceberg's snapshot. I'm not sure how
>>>>>> we are going to define "good enough" for this purpose. I think at least
>>>>>> `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it
>>>>>> is handled in engines. And if you want the Iceberg timestamp, then use a
>>>>>> periodic query of the snapshots table to keep track of them in a table you
>>>>>> can join to. I don't think this rises to the need for a table feature
>>>>>> unless we can guarantee that it is correct.
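>>>>>>
>>>>>> A rough sketch of that periodic query (table and column names are
>>>>>> illustrative, and whether the snapshots metadata table exposes a usable
>>>>>> join key for the per-row sequence number depends on the engine):
>>>>>>
>>>>>> -- run on a schedule to preserve snapshot commit times past expiration
>>>>>> INSERT INTO db.tbl_snapshot_log
>>>>>> SELECT snapshot_id, committed_at
>>>>>> FROM db.tbl.snapshots
>>>>>> WHERE committed_at > (SELECT COALESCE(MAX(committed_at),
>>>>>>                                       TIMESTAMP '1970-01-01 00:00:00')
>>>>>>                       FROM db.tbl_snapshot_log);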
>>>>>>
>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> > Postgres `current_timestamp` captures the transaction start time
>>>>>>> [1, 2]. Should we extend the same semantic to Iceberg: all rows added in
>>>>>>> the same snapshot should have the same timestamp value?
>>>>>>>
>>>>>>> Let me clarify my last comment.
>>>>>>>
>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)
>>>>>>>
>>>>>>> Since Postgres current_timestamp captures the transaction start
>>>>>>> time, all rows added in the same insert transaction would have the same
>>>>>>> value as the transaction timestamp with the column definition above.
>>>>>>>
>>>>>>> If we extend a similar semantic to Iceberg, all rows added in the
>>>>>>> same Iceberg transaction/snapshot should have the same timestamp?
>>>>>>>
>>>>>>> Ryan, regarding your comment about using the current_timestamp
>>>>>>> expression as a column default value: you were thinking that the engine
>>>>>>> would set the column value to the wall clock time when appending a row to
>>>>>>> a data file, right? In that case, almost every row would have a different
>>>>>>> timestamp value.
>>>>>>>
>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> The `current_timestamp` expression may not always carry the right
>>>>>>>> semantics for the use cases. E.g., latency tracking is interested in when
>>>>>>>> records are added / committed to the table, not when the record was
>>>>>>>> appended to an uncommitted data file in the processing engine.
>>>>>>>> Record creation and Iceberg commit can be minutes or even hours apart.
>>>>>>>>
>>>>>>>> A row timestamp inherited from the snapshot timestamp has no overhead
>>>>>>>> at the initial commit and has very minimal storage overhead during file
>>>>>>>> rewrite. A per-row current_timestamp would have distinct values for every
>>>>>>>> row and has more storage overhead.
>>>>>>>>
>>>>>>>> OLTP databases deal with small row-level transactions. Postgres
>>>>>>>> `current_timestamp` captures the transaction start time [1, 2]. Should we
>>>>>>>> extend the same semantic to Iceberg: all rows added in the same snapshot
>>>>>>>> should have the same timestamp value?
>>>>>>>>
>>>>>>>> [1] https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>> [2]
>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>>>>>>> To support the created_by timestamp, we would need to
>>>>>>>>>> implement the following row lineage behavior
>>>>>>>>>> * Initially, it inherits from the snapshot timestamp
>>>>>>>>>> * during rewrite (like compaction), it should be persisted into
>>>>>>>>>> data files.
>>>>>>>>>> * during update, it needs to be carried over from the previous
>>>>>>>>>> row. This is similar to the row_id carry over for row updates.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sorry for the shorthand.  These are not the same:
>>>>>>>>>
>>>>>>>>> 1.  Insertion time - time the row was inserted.
>>>>>>>>> 2.  Created by - The system that created the record.
>>>>>>>>> 3.  Updated by - The system that last updated the record.
>>>>>>>>>
>>>>>>>>> Depending on the exact use case, these might or might not have
>>>>>>>>> utility.  I'm just wondering if there will be more examples like this
>>>>>>>>> in the future.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> A created_by column would likely incur significantly higher storage
>>>>>>>>>> overhead compared to the updated_by column. As rows are updated over time,
>>>>>>>>>> the cardinality for this column in data files can be high. Hence, the
>>>>>>>>>> created_by column may not compress well. This is a similar problem for the
>>>>>>>>>> row_id column. One side effect of enabling row lineage by default for V3
>>>>>>>>>> tables is the storage overhead of the row_id column after compaction,
>>>>>>>>>> especially for narrow tables with few columns.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I agree.  I think this analysis also shows that some consumers of
>>>>>>>>> Iceberg might not necessarily want to have all these columns, so we might
>>>>>>>>> want to make them configurable, rather than mandating them for all tables.
>>>>>>>>> Ryan's thought on default values seems like it would solve the issues I
>>>>>>>>> was raising.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> > An explicit timestamp column adds more burden to application
>>>>>>>>>> developers. While some databases require an explicit column in the schema,
>>>>>>>>>> those databases provide triggers to auto set the column value. For Iceberg,
>>>>>>>>>> the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>>>>
>>>>>>>>>> Since the use cases don't require an exact timestamp, this seems
>>>>>>>>>> like the best solution to get what people want (an insertion 
>>>>>>>>>> timestamp)
>>>>>>>>>> that has clear and well-defined behavior. Since `current_timestamp` 
>>>>>>>>>> is
>>>>>>>>>> defined by the SQL spec, it makes sense to me that we could use it 
>>>>>>>>>> and have
>>>>>>>>>> reasonable behavior.
>>>>>>>>>>
>>>>>>>>>> I've talked with Anton about this before and maybe he'll jump in
>>>>>>>>>> on this thread. I think that we may need to extend default values to
>>>>>>>>>> include default value expressions, like `current_timestamp` that is 
>>>>>>>>>> allowed
>>>>>>>>>> by the SQL spec. That would solve the problem as well as some others 
>>>>>>>>>> (like
>>>>>>>>>> `current_date` or `current_user`) and would not create a potentially
>>>>>>>>>> misleading (and heavyweight) timestamp feature in the format.
>>>>>>>>>>
>>>>>>>>>> > Also some environments may have stronger clock service, like
>>>>>>>>>> Spanner TrueTime service.
>>>>>>>>>>
>>>>>>>>>> Even in cases like this, commit retries can reorder commits and
>>>>>>>>>> make timestamps out of order. I don't think that we should be making
>>>>>>>>>> guarantees or even exposing metadata that people might mistake as 
>>>>>>>>>> having
>>>>>>>>>> those guarantees.
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>
>>>>>>>>>>> Regarding the concern about reliable timestamps, we are not
>>>>>>>>>>> proposing to use timestamps for ordering. With NTP in modern computers,
>>>>>>>>>>> they are generally reliable enough for the intended use cases. Also, some
>>>>>>>>>>> environments may have a stronger clock service, like the Spanner
>>>>>>>>>>> TrueTime service
>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>>>>>>
>>>>>>>>>>> >  joining to timestamps from the snapshots metadata table.
>>>>>>>>>>>
>>>>>>>>>>> As you also mentioned, it depends on the snapshot history, which
>>>>>>>>>>> is often retained for a few days due to performance reasons.
>>>>>>>>>>>
>>>>>>>>>>> > embedding a timestamp in DML (like `current_timestamp`) rather
>>>>>>>>>>> than relying on an implicit one from table metadata.
>>>>>>>>>>>
>>>>>>>>>>> An explicit timestamp column adds more burden to application
>>>>>>>>>>> developers. While some databases require an explicit column in the schema,
>>>>>>>>>>> those databases provide triggers to auto set the column value. For Iceberg,
>>>>>>>>>>> the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>>>>>
>>>>>>>>>>> Also, the timestamp set during computation (like streaming
>>>>>>>>>>> ingestion or relatively long batch computation) doesn't capture the time
>>>>>>>>>>> the rows/files are added to the Iceberg table in a batch fashion.
>>>>>>>>>>>
>>>>>>>>>>> > And for those use cases, you could also keep a longer history
>>>>>>>>>>> of snapshot timestamps, like storing a catalog's event log for long-term
>>>>>>>>>>> access to timestamp info
>>>>>>>>>>>
>>>>>>>>>>> This is not really consumable by joining the regular table query
>>>>>>>>>>> with the catalog event log. I would also imagine the catalog event log is
>>>>>>>>>>> capped at a shorter retention (maybe a few months) compared to data
>>>>>>>>>>> retention (which could be a few years).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I don't think it is a good idea to expose timestamps at the row
>>>>>>>>>>>> level. Timestamps in metadata that would be carried down to the row level
>>>>>>>>>>>> already confuse people that expect them to be useful or reliable, rather
>>>>>>>>>>>> than for debugging. I think extending this to the row level would only make
>>>>>>>>>>>> the problem worse.
>>>>>>>>>>>>
>>>>>>>>>>>> You can already get this information by projecting the last
>>>>>>>>>>>> updated sequence number, which is reliable, and joining to timestamps from
>>>>>>>>>>>> the snapshots metadata table. Of course, the drawback there is losing the
>>>>>>>>>>>> timestamp information when snapshots expire, but since it isn't reliable
>>>>>>>>>>>> anyway I'd be fine with that.
>>>>>>>>>>>>
>>>>>>>>>>>> Some of the use cases, like auditing and compliance, are
>>>>>>>>>>>> probably better served by embedding a timestamp in DML (like
>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit one from table
>>>>>>>>>>>> metadata. And for those use cases, you could also keep a longer history of
>>>>>>>>>>>> snapshot timestamps, like storing a catalog's event log for long-term
>>>>>>>>>>>> access to timestamp info. I think that would be better than storing it at
>>>>>>>>>>>> the row level.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose the row
>>>>>>>>>>>>> timestamp concept that can help with many use cases like temporal queries,
>>>>>>>>>>>>> latency tracking, TTL, auditing and compliance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves
>>>>>>>>>>>>> very similarly to the *_last_updated_sequence_number* for row
>>>>>>>>>>>>> lineage.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>    - During rewrite (like compaction), its values are
>>>>>>>>>>>>>    persisted in the data files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
