I think this is the right step forward. Our current "timestamp" definition
is too ambiguous to be useful, so establishing a well-defined and monotonic
timestamp would be really valuable. I also like that rows can know this value
without having to rely on snapshot information, which can be expired.

On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <[email protected]> wrote:

> Hi all,
>
> I have revised the row timestamp proposal with the following changes:
> * a new commit_timestamp field in snapshot metadata with nanosecond
> precision
> * this optional field is only set by the REST catalog server
> * it needs to be monotonic (e.g., implemented using a Lamport-style
> timestamp)
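>
> To illustrate the monotonicity requirement (this is only a sketch of one
> possible scheme, not necessarily what the proposal document specifies),
> the catalog server could assign
>
>     commit_timestamp = max(current wall-clock time in nanos,
>                            previous commit_timestamp + 1)
>
> so the value tracks real time but never moves backwards, even if the
> server clock does.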
>
>
> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>
> Thanks,
> Steven
>
>
> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:
>
>> Thanks for the clarification, Ryan.
>>
>> For long-running streaming jobs that commit periodically, it is difficult
>> to establish a constant value of current_timestamp across all writer
>> tasks for each commit cycle. I guess streaming writers may just have to
>> use the wall-clock time at which a row is appended to a data file as the
>> default value of current_timestamp.
>>
>>
>>
>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:
>>
>>> I don't think that every row would have a different value. That would be
>>> up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP`
>>> into the plan and then replace it with a constant, resulting in a
>>> consistent value for all rows.
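>>>
>>> As a rough illustration of that pattern (engine behavior varies, and the
>>> table/column names and the timestamp literal below are made up), a
>>> statement like
>>>
>>>     INSERT INTO db.events (data) VALUES ('a'), ('b')
>>>
>>> against a table whose created_at column defaults to CURRENT_TIMESTAMP
>>> would effectively be planned as
>>>
>>>     INSERT INTO db.events (data, created_at)
>>>     VALUES ('a', TIMESTAMP '2025-12-12 13:44:00'),
>>>            ('b', TIMESTAMP '2025-12-12 13:44:00')
>>>
>>> with the same constant substituted for every row in the statement.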
>>>
>>> You're right that this would not necessarily be the commit time. But
>>> neither is the commit timestamp from Iceberg's snapshot. I'm not sure how
>>> we are going to define "good enough" for this purpose. I think at least
>>> `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it
>>> is handled in engines. And if you want the Iceberg timestamp, then use a
>>> periodic query of the snapshots table to keep track of them in a table you
>>> can join to. I don't think this rises to the need for a table feature
>>> unless we can guarantee that it is correct.
>>>
>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:
>>>
>>>> > Postgres `current_timestamp` captures the transaction start time [1,
>>>> 2]. Should we extend the same semantic to Iceberg: all rows added in the
>>>> same snapshot should have the same timestamp value?
>>>>
>>>> Let me clarify my last comment.
>>>>
>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)
>>>>
>>>> Since Postgres current_timestamp captures the transaction start time,
>>>> with the column definition above all rows added in the same insert
>>>> transaction would get the same value: the transaction timestamp.
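>>>>
>>>> A minimal Postgres illustration of that behavior (the table name is just
>>>> an example):
>>>>
>>>>     CREATE TABLE events (
>>>>         id bigint,
>>>>         created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP);
>>>>
>>>>     BEGIN;
>>>>     INSERT INTO events (id) VALUES (1);
>>>>     -- even if this runs seconds later, in the same transaction:
>>>>     INSERT INTO events (id) VALUES (2);
>>>>     COMMIT;
>>>>
>>>> Both rows get the same created_at value, the transaction start time,
>>>> because Postgres evaluates current_timestamp once per transaction.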
>>>>
>>>> If we extend similar semantics to Iceberg, should all rows added in the
>>>> same Iceberg transaction/snapshot have the same timestamp?
>>>>
>>>> Ryan, regarding your comment about using the current_timestamp expression
>>>> as the column default value: were you thinking that the engine would set
>>>> the column value to the wall-clock time when appending a row to a data
>>>> file? In that case, almost every row would have a different timestamp
>>>> value.
>>>>
>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> The `current_timestamp` expression may not always carry the right
>>>>> semantics for these use cases. E.g., latency tracking cares about when
>>>>> records are added/committed to the table, not when a record was appended
>>>>> to an uncommitted data file in the processing engine. Record creation
>>>>> and the Iceberg commit can be minutes or even hours apart.
>>>>>
>>>>> A row timestamp inherited from the snapshot timestamp has no overhead at
>>>>> the initial commit and only minimal storage overhead during file
>>>>> rewrites. A per-row current_timestamp would have distinct values for
>>>>> every row and therefore more storage overhead.
>>>>>
>>>>> OLTP databases deal with small row-level transactions. Postgres
>>>>> `current_timestamp` captures the transaction start time [1, 2]. Should we
>>>>> extend the same semantic to Iceberg: all rows added in the same snapshot
>>>>> should have the same timestamp value?
>>>>>
>>>>> [1] https://www.postgresql.org/docs/current/functions-datetime.html
>>>>> [2]
>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>
>>>>>
>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>>>> To support the created_by timestamp, we would need to implement the
>>>>>>> following row lineage behavior
>>>>>>> * Initially, it inherits from the snapshot timestamp
>>>>>>> * during rewrite (like compaction), it should be persisted into data
>>>>>>> files.
>>>>>>> * during update, it needs to be carried over from the previous row.
>>>>>>> This is similar to the row_id carry over for row updates.
>>>>>>
>>>>>>
>>>>>> Sorry for the shorthand.  These are not the same:
>>>>>>
>>>>>> 1.  Insertion time - time the row was inserted.
>>>>>> 2.  Created by - The system that created the record.
>>>>>> 3.  Updated by - The system that last updated the record.
>>>>>>
>>>>>> Depending on the exact use case, these might or might not have utility.
>>>>>> I'm just wondering if there will be more examples like this in the
>>>>>> future.
>>>>>>
>>>>>>
>>>>>>> The created_by column would likely incur significantly higher storage
>>>>>>> overhead than the updated_by column. As rows are updated over time,
>>>>>>> the cardinality of this column in data files can be high, so the
>>>>>>> created_by column may not compress well. This is a similar problem to
>>>>>>> the row_id column: one side effect of enabling row lineage by default
>>>>>>> for V3 tables is the storage overhead of the row_id column after
>>>>>>> compaction, especially for narrow tables with few columns.
>>>>>>
>>>>>>
>>>>>> I agree.  I think this analysis also shows that some consumers of
>>>>>> Iceberg might not necessarily want to have all these columns, so we might
>>>>>> want to make them configurable, rather than mandating them for all 
>>>>>> tables.
>>>>>> Ryan's thought on default values seems like it would solve the issues I 
>>>>>> was
>>>>>> raising.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> > An explicit timestamp column adds more burden to application
>>>>>>> developers. While some databases require an explicit column in the 
>>>>>>> schema,
>>>>>>> those databases provide triggers to auto set the column value. For 
>>>>>>> Iceberg,
>>>>>>> the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>
>>>>>>> Since the use cases don't require an exact timestamp, this seems
>>>>>>> like the best solution to get what people want (an insertion timestamp)
>>>>>>> that has clear and well-defined behavior. Since `current_timestamp` is
>>>>>>> defined by the SQL spec, it makes sense to me that we could use it and 
>>>>>>> have
>>>>>>> reasonable behavior.
>>>>>>>
>>>>>>> I've talked with Anton about this before and maybe he'll jump in on
>>>>>>> this thread. I think that we may need to extend default values to 
>>>>>>> include
>>>>>>> default value expressions, like `current_timestamp`, which is allowed
>>>>>>> by the
>>>>>>> SQL spec. That would solve the problem as well as some others (like
>>>>>>> `current_date` or `current_user`) and would not create a potentially
>>>>>>> misleading (and heavyweight) timestamp feature in the format.
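>>>>>>>
>>>>>>> To sketch what expression defaults might look like (this DDL is
>>>>>>> hypothetical and the syntax is illustrative; today Iceberg's default
>>>>>>> values are limited to literals):
>>>>>>>
>>>>>>>     CREATE TABLE db.audit_log (
>>>>>>>         id BIGINT,
>>>>>>>         payload STRING,
>>>>>>>         created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
>>>>>>>         created_by STRING DEFAULT CURRENT_USER)
>>>>>>>     USING iceberg;
>>>>>>>
>>>>>>> The engine would fold each expression to a constant at write time and
>>>>>>> store it like any other column value, so nothing new would be needed
>>>>>>> in the row-level format.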
>>>>>>>
>>>>>>> > Also some environments may have stronger clock service, like
>>>>>>> Spanner TrueTime service.
>>>>>>>
>>>>>>> Even in cases like this, commit retries can reorder commits and make
>>>>>>> timestamps out of order. I don't think that we should be making 
>>>>>>> guarantees
>>>>>>> or even exposing metadata that people might mistake as having those
>>>>>>> guarantees.
>>>>>>>
>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>
>>>>>>>> Regarding the concern about timestamp reliability, we are not proposing
>>>>>>>> to use timestamps for ordering. With NTP on modern machines, they are
>>>>>>>> generally reliable enough for the intended use cases. Also, some
>>>>>>>> environments may have a stronger clock service, like the Spanner
>>>>>>>> TrueTime service
>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>
>>>>>>>> .
>>>>>>>>
>>>>>>>> >  joining to timestamps from the snapshots metadata table.
>>>>>>>>
>>>>>>>> As you also mentioned, that depends on the snapshot history, which is
>>>>>>>> often retained for only a few days for performance reasons.
>>>>>>>>
>>>>>>>> > embedding a timestamp in DML (like `current_timestamp`) rather
>>>>>>>> than relying on an implicit one from table metadata.
>>>>>>>>
>>>>>>>> An explicit timestamp column adds more burden to application
>>>>>>>> developers. While some databases require an explicit column in the 
>>>>>>>> schema,
>>>>>>>> those databases provide triggers to auto set the column value. For 
>>>>>>>> Iceberg,
>>>>>>>> the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>>
>>>>>>>> Also, a timestamp set during computation (like streaming ingestion or
>>>>>>>> a relatively long batch job) doesn't capture the time the rows/files
>>>>>>>> are actually added to the Iceberg table at commit time.
>>>>>>>>
>>>>>>>> > And for those use cases, you could also keep a longer history of
>>>>>>>> snapshot timestamps, like storing a catalog's event log for long-term
>>>>>>>> access to timestamp info
>>>>>>>>
>>>>>>>> This is not really consumable by joining a regular table query with
>>>>>>>> the catalog event log. I would also imagine the catalog event log is
>>>>>>>> capped at a shorter retention (maybe a few months) than data retention
>>>>>>>> (which could be a few years).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I don't think it is a good idea to expose timestamps at the row
>>>>>>>>> level. The timestamps in metadata that would be carried down to the
>>>>>>>>> row level already confuse people who expect them to be useful or
>>>>>>>>> reliable rather than just debugging information. I think extending
>>>>>>>>> this to the row level would only make the problem worse.
>>>>>>>>>
>>>>>>>>> You can already get this information by projecting the last
>>>>>>>>> updated sequence number, which is reliable, and joining to timestamps 
>>>>>>>>> from
>>>>>>>>> the snapshots metadata table. Of course, the drawback there is losing 
>>>>>>>>> the
>>>>>>>>> timestamp information when snapshots expire, but since it isn't 
>>>>>>>>> reliable
>>>>>>>>> anyway I'd be fine with that.
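>>>>>>>>>
>>>>>>>>> For reference, that join might look roughly like this in Spark SQL
>>>>>>>>> (assuming the engine exposes the _last_updated_sequence_number
>>>>>>>>> metadata column in queries and that the snapshots metadata table
>>>>>>>>> includes the snapshot's sequence number; column availability and
>>>>>>>>> names vary by engine and version):
>>>>>>>>>
>>>>>>>>>     SELECT t.*, s.committed_at AS approx_last_updated_at
>>>>>>>>>     FROM db.tbl t
>>>>>>>>>     JOIN db.tbl.snapshots s
>>>>>>>>>       ON t._last_updated_sequence_number = s.sequence_number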
>>>>>>>>>
>>>>>>>>> Some of the use cases, like auditing and compliance, are probably
>>>>>>>>> better served by embedding a timestamp in DML (like 
>>>>>>>>> `current_timestamp`)
>>>>>>>>> rather than relying on an implicit one from table metadata. And for
>>>>>>>>> those use cases, you could also keep a longer history of snapshot
>>>>>>>>> timestamps, like storing a catalog's event log for long-term access to
>>>>>>>>> timestamp info. I think that would be better than storing it at the 
>>>>>>>>> row
>>>>>>>>> level.
>>>>>>>>>
>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose a row
>>>>>>>>>> timestamp concept that can help with many use cases like temporal
>>>>>>>>>> queries, latency tracking, TTL, auditing, and compliance.
>>>>>>>>>>
>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves very
>>>>>>>>>> similarly to the *_last_updated_sequence_number* column for row
>>>>>>>>>> lineage.
>>>>>>>>>>
>>>>>>>>>>    - Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>    - During rewrite (like compaction), its values are persisted
>>>>>>>>>>    in the data files.
>>>>>>>>>>
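>>>>>>>>>> To make the intended usage concrete, a TTL-style cleanup with the
>>>>>>>>>> proposed column might look roughly like this in Spark SQL (the
>>>>>>>>>> column itself is only proposed here, and whether a metadata column
>>>>>>>>>> can be referenced in a DELETE predicate is an engine detail):
>>>>>>>>>>
>>>>>>>>>>     DELETE FROM db.events
>>>>>>>>>>     WHERE _last_updated_timestamp_ms <
>>>>>>>>>>           unix_millis(current_timestamp() - INTERVAL 90 DAYS)
>>>>>>>>>>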
>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
