Hi all,

I have revised the row timestamp proposal with the following changes:

* a new commit_timestamp field in snapshot metadata that has nanosecond precision
* this optional field is only set by the REST catalog server
* it needs to be monotonic (e.g. implemented using a Lamport timestamp; a rough sketch below)
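To make the monotonicity requirement concrete, here is a minimal sketch of one way the catalog server could assign such a timestamp. The class and method names are made up for illustration, and persistence of the last issued value across server restarts is not covered; this is not part of the proposal itself.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a commit timestamp source that never goes backwards,
// even if the server's wall clock does, in the spirit of a Lamport clock
// bounded below by the last value it issued.
class MonotonicCommitTimestampSource {
  private final AtomicLong lastIssuedNanos = new AtomicLong(0L);

  // Returns a nanosecond timestamp strictly greater than any previously issued one.
  long next() {
    long wallClockNanos = System.currentTimeMillis() * 1_000_000L;
    return lastIssuedNanos.updateAndGet(prev -> Math.max(prev + 1, wallClockNanos));
  }
}
```

In a real deployment the last issued value would also have to survive restarts and fail-overs, which is where the Lamport-style "max of local clock and last seen value" logic matters.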
https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh

Thanks,
Steven

On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:

> Thanks for the clarification, Ryan.
>
> For long-running streaming jobs that commit periodically, it is difficult to establish a constant value of current_timestamp across all writer tasks for each commit cycle. I guess streaming writers may just need to write the wall clock time when appending a row to a data file as the default value of current_timestamp.
>
> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:
>
>> I don't think that every row would have a different value. That would be up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP` into the plan and then replace it with a constant, resulting in a consistent value for all rows.
>>
>> You're right that this would not necessarily be the commit time. But neither is the commit timestamp from Iceberg's snapshot. I'm not sure how we are going to define "good enough" for this purpose. I think at least `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it is handled in engines. And if you want the Iceberg timestamp, then use a periodic query of the snapshots table to keep track of them in a table you can join to. I don't think this rises to the level of needing a table feature unless we can guarantee that it is correct.
>>
>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:
>>
>>> > Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?
>>>
>>> Let me clarify my last comment.
>>>
>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
>>>
>>> Since Postgres current_timestamp captures the transaction start time, all rows added in the same insert transaction would have the same value (the transaction timestamp) with the column definition above.
>>>
>>> If we extend a similar semantic to Iceberg, should all rows added in the same Iceberg transaction/snapshot have the same timestamp?
>>>
>>> Ryan, to make sure I understand your comment about using the current_timestamp expression as the column default value: you were thinking that the engine would set the column value to the wall clock time when appending a row to a data file, right? In that case, almost every row would have a different timestamp value.
>>>
>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:
>>>
>>>> The `current_timestamp` expression may not always carry the right semantics for these use cases. E.g., latency tracking is interested in when records are added/committed to the table, not when a record was appended to an uncommitted data file in the processing engine. Record creation and Iceberg commit can be minutes or even hours apart.
>>>>
>>>> A row timestamp inherited from the snapshot timestamp has no overhead for the initial commit and very minimal storage overhead during file rewrite. A per-row current_timestamp would have distinct values for every row and more storage overhead.
>>>>
>>>> OLTP databases deal with small row-level transactions. Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?
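A minimal sketch of the semantics being discussed, assuming an engine evaluates CURRENT_TIMESTAMP once when the write is planned and stamps that single constant into every row of the commit. The class and method names below are made up for illustration; this is not any particular engine's implementation.

```java
import java.time.Instant;
import java.util.List;

// Illustrative only: the default-value expression is evaluated once per
// write, not once per row, so every row added by this commit shares the
// same created_at value -- analogous to Postgres using the transaction
// start time for CURRENT_TIMESTAMP.
class CurrentTimestampDefaultBinding {
  record Row(String payload, Instant createdAt) {}

  static List<Row> withDefaultCreatedAt(List<String> payloads) {
    Instant boundAtPlanningTime = Instant.now(); // bound once for the whole write
    return payloads.stream()
        .map(p -> new Row(p, boundAtPlanningTime))
        .toList();
  }
}
```

Whether that constant should be the planning time, the transaction start time, or the eventual snapshot commit time is exactly the open question in this thread.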
>>>>
>>>> [1] https://www.postgresql.org/docs/current/functions-datetime.html
>>>> [2] https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>
>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]> wrote:
>>>>
>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>>> To support the created_by timestamp, we would need to implement the following row lineage behavior:
>>>>>> * Initially, it inherits from the snapshot timestamp.
>>>>>> * During rewrite (like compaction), it should be persisted into data files.
>>>>>> * During update, it needs to be carried over from the previous row. This is similar to the row_id carry-over for row updates.
>>>>>
>>>>> Sorry for the shorthand. These are not the same:
>>>>>
>>>>> 1. Insertion time - the time the row was inserted.
>>>>> 2. Created by - the system that created the record.
>>>>> 3. Updated by - the system that last updated the record.
>>>>>
>>>>> Depending on the exact use case, these might or might not have utility. I'm just wondering if there will be more examples like this in the future.
>>>>>
>>>>>> The created_by column would likely incur significantly higher storage overhead compared to the updated_by column. As rows are updated over time, the cardinality of this column in data files can be high. Hence, the created_by column may not compress well. This is a similar problem for the row_id column. One side effect of enabling row lineage by default for V3 tables is the storage overhead of the row_id column after compaction, especially for narrow tables with few columns.
>>>>>
>>>>> I agree. I think this analysis also shows that some consumers of Iceberg might not necessarily want all of these columns, so we might want to make them configurable rather than mandating them for all tables. Ryan's thought on default values seems like it would solve the issues I was raising.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> > An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>
>>>>>> Since the use cases don't require an exact timestamp, this seems like the best solution to get what people want (an insertion timestamp) that has clear and well-defined behavior. Since `current_timestamp` is defined by the SQL spec, it makes sense to me that we could use it and have reasonable behavior.
>>>>>>
>>>>>> I've talked with Anton about this before and maybe he'll jump in on this thread. I think that we may need to extend default values to include default value expressions, like `current_timestamp`, which is allowed by the SQL spec. That would solve the problem as well as some others (like `current_date` or `current_user`) and would not create a potentially misleading (and heavyweight) timestamp feature in the format.
>>>>>>
>>>>>> > Also some environments may have stronger clock service, like Spanner TrueTime service.
>>>>>>
>>>>>> Even in cases like this, commit retries can reorder commits and make timestamps out of order. I don't think that we should be making guarantees or even exposing metadata that people might mistake as having those guarantees.
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:
>>>>>>
>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>
>>>>>>> Regarding the concern about reliable timestamps, we are not proposing to use timestamps for ordering. With NTP in modern computers, they are generally reliable enough for the intended use cases. Also, some environments may have a stronger clock service, like the Spanner TrueTime service <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>>
>>>>>>> > joining to timestamps from the snapshots metadata table.
>>>>>>>
>>>>>>> As you also mentioned, it depends on the snapshot history, which is often retained for only a few days for performance reasons.
>>>>>>>
>>>>>>> > embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata.
>>>>>>>
>>>>>>> An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>
>>>>>>> Also, a timestamp set during computation (like streaming ingestion or a relatively long batch computation) doesn't capture the time the rows/files are added to the Iceberg table in a batch fashion.
>>>>>>>
>>>>>>> > And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info
>>>>>>>
>>>>>>> This is not really consumable by joining a regular table query with the catalog event log. I would also imagine the catalog event log is capped at a shorter retention (maybe a few months) compared to data retention (which could be a few years).
>>>>>>>
>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:
>>>>>>>
>>>>>>>> I don't think it is a good idea to expose timestamps at the row level. Timestamps in metadata that would be carried down to the row level already confuse people who expect them to be useful or reliable, rather than for debugging. I think extending this to the row level would only make the problem worse.
>>>>>>>>
>>>>>>>> You can already get this information by projecting the last updated sequence number, which is reliable, and joining to timestamps from the snapshots metadata table. Of course, the drawback there is losing the timestamp information when snapshots expire, but since it isn't reliable anyway I'd be fine with that.
>>>>>>>>
>>>>>>>> Some of the use cases, like auditing and compliance, are probably better served by embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata. And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info. I think that would be better than storing it at the row level.
>>>>>>>>
>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose a row timestamp concept that can help with many use cases like temporal queries, latency tracking, TTL, auditing and compliance.
>>>>>>>>>
>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves very similarly to the *_last_updated_sequence_number* for row lineage.
>>>>>>>>>
>>>>>>>>> - Initially, it inherits from the snapshot timestamp.
>>>>>>>>> - During rewrite (like compaction), its values are persisted in the data files.
>>>>>>>>>
>>>>>>>>> Would love to hear what you think.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>> [1] https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
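A minimal sketch of the inheritance behavior described in the proposal, mirroring how _last_updated_sequence_number works today. The names below are illustrative only, not the actual Iceberg implementation.

```java
// Illustrative only: if a data file has no materialized value for the
// column (the initial commit), a reader fills it in from the timestamp of
// the snapshot that added the file; a rewrite such as compaction then
// persists that inherited value into the rewritten data files.
class RowTimestampInheritance {
  static long lastUpdatedTimestampMs(Long persistedValue, long snapshotTimestampMs) {
    return persistedValue != null ? persistedValue : snapshotTimestampMs;
  }
}
```

Whether an update carries over the old value or takes the new snapshot's timestamp is what distinguishes a created-at column from the proposed last-updated column.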
