Hi all,

I have revised the row timestamp proposal with the following changes:

* a new commit_timestamp field in snapshot metadata that has nanosecond precision
* this optional field is only set by the REST catalog server
* it needs to be monotonic (e.g. implemented using a Lamport timestamp; a rough sketch below)
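To make the monotonicity requirement concrete, here is a minimal sketch of one way the catalog server could assign such a timestamp. The class and method names are made up for illustration, and persistence of the last issued value across server restarts is not covered; this is not part of the proposal itself.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a commit timestamp source that never goes backwards,
// even if the server's wall clock does, in the spirit of a Lamport clock
// bounded below by the last value it issued.
class MonotonicCommitTimestampSource {
  private final AtomicLong lastIssuedNanos = new AtomicLong(0L);

  // Returns a nanosecond timestamp strictly greater than any previously issued one.
  long next() {
    long wallClockNanos = System.currentTimeMillis() * 1_000_000L;
    return lastIssuedNanos.updateAndGet(prev -> Math.max(prev + 1, wallClockNanos));
  }
}
```

In a real deployment the last issued value would also have to survive restarts and fail-overs, which is where the Lamport-style "max of local clock and last seen value" logic matters.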
https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh

Thanks,
Steven

On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <[email protected]> wrote:

> Thanks for the clarification, Ryan.
>
> For long-running streaming jobs that commit periodically, it is difficult to establish a constant value of current_timestamp across all writer tasks for each commit cycle. I guess streaming writers may just need to write the wall clock time when appending a row to a data file as the default value of current_timestamp.
>
> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:
>
>> I don't think that every row would have a different value. That would be up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP` into the plan and then replace it with a constant, resulting in a consistent value for all rows.
>>
>> You're right that this would not necessarily be the commit time. But neither is the commit timestamp from Iceberg's snapshot. I'm not sure how we are going to define "good enough" for this purpose. I think at least `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it is handled in engines. And if you want the Iceberg timestamp, then use a periodic query of the snapshots table to keep track of them in a table you can join to. I don't think this rises to the level of needing a table feature unless we can guarantee that it is correct.
>>
>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:
>>
>>> > Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?
>>>
>>> Let me clarify my last comment.
>>>
>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
>>>
>>> Since Postgres current_timestamp captures the transaction start time, all rows added in the same insert transaction would have the same value (the transaction timestamp) with the column definition above.
>>>
>>> If we extend a similar semantic to Iceberg, should all rows added in the same Iceberg transaction/snapshot have the same timestamp?
>>>
>>> Ryan, to make sure I understand your comment about using the current_timestamp expression as the column default value: you were thinking that the engine would set the column value to the wall clock time when appending a row to a data file, right? In that case, almost every row would have a different timestamp value.
>>>
>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:
>>>
>>>> The `current_timestamp` expression may not always carry the right semantics for these use cases. E.g., latency tracking is interested in when records are added/committed to the table, not when a record was appended to an uncommitted data file in the processing engine. Record creation and Iceberg commit can be minutes or even hours apart.
>>>>
>>>> A row timestamp inherited from the snapshot timestamp has no overhead for the initial commit and very minimal storage overhead during file rewrite. A per-row current_timestamp would have distinct values for every row and more storage overhead.
>>>>
>>>> OLTP databases deal with small row-level transactions. Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?
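A minimal sketch of the semantics being discussed, assuming an engine evaluates CURRENT_TIMESTAMP once when the write is planned and stamps that single constant into every row of the commit. The class and method names below are made up for illustration; this is not any particular engine's implementation.

```java
import java.time.Instant;
import java.util.List;

// Illustrative only: the default-value expression is evaluated once per
// write, not once per row, so every row added by this commit shares the
// same created_at value -- analogous to Postgres using the transaction
// start time for CURRENT_TIMESTAMP.
class CurrentTimestampDefaultBinding {
  record Row(String payload, Instant createdAt) {}

  static List<Row> withDefaultCreatedAt(List<String> payloads) {
    Instant boundAtPlanningTime = Instant.now(); // bound once for the whole write
    return payloads.stream()
        .map(p -> new Row(p, boundAtPlanningTime))
        .toList();
  }
}
```

Whether that constant should be the planning time, the transaction start time, or the eventual snapshot commit time is exactly the open question in this thread.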
>>>>
>>>> [1] https://www.postgresql.org/docs/current/functions-datetime.html
>>>> [2] https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>
>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]> wrote:
>>>>
>>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>>> To support the created_by timestamp, we would need to implement the following row lineage behavior:
>>>>>> * Initially, it inherits from the snapshot timestamp.
>>>>>> * During rewrite (like compaction), it should be persisted into data files.
>>>>>> * During update, it needs to be carried over from the previous row. This is similar to the row_id carry-over for row updates.
>>>>>
>>>>> Sorry for the shorthand. These are not the same:
>>>>>
>>>>> 1. Insertion time - the time the row was inserted.
>>>>> 2. Created by - the system that created the record.
>>>>> 3. Updated by - the system that last updated the record.
>>>>>
>>>>> Depending on the exact use case, these might or might not have utility. I'm just wondering if there will be more examples like this in the future.
>>>>>
>>>>>> The created_by column would likely incur significantly higher storage overhead compared to the updated_by column. As rows are updated over time, the cardinality of this column in data files can be high. Hence, the created_by column may not compress well. This is a similar problem for the row_id column. One side effect of enabling row lineage by default for V3 tables is the storage overhead of the row_id column after compaction, especially for narrow tables with few columns.
>>>>>
>>>>> I agree. I think this analysis also shows that some consumers of Iceberg might not necessarily want all of these columns, so we might want to make them configurable rather than mandating them for all tables. Ryan's thought on default values seems like it would solve the issues I was raising.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> > An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>
>>>>>> Since the use cases don't require an exact timestamp, this seems like the best solution to get what people want (an insertion timestamp) that has clear and well-defined behavior. Since `current_timestamp` is defined by the SQL spec, it makes sense to me that we could use it and have reasonable behavior.
>>>>>>
>>>>>> I've talked with Anton about this before and maybe he'll jump in on this thread. I think that we may need to extend default values to include default value expressions, like `current_timestamp`, which is allowed by the SQL spec. That would solve the problem as well as some others (like `current_date` or `current_user`) and would not create a potentially misleading (and heavyweight) timestamp feature in the format.
>>>>>>
>>>>>> > Also some environments may have stronger clock service, like Spanner TrueTime service.
>>>>>>
>>>>>> Even in cases like this, commit retries can reorder commits and make timestamps out of order. I don't think that we should be making guarantees or even exposing metadata that people might mistake as having those guarantees.
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:
>>>>>>
>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>
>>>>>>> Regarding the concern about reliable timestamps, we are not proposing to use timestamps for ordering. With NTP in modern computers, they are generally reliable enough for the intended use cases. Also, some environments may have a stronger clock service, like the Spanner TrueTime service <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>>
>>>>>>> > joining to timestamps from the snapshots metadata table.
>>>>>>>
>>>>>>> As you also mentioned, it depends on the snapshot history, which is often retained for only a few days for performance reasons.
>>>>>>>
>>>>>>> > embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata.
>>>>>>>
>>>>>>> An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.
>>>>>>>
>>>>>>> Also, a timestamp set during computation (like streaming ingestion or a relatively long batch computation) doesn't capture the time the rows/files are added to the Iceberg table in a batch fashion.
>>>>>>>
>>>>>>> > And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info
>>>>>>>
>>>>>>> This is not really consumable by joining a regular table query with the catalog event log. I would also imagine the catalog event log is capped at a shorter retention (maybe a few months) compared to data retention (which could be a few years).
>>>>>>>
>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:
>>>>>>>
>>>>>>>> I don't think it is a good idea to expose timestamps at the row level. Timestamps in metadata that would be carried down to the row level already confuse people who expect them to be useful or reliable, rather than for debugging. I think extending this to the row level would only make the problem worse.
>>>>>>>>
>>>>>>>> You can already get this information by projecting the last updated sequence number, which is reliable, and joining to timestamps from the snapshots metadata table. Of course, the drawback there is losing the timestamp information when snapshots expire, but since it isn't reliable anyway I'd be fine with that.
>>>>>>>>
>>>>>>>> Some of the use cases, like auditing and compliance, are probably better served by embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata. And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info. I think that would be better than storing it at the row level.
>>>>>>>>
>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> For the V4 spec, I have a small proposal [1] to expose a row timestamp concept that can help with many use cases like temporal queries, latency tracking, TTL, auditing and compliance.
>>>>>>>>>
>>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves very similarly to the *_last_updated_sequence_number* for row lineage.
>>>>>>>>>
>>>>>>>>> - Initially, it inherits from the snapshot timestamp.
>>>>>>>>> - During rewrite (like compaction), its values are persisted in the data files.
>>>>>>>>>
>>>>>>>>> Would love to hear what you think.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>> [1] https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
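A minimal sketch of the inheritance behavior described in the proposal, mirroring how _last_updated_sequence_number works today. The names below are illustrative only, not the actual Iceberg implementation.

```java
// Illustrative only: if a data file has no materialized value for the
// column (the initial commit), a reader fills it in from the timestamp of
// the snapshot that added the file; a rewrite such as compaction then
// persists that inherited value into the rewritten data files.
class RowTimestampInheritance {
  static long lastUpdatedTimestampMs(Long persistedValue, long snapshotTimestampMs) {
    return persistedValue != null ? persistedValue : snapshotTimestampMs;
  }
}
```

Whether an update carries over the old value or takes the new snapshot's timestamp is what distinguishes a created-at column from the proposed last-updated column.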
