> Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?
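For concreteness, here is a minimal sketch of the Postgres behavior referenced above; the table and column names are only for illustration:

    -- Hypothetical table: created_at defaults to CURRENT_TIMESTAMP, which in
    -- Postgres is the start time of the current transaction.
    CREATE TABLE events (
        id         BIGINT PRIMARY KEY,
        payload    TEXT,
        created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
    );

    -- Both rows are inserted in the same transaction, so both end up with the
    -- same created_at value (the transaction start time).
    BEGIN;
    INSERT INTO events (id, payload) VALUES (1, 'a');
    INSERT INTO events (id, payload) VALUES (2, 'b');
    COMMIT;

Under the snapshot-based semantic discussed below, the Iceberg snapshot/commit would play the role of the transaction here: every row added by the same commit would share one timestamp value.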
Let me clarify my last comment. The column definition in question is:

    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP

Since Postgres `current_timestamp` captures the transaction start time, all rows added in the same insert transaction would get the same value (the transaction timestamp) with the column definition above. If we extend a similar semantic to Iceberg, should all rows added in the same Iceberg transaction/snapshot have the same timestamp?

Ryan, to make sure I understand your comment about using a `current_timestamp` expression as the column default value: you were thinking that the engine would set the column value to the wall-clock time when appending a row to a data file, right? Then almost every row would have a different timestamp value.

On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:

> A `current_timestamp` expression may not always carry the right semantic for the use cases. E.g., latency tracking is interested in when records are added / committed to the table, not when the record was appended to an uncommitted data file in the processing engine. Record creation and Iceberg commit can be minutes or even hours apart.
>
> A row timestamp inherited from the snapshot timestamp has no overhead at the initial commit and very minimal storage overhead during file rewrite. A per-row current_timestamp would have distinct values for every row and more storage overhead.
>
> OLTP databases deal with small row-level transactions. Postgres `current_timestamp` captures the transaction start time [1, 2]. Should we extend the same semantic to Iceberg: all rows added in the same snapshot should have the same timestamp value?
>
> [1] https://www.postgresql.org/docs/current/functions-datetime.html
> [2] https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>
>
> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]> wrote:
>
>>
>>> Micah, are 1 and 2 the same? 3 is covered by this proposal. To support the created_by timestamp, we would need to implement the following row lineage behavior:
>>> * Initially, it inherits from the snapshot timestamp.
>>> * During rewrite (like compaction), it should be persisted into data files.
>>> * During update, it needs to be carried over from the previous row. This is similar to the row_id carry-over for row updates.
>>
>>
>> Sorry for the shorthand. These are not the same:
>>
>> 1. Insertion time - the time the row was inserted.
>> 2. Created by - the system that created the record.
>> 3. Updated by - the system that last updated the record.
>>
>> Depending on the exact use case, these might or might not have utility. I'm just wondering if there will be more examples like this in the future.
>>
>>
>>> The created_by column would likely incur significantly higher storage overhead compared to the updated_by column. As rows are updated over time, the cardinality of this column in data files can be high, so the created_by column may not compress well. This is a similar problem for the row_id column. One side effect of enabling row lineage by default for V3 tables is the storage overhead of the row_id column after compaction, especially for narrow tables with few columns.
>>
>>
>> I agree. I think this analysis also shows that some consumers of Iceberg might not necessarily want all of these columns, so we might want to make them configurable rather than mandating them for all tables. Ryan's thought on default values seems like it would solve the issues I was raising.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:
>>
>>> > An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.
>>>
>>> Since the use cases don't require an exact timestamp, this seems like the best solution to get what people want (an insertion timestamp) with clear and well-defined behavior. Since `current_timestamp` is defined by the SQL spec, it makes sense to me that we could use it and have reasonable behavior.
>>>
>>> I've talked with Anton about this before and maybe he'll jump in on this thread. I think that we may need to extend default values to include default value expressions, like the `current_timestamp` that is allowed by the SQL spec. That would solve this problem as well as some others (like `current_date` or `current_user`) and would not create a potentially misleading (and heavyweight) timestamp feature in the format.
>>>
>>> > Also some environments may have a stronger clock service, like the Spanner TrueTime service.
>>>
>>> Even in cases like this, commit retries can reorder commits and make timestamps out of order. I don't think that we should be making guarantees, or even exposing metadata that people might mistake as having those guarantees.
>>>
>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:
>>>
>>>> Ryan, thanks a lot for the feedback!
>>>>
>>>> Regarding the concern about reliable timestamps, we are not proposing to use timestamps for ordering. With NTP in modern computers, they are generally reliable enough for the intended use cases. Also, some environments may have a stronger clock service, like the Spanner TrueTime service <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>
>>>> > joining to timestamps from the snapshots metadata table.
>>>>
>>>> As you also mentioned, it depends on the snapshot history, which is often retained for only a few days due to performance reasons.
>>>>
>>>> > embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata.
>>>>
>>>> An explicit timestamp column adds more burden to application developers. While some databases require an explicit column in the schema, those databases provide triggers to auto-set the column value. For Iceberg, the snapshot timestamp is the closest to the trigger timestamp.
>>>>
>>>> Also, a timestamp set during computation (like streaming ingestion or a relatively long batch computation) doesn't capture the time the rows/files are added to the Iceberg table in a batch fashion.
>>>>
>>>> > And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info.
>>>>
>>>> This is not really consumable by joining a regular table query with the catalog event log. I would also imagine the catalog event log is capped at a shorter retention (maybe a few months) compared to data retention (which could be a few years).
>>>>
>>>>
>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> I don't think it is a good idea to expose timestamps at the row level.
>>>>> Timestamps in metadata that would be carried down to the row level already confuse people that expect them to be useful or reliable, rather than for debugging. I think extending this to the row level would only make the problem worse.
>>>>>
>>>>> You can already get this information by projecting the last updated sequence number, which is reliable, and joining to timestamps from the snapshots metadata table. Of course, the drawback there is losing the timestamp information when snapshots expire, but since it isn't reliable anyway I'd be fine with that.
>>>>>
>>>>> Some of the use cases, like auditing and compliance, are probably better served by embedding a timestamp in DML (like `current_timestamp`) rather than relying on an implicit one from table metadata. And for those use cases, you could also keep a longer history of snapshot timestamps, like storing a catalog's event log for long-term access to timestamp info. I think that would be better than storing it at the row level.
>>>>>
>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> For V4 spec, I have a small proposal [1] to expose the row timestamp concept that can help with many use cases like temporal queries, latency tracking, TTL, auditing and compliance.
>>>>>>
>>>>>> This *_last_updated_timestamp_ms* metadata column behaves very similarly to the *_last_updated_sequence_number* for row lineage.
>>>>>>
>>>>>> - Initially, it inherits from the snapshot timestamp.
>>>>>> - During rewrite (like compaction), its values are persisted in the data files.
>>>>>>
>>>>>> Would love to hear what you think.
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>> [1] https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
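To make the proposal concrete, a hedged sketch of the latency-tracking use case it mentions. The `events` table and its `event_time_ms` column are hypothetical, and how an engine would expose the `_last_updated_timestamp_ms` metadata column in SQL is an assumption, not something defined by the proposal itself:

    -- Hypothetical latency-tracking query. event_time_ms is an application
    -- column holding the record's creation time in epoch millis;
    -- _last_updated_timestamp_ms is the proposed metadata column, inherited
    -- from the committing snapshot's timestamp.
    SELECT avg(_last_updated_timestamp_ms - event_time_ms) AS avg_commit_lag_ms
    FROM db.events
    WHERE _last_updated_timestamp_ms > 1765497600000;  -- e.g., commits since 2025-12-12 UTC

Because the value is inherited from the snapshot, all rows committed together contribute the same commit timestamp to this calculation.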

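For comparison, the existing workaround Ryan describes (projecting the last updated sequence number and joining to the snapshots metadata table) might look roughly like the following. This is only a sketch: whether and under what column name a snapshots metadata table exposes each snapshot's sequence number varies by engine and version, and `id` is a hypothetical data column.

    -- Hedged sketch of the join-based workaround; it only works while the
    -- relevant snapshots are still retained (not yet expired).
    SELECT t.id,
           s.committed_at AS last_updated_at
    FROM db.events AS t
    JOIN db.events.snapshots AS s
      -- ASSUMPTION: the snapshots metadata table exposes the snapshot's data
      -- sequence number; "sequence_number" is a placeholder column name.
      ON t._last_updated_sequence_number = s.sequence_number;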