> Other common audit fields that some people might want without
keeping snapshot history:
1.  Insertion time
2.  Created by
3.  Updated by

Micah, are 1 and 2 the same? 3 is covered by this proposal.

To support the created_by timestamp, we would need to implement the
following row lineage behavior:
* Initially, it inherits from the snapshot timestamp.
* During rewrite (like compaction), it is persisted into the data files.
* During an update, it needs to be carried over from the previous row. This
is similar to the row_id carry-over for row updates.
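The inheritance/carry-over semantics above can be sketched as follows. This is a minimal, hypothetical model for discussion, not the Iceberg implementation; all names are illustrative.

```python
# Hypothetical sketch of created-timestamp lineage semantics (illustrative
# names; not the Iceberg implementation). The value is NULL in data files
# until materialized; readers fall back to the committing snapshot's
# timestamp, mirroring how row_id inheritance works.

def read_created_ts(row, snapshot_ts_ms):
    """Initially, the value inherits from the snapshot timestamp."""
    stored = row.get("created_ts")
    return stored if stored is not None else snapshot_ts_ms

def rewrite(row, snapshot_ts_ms):
    """During rewrite (e.g. compaction), persist the value into the data file."""
    row = dict(row)
    row["created_ts"] = read_created_ts(row, snapshot_ts_ms)
    return row

def update(old_row, new_values, old_snapshot_ts_ms):
    """During an update, carry the value over from the previous row version,
    similar to row_id carry-over."""
    new_row = dict(new_values)
    new_row["created_ts"] = read_created_ts(old_row, old_snapshot_ts_ms)
    return new_row

row = {"id": 1, "v": "a"}                      # written by a snapshot at t=1000
assert read_created_ts(row, 1000) == 1000      # inherited, not stored
row = rewrite(row, 1000)                       # compaction persists the value
updated = update(row, {"id": 1, "v": "b"}, 1000)
assert updated["created_ts"] == 1000           # carried over across the update
```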

The created_by column would likely incur significantly higher storage
overhead than the updated_by column. As rows are updated over time, the
cardinality of this column within a data file can become high, so the
created_by column may not compress well. The row_id column has a similar
problem: one side effect of enabling row lineage by default for V3
tables is the storage overhead of the row_id column after compaction,
especially for narrow tables with few columns.

The updated_by column has very few distinct values in a data file and
compresses very well in columnar formats. Its storage overhead is
minimal, just like the last_updated_sequence_number column.
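The cardinality argument can be illustrated roughly with a quick experiment. This uses zlib purely as a stand-in for columnar compression (real formats like Parquet would use dictionary/RLE encoding, but the cardinality effect is the same); the timestamps are made up.

```python
import zlib

# Rough illustration of the cardinality effect (zlib stands in for columnar
# compression). A created-style column tends toward one distinct timestamp
# per row; an updated-style column repeats the few snapshot timestamps
# present in a given data file.

n_rows = 100_000
base_ts = 1_733_700_000_000  # arbitrary epoch-millis value

# High cardinality: every row keeps its own creation timestamp.
created_col = b"".join((base_ts + i).to_bytes(8, "little") for i in range(n_rows))

# Low cardinality: all rows in the file share one last-updated snapshot timestamp.
updated_col = base_ts.to_bytes(8, "little") * n_rows

created_size = len(zlib.compress(created_col))
updated_size = len(zlib.compress(updated_col))

print(created_size, updated_size)
# The low-cardinality column compresses far better.
assert updated_size < created_size
```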

The updated_by column is probably the more useful row metadata for the
mentioned use cases: it captures when a row was added to or updated in the
table, and it has minimal computation and storage overhead.

> This part of the discussion reminds me of my proposal a while back to
keep expired snapshot metadata.
> Is there also a concern about storage, by adding so many metadata fields
to each row?  It does seem that if we can calculate it by existing fields
like last_updated_sequence_number and joining with catalog information, it
seems better for the spec.

Szehon, thanks a lot for bringing up the previous proposal. Yes, it can
help retain the expired snapshots in separate storage (like the catalog).
If an Iceberg table has streaming ingestion that commits every minute, two
years of history would accumulate ~1M snapshots (60 × 24 × 365 × 2 ≈
1.05M). They would likely need to be persisted into blob storage,
compacted, and maintained.

With separate catalog storage for the expired snapshots, how would users
consume the information? Are we thinking about a new metadata table like
$EXPIRED_SNAPSHOTS? A built-in metadata column (like
_last_updated_sequence_number or _last_updated_timestamp_ms) is much
easier to consume.
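To make the consumption gap concrete, here is a small simulation of the join-based workaround, using sqlite3 as a stand-in engine; the table and column names are illustrative, not actual Iceberg metadata tables. Once a snapshot expires, the join can no longer recover that row's timestamp, whereas a built-in column would keep it.

```python
import sqlite3

# Sketch (assumption: sqlite3 stands in for a query engine; names are
# illustrative). Today a row's timestamp can be recovered by projecting
# last_updated_sequence_number and joining to snapshot metadata, but only
# while the snapshot is retained.

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER, last_updated_sequence_number INTEGER);
CREATE TABLE snapshots (sequence_number INTEGER, committed_at_ms INTEGER);
INSERT INTO events VALUES (1, 10), (2, 11);
INSERT INTO snapshots VALUES (10, 1733700000000);  -- snapshot 11 already expired
""")

rows = conn.execute("""
    SELECT e.id, s.committed_at_ms
    FROM events e
    LEFT JOIN snapshots s ON s.sequence_number = e.last_updated_sequence_number
    ORDER BY e.id
""").fetchall()

print(rows)
# Row 2 has lost its timestamp because its snapshot expired.
assert rows == [(1, 1733700000000), (2, None)]
```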

For the last_updated_timestamp_ms column in this proposal, the storage
overhead is minimal, as described above.

On Thu, Dec 11, 2025 at 10:13 AM Szehon Ho <[email protected]> wrote:

> this is not really consumable by joining the regular table query with
>> catalog event log. I would also imagine catalog event log is capped at
>> shorter retention (maybe a few months) compared to data retention (could be
>> a few years).
>
>
> This part of the discussion reminds me of my proposal a while back to
> keep expired snapshot metadata:
> https://lists.apache.org/thread/l9m0mp44byx9kzzzmolxnrdqlzbympb8 I
> think it would have been nice to have this formally in Iceberg (a list of
> expired snapshots) but the discussion concluded that it's probably easier
> for the catalog to do.  Maybe we could formalize it in the catalog spec
> that allows for optional list of expired snapshots?
>
> Is there also a concern about storage, by adding so many metadata fields
> to each row?  It does seem that if we can calculate it by existing fields
> like last_updated_sequence_number and joining with catalog information, it
> seems better for the spec.
>
> Thanks
> Szehon
>
> On Thu, Dec 11, 2025 at 10:02 AM Micah Kornfield <[email protected]>
> wrote:
>
>> An explicit timestamp column adds more burden to application developers.
>>> While some databases require an explicit column in the schema, those
>>> databases provide triggers to auto set the column value. For Iceberg, the
>>> snapshot timestamp is the closest to the trigger timestamp.
>>
>>
>> I wonder if we should be looking at maybe generalizing the audit column
>> in Iceberg and letting this be configured at a table level.  Other common
>> audit fields that some people might want without keeping snapshot
>> history:
>>
>> 1.  Insertion time
>> 2.  Created by
>> 3.  Updated by
>>
>>
>>
>>
>> On Tue, Dec 9, 2025 at 2:23 PM Steven Wu <[email protected]> wrote:
>>
>>> Ryan, thanks a lot for the feedback!
>>>
>>> Regarding the concern for reliable timestamps, we are not proposing
>>> using timestamps for ordering. With NTP in modern computers, they are
>>> generally reliable enough for the intended use cases. Also some
>>> environments may have stronger clock service, like Spanner TrueTime
>>> service
>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>
>>> .
>>>
>>> >  joining to timestamps from the snapshots metadata table.
>>>
>>> As you also mentioned, it depends on the snapshot history, which is
>>> often retained for a few days due to performance reasons.
>>>
>>> > embedding a timestamp in DML (like `current_timestamp`) rather than
>>> relying on an implicit one from table metadata.
>>>
>>> An explicit timestamp column adds more burden to application developers.
>>> While some databases require an explicit column in the schema, those
>>> databases provide triggers to auto set the column value. For Iceberg, the
>>> snapshot timestamp is the closest to the trigger timestamp.
>>>
>>> Also, the timestamp set during computation (like streaming ingestion or a
>>> relatively long batch computation) doesn't capture the time the rows/files
>>> are added to the Iceberg table in a batch fashion.
>>>
>>> > And for those use cases, you could also keep a longer history of
>>> snapshot timestamps, like storing a catalog's event log for long-term
>>> access to timestamp info
>>>
>>> this is not really consumable by joining the regular table query with
>>> catalog event log. I would also imagine catalog event log is capped at
>>> shorter retention (maybe a few months) compared to data retention (could be
>>> a few years).
>>>
>>>
>>>
>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:
>>>
>>>> I don't think it is a good idea to expose timestamps at the row level.
>>>> Timestamps in metadata that would be carried down to the row level already
>>>> confuse people that expect them to be useful or reliable, rather than for
>>>> debugging. I think extending this to the row level would only make the
>>>> problem worse.
>>>>
>>>> You can already get this information by projecting the last updated
>>>> sequence number, which is reliable, and joining to timestamps from the
>>>> snapshots metadata table. Of course, the drawback there is losing the
>>>> timestamp information when snapshots expire, but since it isn't reliable
>>>> anyway I'd be fine with that.
>>>>
>>>> Some of the use cases, like auditing and compliance, are probably
>>>> better served by embedding a timestamp in DML (like `current_timestamp`)
>>>> rather than relying on an implicit one from table metadata. And for
>>>> those use cases, you could also keep a longer history of snapshot
>>>> timestamps, like storing a catalog's event log for long-term access to
>>>> timestamp info. I think that would be better than storing it at the row
>>>> level.
>>>>
>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> For V4 spec, I have a small proposal [1] to expose the row timestamp
>>>>> concept that can help with many use cases like temporal queries, latency
>>>>> tracking, TTL, auditing and compliance.
>>>>>
>>>>> This *_last_updated_timestamp_ms* metadata column behaves very
>>>>> similarly to the *_last_updated_sequence_number* for row lineage.
>>>>>
>>>>>    - Initially, it inherits from the snapshot timestamp.
>>>>>    - During rewrite (like compaction), its values are persisted in
>>>>>    the data files.
>>>>>
>>>>> Would love to hear what you think.
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>
>>>>>
>>>>>