Hi All,

We are exploring CDC ingestion from databases into Iceberg. Some databases,
such as TiDB, support row-level TTL, and there seem to be the following ways
to handle it:
1. Emit an explicit CDC event for the row as soon as the TTL window expires
2. Handle TTL at the reader layer by filtering expired rows in the Iceberg
reader
3. Delegate the responsibility of filtering out expired rows to the end user
(periodically clean up expired rows from the Iceberg snapshot)
4. Provide a view to the user with a filter added to remove expired rows
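
To make options 2 and 4 concrete, here is a minimal Python sketch of the
read-side filter. It assumes the row's expiry time is ingested alongside the
row in a column I am calling `expires_at`; that column name, the
`filter_expired` helper, and the `Key2` sample row are hypothetical and not
part of any Iceberg or TiDB API. In practice the same predicate could live in
a view definition or in a reader-side filter.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def filter_expired(rows: List[Row], now: int,
                   expiry_column: str = "expires_at") -> List[Row]:
    """Return only the rows that have not expired as of `now`."""
    live = []
    for row in rows:
        expires_at: Optional[int] = row.get(expiry_column)
        # Rows without an expiry never expire; others are live strictly
        # before their expiry time.
        if expires_at is None or now < expires_at:
            live.append(row)
    return live


# Hypothetical sample data: Key1 expires at time 100, Key2 has no TTL.
rows = [
    {"key": "Key1", "value": "value1", "expires_at": 100},
    {"key": "Key2", "value": "other", "expires_at": None},
]
live_keys = [r["key"] for r in filter_expired(rows, now=105)]
```

This only works if the expiry time (or enough information to derive it) is
carried in the CDC payload, which is itself an assumption about the source.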

It seems that databases like TiDB clean up expired rows periodically via a GC
process and emit the corresponding CDC events only after the GC runs, not
immediately at TTL expiry. So a CDC event may never be emitted in the
following case:
1. current time: 10, key: Key1, value: value1, TTL: 100
2. current time: 110, key: Key1, value: value2, TTL: 200 (TTL updated to
200, value also updated)
3. current time: 150, the GC process that cleans up expired rows runs and
emits CDC events for the rows that are expired at that point

Between time 100 and 110 the record should not be visible, since it had
expired. But because no CDC event was emitted for that expiry, Iceberg will
show the record as live between 100 and 110 as well.
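
The scenario above can be replayed with a small Python sketch. It treats the
TTL values as absolute expiry times (so TTL 100 means "expires at time 100"),
which is my reading of the example; the `Upsert` type and the helper
functions are hypothetical and only model the timeline, not TiDB's actual GC
or any Iceberg reader.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Upsert:
    ts: int          # time at which the write reached the source
    key: str
    value: str
    expires_at: int  # absolute expiry time (assumed meaning of "TTL")


events = [
    Upsert(10, "Key1", "value1", 100),
    Upsert(110, "Key1", "value2", 200),
]


def latest_version(events: List[Upsert], key: str, as_of: int) -> Optional[Upsert]:
    """Most recent write for `key` at or before `as_of`, if any."""
    candidates = [e for e in events if e.key == key and e.ts <= as_of]
    return max(candidates, key=lambda e: e.ts) if candidates else None


def gc_deletes(events: List[Upsert], gc_time: int) -> List[str]:
    """Keys whose latest version is expired when the GC runs."""
    keys = {e.key for e in events if e.ts <= gc_time}
    return sorted(
        k for k in keys if latest_version(events, k, gc_time).expires_at <= gc_time
    )


def visible_cdc_only(events: List[Upsert], key: str, as_of: int) -> bool:
    """Visibility if the table only reflects the CDC events it received."""
    return latest_version(events, key, as_of) is not None


def visible_ttl_aware(events: List[Upsert], key: str, as_of: int) -> bool:
    """Visibility if the reader also checks the expiry time."""
    version = latest_version(events, key, as_of)
    return version is not None and as_of < version.expires_at


deletes_at_150 = gc_deletes(events, gc_time=150)
gap_cdc_only = visible_cdc_only(events, "Key1", as_of=105)
gap_ttl_aware = visible_ttl_aware(events, "Key1", as_of=105)
```

Under these assumptions the GC at time 150 has nothing to delete for Key1,
because the row was rewritten with a later expiry at time 110. A time-travel
read at time 105 therefore differs between the two approaches: the CDC-only
view shows the row, while the TTL-aware view hides it.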

I wanted to know how other folks handle this, and whether there is any
recommendation for handling TTL records in Iceberg CDC ingestion.

Regards,
Aditya
