Hey everyone,

I’d like to share a design proposal for a new Iceberg Incremental CDC
Source in Apache Beam:

*Design Doc*:
https://docs.google.com/document/d/1_W6nDpiHKCk2oKrs-ICBn5IEeYm1AXhE0ItaK2-rrng
*Draft PR*: https://github.com/apache/beam/pull/37191

Currently, Beam’s IcebergIO supports streaming reads only for append-only
snapshots. This proposal introduces a native streaming source capable of
processing full CDC events (inserts, updates, and deletes) using Iceberg’s
IncrementalChangelogScan API.

The doc opens with a short introduction to Iceberg CDC, then covers
performance optimizations, specifically using snapshot, partition, and file
metadata to bypass expensive shuffles when possible.

I’d appreciate any feedback or thoughts on this approach!

Thanks,
Ahmed Abualsaud
