hongthang152 opened a new issue, #16342: URL: https://github.com/apache/iceberg/issues/16342
### Query engine

Spark EMR

### Question

Hi folks,

We're running Spark 3.5 + Iceberg 1.6 on a large-scale data pipeline that performs frequent MERGE INTO operations on Iceberg tables. We need to produce a change data capture (CDC) feed: for every merge job, we want to know which rows were inserted, updated, or deleted, so downstream consumers can process only the delta.

## Our constraints

1. **We use Merge-on-Read (MoR)** for write performance reasons. With Copy-on-Write, our Spark plan uses a full outer join, which is suboptimal and frequently causes Spark jobs to time out on our data volumes.
2. **`create_changelog_view` does not support MoR tables.** It only works with Copy-on-Write today.
3. **Switching to CoW just to get `create_changelog_view`** is not viable: beyond the full outer join timeout issue, the write amplification makes it impractical for our merge-heavy workload.

## What we've considered

- **Post-merge changelog via `create_changelog_view`**: blocked by lack of MoR support.
- **Generating CDC at write time** (inside the Spark writer): this works for us today via a patched Iceberg build, but we'd prefer a supported upstream path.
- **Upstream contribution**: we're open to contributing a mechanism that allows customizing or extending the Spark writer (e.g., an SPI/plugin point) so that CDC records can be captured during the merge write path without forking Iceberg.

## Questions

1. Is there a recommended approach for producing CDC output from MERGE operations on MoR tables that we're missing?
2. We're aware there's an open PR to support `create_changelog_view` with Merge-on-Read. Is there a timeline, or are there known blockers, for that work?
3. Would the community be open to a contribution that adds a writer-level extension point (SPI) for capturing row-level changes during merge? We'd keep the CDC logic external; the contribution would just be the hook/interface in the writer.

Any guidance on the preferred direction would be really appreciated.
Happy to provide more details on our workload characteristics if helpful. Thanks!
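
To make question 3 a bit more concrete, here is a rough sketch of the shape of hook we have in mind. It is a toy in-memory merge in plain Python, not Iceberg or Spark code, and every name in it (`ChangeRecord`, `RowChangeListener`, `CollectingListener`, `merge_with_hook`, the `_delete` flag) is hypothetical and used only for illustration. The point is only that the writer calls an external listener once per row-level change, while the CDC logic itself lives outside the writer.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Protocol

Row = Dict[str, Any]


@dataclass(frozen=True)
class ChangeRecord:
    """One row-level change observed during a merge (hypothetical shape)."""
    op: str                 # "INSERT", "UPDATE", or "DELETE"
    key: Any
    before: Optional[Row]   # None for inserts
    after: Optional[Row]    # None for deletes


class RowChangeListener(Protocol):
    """Hypothetical extension point; NOT an existing Iceberg interface."""

    def on_change(self, record: ChangeRecord) -> None: ...


@dataclass
class CollectingListener:
    """External CDC logic: here it simply buffers the records in memory."""
    records: List[ChangeRecord] = field(default_factory=list)

    def on_change(self, record: ChangeRecord) -> None:
        self.records.append(record)


def merge_with_hook(
    target: List[Row],
    source: List[Row],
    key: str,
    listener: RowChangeListener,
    delete_when: Callable[[Row], bool] = lambda row: False,
) -> List[Row]:
    """Toy MERGE keyed on `key`; notifies `listener` of each change it applies."""
    merged = {row[key]: row for row in target}
    for src in source:
        k = src[key]
        old = merged.get(k)
        if old is None:
            # WHEN NOT MATCHED: insert, unless the source row is flagged for delete.
            if not delete_when(src):
                merged[k] = src
                listener.on_change(ChangeRecord("INSERT", k, None, src))
        elif delete_when(src):
            # WHEN MATCHED AND <delete condition>: delete the target row.
            del merged[k]
            listener.on_change(ChangeRecord("DELETE", k, old, None))
        elif old != src:
            # WHEN MATCHED: update only if the row actually changed.
            merged[k] = src
            listener.on_change(ChangeRecord("UPDATE", k, old, src))
    return list(merged.values())


target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
source = [
    {"id": 2, "v": "B"},                   # matches id 2 with a new value
    {"id": 3, "v": "c", "_delete": True},  # matches id 3, flagged for delete
    {"id": 4, "v": "d"},                   # no match, so it is inserted
]

listener = CollectingListener()
result = merge_with_hook(
    target, source, key="id", listener=listener,
    delete_when=lambda row: row.get("_delete", False),
)
for record in listener.records:
    print(record.op, record.key)
```

In a real writer the listener would presumably be invoked per task and would need to respect commit semantics (for example, discarding buffered records if the commit fails); the sketch ignores all of that.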
