Hi Spark devs, It looks like my original email might have landed in some spam folders, so I am just bumping this thread for visibility.
For quick reference, here are the links to the proposal again: - *SPIP Document:* https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing - *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668 Looking forward to your thoughts and feedback! Thanks, Gengliang On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]> wrote: > +1 (non binding) > > This is a great idea, look forward to a standard user experience for CDC > for DSV2 data source, and centralizing the complicated share logic. > > Also this is somehow shown in my Spam folder :) , hope this brings it out. > > Thanks > Szehon > > On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]> wrote: > >> Hi all, >> >> I'd like to open a discussion on a new SPIP to introduce Change Data >> Capture (CDC) support to Apache Spark, targeting the Spark 4.2 release. >> >> - >> >> SPIP Document: <https://docs.google.com/document/d/> >> >> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing >> - >> >> JIRA: >> >> <https://www.google.com/search?q=https://issues.apache.org/jira/browse/SPARK-> >> https://issues.apache.org/jira/browse/SPARK-55668 >> >> Motivation >> >> Currently, querying row-level changes (inserts, updates, deletes) from a >> table requires connector-specific syntax. This fragmentation breaks query >> portability across different storage formats and forces each connector to >> reinvent complex post-processing logic: >> >> - >> >> Delta Lake: Uses table_changes() >> - >> >> Iceberg: Uses .changes virtual tables >> - >> >> Hudi: Relies on custom incremental read options >> >> There is no universal, engine-level standard in Spark to ask "show me >> what changed." >> Proposal >> >> This SPIP proposes a unified approach by adding a CHANGES SQL clause and >> corresponding DataFrame/DataStream APIs that work across DSv2 connectors. >> >> 1. Standardized User API >> >> SQL: >> >> -- Batch: What changed between version 10 and 20? >> >> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20; >> >> -- Streaming: Continuously process changes >> >> CREATE STREAMING TABLE cdc_sink AS >> >> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0; >> >> DataFrame API: >> >> spark.read >> >> .option("startingVersion", "10") >> >> .option("endingVersion", "20") >> >> .changes("my_table") >> >> 2. Engine-Level Post Processing Under the hood, this proposal introduces >> a minimal Changelog interface for DSv2 connectors. Spark's Catalyst >> optimizer will take over the CDC post-processing, including: >> >> - >> >> Filtering out copy-on-write carry-over rows. >> - >> >> Deriving pre-image/post-image updates from raw insert/delete pairs. >> - >> >> Computing net changes. >> >> This pushes complexity into the engine where it belongs, reducing >> duplicated effort across the ecosystem and ensuring consistent semantics >> for users. >> >> Please review the full SPIP for comprehensive design details, the >> proposed connector API, and deduplication semantics. >> >> Feedback and discussion are highly appreciated! >> >> Thanks, >> >> Gengliang >> >>
