This looks cool overall. Would it make sense to share it with the Delta Lake and Iceberg devs for their input too? I have not had a chance to dig into it closely yet, though.
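For anyone skimming the proposal below: the "derive pre-image/post-image updates from raw insert/delete pairs" step it describes can be sketched in plain Python roughly as follows. This is only my rough reading of the idea, not the SPIP's actual interface; the row shape and the `_change_type` / `_commit_version` column names are assumptions loosely modeled on Delta's Change Data Feed columns.

```python
# Hypothetical sketch of CDC post-processing: within one commit version,
# pair up a raw DELETE and INSERT that share the same key and rewrite them
# as an update_preimage / update_postimage pair; leave unmatched rows as
# plain inserts/deletes. Column names here are assumptions, not the SPIP API.

def derive_updates(changes, key):
    # Group change rows by (key value, commit version).
    by_commit = {}
    for row in changes:
        k = (row[key], row["_commit_version"])
        by_commit.setdefault(k, []).append(row)

    out = []
    for rows in by_commit.values():
        deletes = [r for r in rows if r["_change_type"] == "delete"]
        inserts = [r for r in rows if r["_change_type"] == "insert"]
        if len(deletes) == 1 and len(inserts) == 1:
            # A delete+insert pair on the same key in the same commit is
            # reinterpreted as an update with pre- and post-images.
            out.append(dict(deletes[0], _change_type="update_preimage"))
            out.append(dict(inserts[0], _change_type="update_postimage"))
        else:
            out.extend(rows)
    return out
```

The appeal of the proposal, as I read it, is that this kind of logic would live once in Catalyst rather than being reimplemented per connector.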
On Fri, Feb 27, 2026 at 1:39 PM Gengliang Wang <[email protected]> wrote:

> Hi Spark devs,
>
> It looks like my original email might have landed in some spam folders, so
> I am just bumping this thread for visibility.
>
> For quick reference, here are the links to the proposal again:
>
> - SPIP Document:
>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>
> Looking forward to your thoughts and feedback!
>
> Thanks,
> Gengliang
>
> On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]> wrote:
>
>> +1 (non-binding)
>>
>> This is a great idea. I look forward to a standard user experience for CDC
>> across DSv2 data sources, and to centralizing the complicated shared logic.
>>
>> Also, this somehow landed in my Spam folder :) — hopefully this brings it out.
>>
>> Thanks,
>> Szehon
>>
>> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to open a discussion on a new SPIP to introduce Change Data
>>> Capture (CDC) support to Apache Spark, targeting the Spark 4.2 release.
>>>
>>> - SPIP Document:
>>>   https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>
>>> Motivation
>>>
>>> Currently, querying row-level changes (inserts, updates, deletes) from a
>>> table requires connector-specific syntax. This fragmentation breaks query
>>> portability across different storage formats and forces each connector to
>>> reinvent complex post-processing logic:
>>>
>>> - Delta Lake: uses table_changes()
>>> - Iceberg: uses .changes virtual tables
>>> - Hudi: relies on custom incremental read options
>>>
>>> There is no universal, engine-level standard in Spark to ask "show me
>>> what changed."
>>>
>>> Proposal
>>>
>>> This SPIP proposes a unified approach by adding a CHANGES SQL clause
>>> and corresponding DataFrame/DataStream APIs that work across DSv2
>>> connectors.
>>>
>>> 1. Standardized User API
>>>
>>> SQL:
>>>
>>>   -- Batch: what changed between version 10 and 20?
>>>   SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>
>>>   -- Streaming: continuously process changes
>>>   CREATE STREAMING TABLE cdc_sink AS
>>>   SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>
>>> DataFrame API:
>>>
>>>   spark.read
>>>     .option("startingVersion", "10")
>>>     .option("endingVersion", "20")
>>>     .changes("my_table")
>>>
>>> 2. Engine-Level Post-Processing
>>>
>>> Under the hood, this proposal introduces a minimal Changelog interface
>>> for DSv2 connectors. Spark's Catalyst optimizer will take over the CDC
>>> post-processing, including:
>>>
>>> - Filtering out copy-on-write carry-over rows.
>>> - Deriving pre-image/post-image updates from raw insert/delete pairs.
>>> - Computing net changes.
>>>
>>> This pushes complexity into the engine where it belongs, reducing
>>> duplicated effort across the ecosystem and ensuring consistent semantics
>>> for users.
>>>
>>> Please review the full SPIP for comprehensive design details, the
>>> proposed connector API, and the deduplication semantics.
>>>
>>> Feedback and discussion are highly appreciated!
>>>
>>> Thanks,
>>> Gengliang

--
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
