Hi Spark devs,

It looks like my original email might have landed in some spam folders, so
I am just bumping this thread for visibility.

For quick reference, here are the links to the proposal again:

   - SPIP Document:
     https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
   - JIRA: https://issues.apache.org/jira/browse/SPARK-55668

Looking forward to your thoughts and feedback!

Thanks,

Gengliang

On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]> wrote:

> +1 (non binding)
>
> This is a great idea. Looking forward to a standard user experience for CDC
> across DSv2 data sources, and to centralizing the complicated shared logic.
>
> Also, this somehow showed up in my Spam folder :), hope this reply brings it out.
>
> Thanks
> Szehon
>
> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]> wrote:
>
>> Hi all,
>>
>> I'd like to open a discussion on a new SPIP to introduce Change Data
>> Capture (CDC) support to Apache Spark, targeting the Spark 4.2 release.
>>
>>    - SPIP Document:
>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>    - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>
>> Motivation
>>
>> Currently, querying row-level changes (inserts, updates, deletes) from a
>> table requires connector-specific syntax. This fragmentation breaks query
>> portability across different storage formats and forces each connector to
>> reinvent complex post-processing logic:
>>
>>    - Delta Lake: uses table_changes()
>>    - Iceberg: uses .changes virtual tables
>>    - Hudi: relies on custom incremental read options
>>
>> There is no universal, engine-level standard in Spark to ask "show me
>> what changed."
>>
>> Proposal
>>
>> This SPIP proposes a unified approach by adding a CHANGES SQL clause and
>> corresponding DataFrame/DataStream APIs that work across DSv2 connectors.
>>
>> 1. Standardized User API
>>
>> SQL:
>>
>> -- Batch: what changed between version 10 and 20?
>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>
>> -- Streaming: continuously process changes
>> CREATE STREAMING TABLE cdc_sink AS
>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>
>> DataFrame API:
>>
>> spark.read
>>   .option("startingVersion", "10")
>>   .option("endingVersion", "20")
>>   .changes("my_table")
>>
>> 2. Engine-Level Post-Processing
>>
>> Under the hood, this proposal introduces a minimal Changelog interface for
>> DSv2 connectors. Spark's Catalyst optimizer will take over the CDC
>> post-processing, including:
>>
>>    - Filtering out copy-on-write carry-over rows.
>>    - Deriving pre-image/post-image updates from raw insert/delete pairs.
>>    - Computing net changes.
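>> To make those three steps concrete, here is a small, self-contained Python
>> sketch of the intended semantics. It is an illustration only, not the
>> proposed Changelog interface: the row shape and the change-type names
>> ("insert", "delete", "carry_over", "update_preimage", "update_postimage")
>> are assumptions for this example, and the real logic would live inside
>> Catalyst rather than in user code.

```python
from collections import defaultdict

# Each raw changelog row is (key, value, change_type, commit_version).
# These names are illustrative assumptions, not the SPIP's actual schema.

def filter_carry_over(rows):
    """Step 1: drop copy-on-write carry-over rows, i.e. rows that were
    rewritten unchanged when a connector copied a file to modify others."""
    return [r for r in rows if r[2] != "carry_over"]

def derive_updates(rows):
    """Step 2: within one commit, pair a delete and an insert for the same
    key into update_preimage / update_postimage rows."""
    grouped = defaultdict(list)
    for key, value, ctype, ver in rows:
        grouped[(ver, key)].append((value, ctype))
    out = []
    for (ver, key), entries in sorted(grouped.items()):
        deletes = [v for v, t in entries if t == "delete"]
        inserts = [v for v, t in entries if t == "insert"]
        if deletes and inserts:
            out.append((key, deletes[0], "update_preimage", ver))
            out.append((key, inserts[0], "update_postimage", ver))
        else:
            for v, t in entries:
                out.append((key, v, t, ver))
    return out

def net_changes(rows):
    """Step 3: collapse all changes per key across the version range; the
    last image wins, and keys whose final change is a delete drop out."""
    last = {}
    for key, value, ctype, ver in sorted(rows, key=lambda r: r[3]):
        last[key] = (key, value, ctype, ver)
    return [r for r in last.values() if r[2] != "delete"]

raw = [
    ("a", 1, "insert", 10),      # genuinely new row
    ("b", 2, "carry_over", 11),  # rewritten unchanged: filtered out
    ("c", 3, "delete", 11),      # old image of an update...
    ("c", 4, "insert", 11),      # ...new image: paired into one update
]
changes = derive_updates(filter_carry_over(raw))
# changes == [("a", 1, "insert", 10),
#             ("c", 3, "update_preimage", 11),
#             ("c", 4, "update_postimage", 11)]
```

>> The update derivation keys on (commit, row key): a delete and an insert for
>> the same key in the same commit are reported as one logical update, which is
>> the kind of duplicated post-processing each connector reimplements today.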
>>
>> This pushes complexity into the engine where it belongs, reducing
>> duplicated effort across the ecosystem and ensuring consistent semantics
>> for users.
>>
>> Please review the full SPIP for comprehensive design details, the
>> proposed connector API, and deduplication semantics.
>>
>> Feedback and discussion are highly appreciated!
>>
>> Thanks,
>>
>> Gengliang
>>
>>
