This looks cool overall, would it maybe make sense to share with the delta
lake devs & iceberg devs for their input too? I have not had a chance to
dig into this closely yet though.

On Fri, Feb 27, 2026 at 1:39 PM Gengliang Wang <[email protected]> wrote:

> Hi Spark devs,
>
> It looks like my original email might have landed in some spam folders, so
> I am just bumping this thread for visibility.
>
> For quick reference, here are the links to the proposal again:
>
>    -
>
>    *SPIP Document:*
>    
> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>    -
>
>    *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668
>
> Looking forward to your thoughts and feedback!
>
> Thanks,
>
> Gengliang
>
> On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]> wrote:
>
>> +1 (non binding)
>>
>> This is a great idea, look forward to a standard user experience for CDC
>> for DSV2 data source, and centralizing the complicated share logic.
>>
>> Also this is somehow shown in my Spam folder :) , hope this brings it out.
>>
>> Thanks
>> Szehon
>>
>> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to open a discussion on a new SPIP to introduce Change Data
>>> Capture (CDC) support to Apache Spark, targeting the Spark 4.2 release.
>>>
>>>    -
>>>
>>>    SPIP Document: <https://docs.google.com/document/d/>
>>>    
>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>    -
>>>
>>>    JIRA:
>>>    
>>> <https://www.google.com/search?q=https://issues.apache.org/jira/browse/SPARK->
>>>    https://issues.apache.org/jira/browse/SPARK-55668
>>>
>>> Motivation
>>>
>>> Currently, querying row-level changes (inserts, updates, deletes) from a
>>> table requires connector-specific syntax. This fragmentation breaks query
>>> portability across different storage formats and forces each connector to
>>> reinvent complex post-processing logic:
>>>
>>>    -
>>>
>>>    Delta Lake: Uses table_changes()
>>>    -
>>>
>>>    Iceberg: Uses .changes virtual tables
>>>    -
>>>
>>>    Hudi: Relies on custom incremental read options
>>>
>>> There is no universal, engine-level standard in Spark to ask "show me
>>> what changed."
>>> Proposal
>>>
>>> This SPIP proposes a unified approach by adding a CHANGES SQL clause
>>> and corresponding DataFrame/DataStream APIs that work across DSv2
>>> connectors.
>>>
>>> 1. Standardized User API
>>>
>>> SQL:
>>>
>>> -- Batch: What changed between version 10 and 20?
>>>
>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>
>>> -- Streaming: Continuously process changes
>>>
>>> CREATE STREAMING TABLE cdc_sink AS
>>>
>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>
>>> DataFrame API:
>>>
>>> spark.read
>>>
>>>   .option("startingVersion", "10")
>>>
>>>   .option("endingVersion", "20")
>>>
>>>   .changes("my_table")
>>>
>>> 2. Engine-Level Post Processing Under the hood, this proposal
>>> introduces a minimal Changelog interface for DSv2 connectors. Spark's
>>> Catalyst optimizer will take over the CDC post-processing, including:
>>>
>>>    -
>>>
>>>    Filtering out copy-on-write carry-over rows.
>>>    -
>>>
>>>    Deriving pre-image/post-image updates from raw insert/delete pairs.
>>>    -
>>>
>>>    Computing net changes.
>>>
>>> This pushes complexity into the engine where it belongs, reducing
>>> duplicated effort across the ecosystem and ensuring consistent semantics
>>> for users.
>>>
>>> Please review the full SPIP for comprehensive design details, the
>>> proposed connector API, and deduplication semantics.
>>>
>>> Feedback and discussion are highly appreciated!
>>>
>>> Thanks,
>>>
>>> Gengliang
>>>
>>>

-- 
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

Reply via email to