Re: [DISCUSS] SPIP: Change Data Capture (CDC) Support

Gengliang Wang Fri, 27 Feb 2026 15:34:18 -0800

@Holden Karau <[email protected]> Thanks for taking a look! I have
actually synced with a few Delta Lake and Iceberg committers offline, and
they were comfortable with the proposed SQL syntax and API. Because this
introduces a new SQL syntax, it won't affect the functionality of the
existing connectors.


Many of the active Delta and Iceberg developers are also on this mailing
list, so I'm hoping we can gather most of the initial feedback right here
in this thread. However, if we need deeper connector-specific alignment as
the discussion evolves, I'm definitely open to cross-posting it to their
respective lists.

On Fri, Feb 27, 2026 at 2:29 PM Holden Karau <[email protected]> wrote:

> This looks cool overall, would it maybe make sense to share with the delta
> lake devs & iceberg devs for their input too? I have not had a chance to
> dig into this closely yet though.
>
> On Fri, Feb 27, 2026 at 1:39 PM Gengliang Wang <[email protected]> wrote:
>
>> Hi Spark devs,
>>
>> It looks like my original email might have landed in some spam folders,
>> so I am just bumping this thread for visibility.
>>
>> For quick reference, here are the links to the proposal again:
>>
>>    - *SPIP Document:*
>>      https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>    - *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668
>>
>> Looking forward to your thoughts and feedback!
>>
>> Thanks,
>>
>> Gengliang
>>
>> On Fri, Feb 27, 2026 at 1:13 PM Szehon Ho <[email protected]>
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> This is a great idea. I look forward to a standard user experience for
>>> CDC across DSv2 data sources, and to centralizing the complicated shared
>>> logic.
>>>
>>> Also, this somehow ended up in my Spam folder :), hope this brings it
>>> out.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Tue, Feb 24, 2026 at 4:37 PM Gengliang Wang <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to open a discussion on a new SPIP to introduce Change Data
>>>> Capture (CDC) support to Apache Spark, targeting the Spark 4.2 release.
>>>>
>>>>    - SPIP Document:
>>>>      https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>    - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>
>>>> Motivation
>>>>
>>>> Currently, querying row-level changes (inserts, updates, deletes) from
>>>> a table requires connector-specific syntax. This fragmentation breaks query
>>>> portability across different storage formats and forces each connector to
>>>> reinvent complex post-processing logic:
>>>>
>>>>    - Delta Lake: Uses table_changes()
>>>>    - Iceberg: Uses .changes virtual tables
>>>>    - Hudi: Relies on custom incremental read options
>>>>
>>>> There is no universal, engine-level standard in Spark to ask "show me
>>>> what changed."
>>>>
>>>> Proposal
>>>>
>>>> This SPIP proposes a unified approach by adding a CHANGES SQL clause
>>>> and corresponding DataFrame/DataStream APIs that work across DSv2
>>>> connectors.
>>>>
>>>> 1. Standardized User API
>>>>
>>>> SQL:
>>>>
>>>> -- Batch: What changed between version 10 and 20?
>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>
>>>> -- Streaming: Continuously process changes
>>>> CREATE STREAMING TABLE cdc_sink AS
>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>
>>>> DataFrame API:
>>>>
>>>> spark.read
>>>>   .option("startingVersion", "10")
>>>>   .option("endingVersion", "20")
>>>>   .changes("my_table")
>>>>
>>>> 2. Engine-Level Post Processing
>>>>
>>>> Under the hood, this proposal introduces a minimal Changelog interface
>>>> for DSv2 connectors. Spark's Catalyst optimizer will take over the CDC
>>>> post-processing, including:
>>>>
>>>>    - Filtering out copy-on-write carry-over rows.
>>>>    - Deriving pre-image/post-image updates from raw insert/delete pairs.
>>>>    - Computing net changes.
>>>>
>>>> This pushes complexity into the engine where it belongs, reducing
>>>> duplicated effort across the ecosystem and ensuring consistent semantics
>>>> for users.
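
To make the three post-processing steps above concrete, here is a minimal plain-Python sketch. It is not the SPIP's design or Spark code: the `Change` record, the `post_process` and `net_changes` helpers, and the rule that a delete and an insert for the same key in the same version form a pair are all assumptions made for illustration. The actual Changelog interface and deduplication semantics are defined in the SPIP document.

```python
from collections import defaultdict
from typing import NamedTuple


class Change(NamedTuple):
    """Hypothetical raw change row, as a connector might emit it."""
    key: int      # row identity (assumed to be available)
    version: int  # table version that produced the change
    kind: str     # "insert" or "delete" in the raw feed
    value: str    # row contents


def post_process(raw):
    """Drop carry-over pairs and relabel delete+insert pairs as update images."""
    groups = defaultdict(list)
    for c in raw:
        groups[(c.version, c.key)].append(c)
    out = []
    for _, rows in sorted(groups.items()):
        deletes = [r for r in rows if r.kind == "delete"]
        inserts = [r for r in rows if r.kind == "insert"]
        if len(deletes) == 1 and len(inserts) == 1:
            if deletes[0].value == inserts[0].value:
                continue  # carry-over: the row was rewritten unchanged
            out.append(deletes[0]._replace(kind="update_preimage"))
            out.append(inserts[0]._replace(kind="update_postimage"))
        else:
            out.extend(deletes + inserts)
    return out


def net_changes(processed):
    """Collapse each key's history (ordered by version) to its net effect."""
    before, after = {}, {}  # key -> value before / after the range (None = absent)
    for c in processed:
        removes = c.kind in ("delete", "update_preimage")
        if c.key not in before:
            before[c.key] = c.value if removes else None
        after[c.key] = None if removes else c.value
    net = []
    for key in sorted(before):
        old, new = before[key], after[key]
        if old == new:
            continue  # no net effect, e.g. inserted and then deleted
        kind = "insert" if old is None else "delete" if new is None else "update"
        net.append((key, kind, old, new))
    return net


raw = [
    Change(1, 11, "insert", "a"),
    Change(2, 11, "insert", "b"),
    # Version 12 rewrites a file: key 1 is carried over, key 2 is updated.
    Change(1, 12, "delete", "a"),
    Change(1, 12, "insert", "a"),
    Change(2, 12, "delete", "b"),
    Change(2, 12, "insert", "B"),
    Change(3, 12, "insert", "c"),
    Change(3, 13, "delete", "c"),
]
processed = post_process(raw)
print([c.kind for c in processed])
print(net_changes(processed))
```

In this toy input the carry-over pair for key 1 at version 12 disappears, the pair for key 2 becomes a pre-image/post-image update, and key 3 has no net change because it is inserted and then deleted within the range.
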
>>>>
>>>> Please review the full SPIP for comprehensive design details, the
>>>> proposed connector API, and deduplication semantics.
>>>>
>>>> Feedback and discussion are highly appreciated!
>>>>
>>>> Thanks,
>>>>
>>>> Gengliang
>>>>
>>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
