Hi Viquar,

Thanks for the detailed review — all three concerns are already accounted
for in the current SPIP design (Appendix B.2 and B.6).

1. Capability Pushdown: The Changelog interface already exposes declarative
capability methods — containsCarryoverRows(), containsIntermediateChanges(),
and representsUpdateAsDeleteAndInsert(). The ResolveChangelogTable rule only
injects post-processing when the connector declares it is needed. If Delta
Lake already materializes pre/post-images natively, it returns
representsUpdateAsDeleteAndInsert() = false and Spark skips that work
entirely. Catalyst never reconstructs what the storage layer already
provides.

2. CoW I/O Bottlenecks: Carry-over removal is already gated on
containsCarryoverRows() = true. If a connector eliminates carry-over rows
at the scan level, it returns false and Spark does nothing. The connector
also retains full control over scan planning via its ScanBuilder, so I/O
optimization stays in the storage layer.

3. Audit Fidelity: The deduplicationMode option already supports none,
dropCarryovers, and netChanges. Setting deduplicationMode = 'none' returns
the raw, unmodified change stream with every intermediate state preserved.
Net change collapsing happens only when explicitly requested by the user.
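To make the gating concrete, here is a minimal standalone sketch — not the actual SPIP code. The three capability method names come from the design doc; everything else (the DeltaLikeChangelog connector, the postProcessingSteps helper, the step labels) is invented purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the capability methods described in Appendix B.2,
// with a toy "planner" that injects a post-processing step only when the
// connector declares it is required.
interface Changelog {
    // true if the scan may emit unchanged copy-on-write carry-over rows
    default boolean containsCarryoverRows() { return true; }
    // true if the scan may emit intermediate (non-net) changes
    default boolean containsIntermediateChanges() { return true; }
    // true if updates arrive as separate delete+insert rows that Spark
    // must pair back into pre/post-image updates
    default boolean representsUpdateAsDeleteAndInsert() { return true; }
}

// A connector (Delta-like, purely illustrative) that already materializes
// pre/post-images and removes carry-over rows at scan time opts out of
// both rewrites by declaring the capabilities false.
final class DeltaLikeChangelog implements Changelog {
    @Override public boolean containsCarryoverRows() { return false; }
    @Override public boolean representsUpdateAsDeleteAndInsert() { return false; }
}

final class ResolveChangelogDemo {
    // Mirrors the gating described above: each step is added only when
    // the connector's declared capabilities say Spark must do the work.
    static List<String> postProcessingSteps(Changelog c) {
        List<String> steps = new ArrayList<>();
        if (c.containsCarryoverRows()) steps.add("drop-carryover-rows");
        if (c.representsUpdateAsDeleteAndInsert()) steps.add("pair-update-images");
        return steps;
    }

    public static void main(String[] args) {
        // Connector handles everything natively: Catalyst adds no steps.
        System.out.println(postProcessingSteps(new DeltaLikeChangelog()));
        // Default capabilities: Catalyst injects both rewrites.
        System.out.println(postProcessingSteps(new Changelog() {}));
    }
}
```

And on point 3, per the SPIP, a reader that needs the full transactional history would pass .option("deduplicationMode", "none") alongside the version options shown in the vote email, so nothing is collapsed unless explicitly requested.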

On Tue, Mar 3, 2026 at 10:27 PM Yuming Wang <[email protected]> wrote:

> +1, really looking forward to this feature.
>
> On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]> wrote:
>
>> Hi everyone,
>>
>> Sorry for the late response, I know the vote is actively underway, but
>> reviewing the SPIP's Catalyst post-processing mechanics raised a few
>> systemic design concerns we need to clarify to avoid severe performance
>> regressions down the line.
>>
>> 1. Capability Pushdown: The proposal has Catalyst deriving
>> pre/post-images from raw insert/delete pairs. Storage layers like Delta
>> Lake already materialize these natively. If the Changelog interface lacks
>> state pushdown, Catalyst will burn CPU and memory reconstructing what the
>> storage layer already solved.
>>
>> 2. CoW I/O Bottlenecks: Mandating Catalyst to filter "carry-over" rows
>> for CoW tables is highly problematic. Without strict connector-level row
>> lineage, we will be dragging massive, unmodified Parquet files across the
>> network, forcing Spark into heavy distributed joins just to discard
>> unchanged data.
>>
>> 3. Audit Fidelity: The design explicitly targets computing "net changes."
>> Collapsing intermediate states breaks enterprise audit and compliance
>> workflows that require full transactional history. The SQL grammar needs an
>> explicit ALL CHANGES execution path.
>>
>> I fully support unifying CDC, and this SPIP is the right direction, but
>> abstracting it at the cost of storage-native optimizations and audit
>> fidelity is a dangerous trade-off. We need to clarify how physical planning
>> will handle these bottlenecks before formally ratifying the proposal.
>>
>> Regards,
>> Viquar Khan
>>
>> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>>
>>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>>
>>> +1 (non-binding)
>>>
>>> Thanks for the contribution!
>>>
>>>
>>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]> wrote:
>>>
>>>> +1!
>>>>
>>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]>
>>>> wrote:
>>>>
>>>>> +1, look forward to it (non binding)
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Dr Mich Talebzadeh,
>>>>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>>>>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>>>>>> Analytics
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Spark devs,
>>>>>>>>
>>>>>>>> I'd like to call a vote on the *SPIP: Change Data Capture (CDC)
>>>>>>>> Support*
>>>>>>>>
>>>>>>>> *Summary:*
>>>>>>>>
>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work across
>>>>>>>> DSv2 connectors.
>>>>>>>>
>>>>>>>> 1. Standardized User API
>>>>>>>> SQL:
>>>>>>>>
>>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>
>>>>>>>> -- Streaming: Continuously process changes
>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>
>>>>>>>> DataFrame API:
>>>>>>>> spark.read
>>>>>>>>   .option("startingVersion", "10")
>>>>>>>>   .option("endingVersion", "20")
>>>>>>>>   .changes("my_table")
>>>>>>>>
>>>>>>>> 2. Engine-Level Post Processing Under the hood, this proposal
>>>>>>>> introduces a minimal Changelog interface for DSv2 connectors.
>>>>>>>> Spark's Catalyst optimizer will take over the CDC post-processing,
>>>>>>>> including:
>>>>>>>>
>>>>>>>>    - Filtering out copy-on-write carry-over rows.
>>>>>>>>    - Deriving pre-image/post-image updates from raw insert/delete
>>>>>>>>    pairs.
>>>>>>>>    - Computing net changes.
>>>>>>>>
>>>>>>>> *Relevant Links:*
>>>>>>>>
>>>>>>>>    - *SPIP Doc: *
>>>>>>>>    
>>>>>>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>>    - *Discuss Thread: *
>>>>>>>>    https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>>>    - *JIRA: *https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>
>>>>>>>>
>>>>>>>> *The vote will be open for at least 72 hours. *Please vote:
>>>>>>>>
>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>
>>>>>>>> [ ] +0
>>>>>>>>
>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gengliang Wang
>>>>>>>>
>>>>>>>
>>>
>>> --
>>> John Zhuge
>>>
