Hi everyone,

Sorry for the late response. I know the vote is already underway, but
reviewing the SPIP's Catalyst post-processing mechanics raised a few
design concerns that we should clarify to avoid severe performance
regressions down the line.

1. Capability Pushdown: The proposal has Catalyst deriving pre/post-images
from raw insert/delete pairs. Storage layers like Delta Lake already
materialize these natively. If the Changelog interface gives connectors no
way to declare that they already emit pre/post-images, Catalyst will burn
CPU and memory reconstructing what the storage layer already solved.
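
To illustrate the cost in point 1, here is a minimal sketch in plain
Python, not Spark or Catalyst code. The row shape (version, row_id,
change_type, value), the output labels, and the function name are my own
assumptions for illustration and do not come from the SPIP. It shows the
grouping step needed to turn raw delete/insert pairs into pre/post-images;
in a distributed engine that grouping typically implies a shuffle keyed on
row identity.

```python
from collections import defaultdict


def derive_update_images(changes):
    """Relabel a delete+insert pair for the same (version, row_id) as an update."""
    groups = defaultdict(list)
    for row in changes:
        # Hypothetical row identity; a real connector would have to supply it.
        groups[(row["version"], row["row_id"])].append(row)

    out = []
    for key in sorted(groups):
        rows = groups[key]
        kinds = sorted(r["change_type"] for r in rows)
        if kinds == ["delete", "insert"]:
            deleted = next(r for r in rows if r["change_type"] == "delete")
            inserted = next(r for r in rows if r["change_type"] == "insert")
            out.append({**deleted, "change_type": "update_preimage"})
            out.append({**inserted, "change_type": "update_postimage"})
        else:
            # Lone inserts or deletes pass through unchanged.
            out.extend(rows)
    return out


raw = [
    {"version": 11, "row_id": 1, "change_type": "delete", "value": "a"},
    {"version": 11, "row_id": 1, "change_type": "insert", "value": "b"},
    {"version": 11, "row_id": 2, "change_type": "insert", "value": "c"},
]
result = derive_update_images(raw)
```

If the connector could hand back the relabelled rows directly, none of this
grouping would need to run in the engine.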

2. CoW I/O Bottlenecks: Requiring Catalyst to filter "carry-over" rows for
CoW tables is highly problematic. Without strict connector-level row
lineage, we will be dragging massive, unmodified Parquet files across the
network, forcing Spark into heavy distributed joins just to discard
unchanged data.
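
As a rough sketch of what carry-over filtering involves (plain Python
again, with the same assumed row shape; nothing here is taken from the
SPIP), consider a copy-on-write rewrite of a three-row file where only one
row actually changed. Every row of the old file shows up as a delete and
every row of the new file as an insert, so the engine has to read all of
them and match identical pairs before it can discard them.

```python
from collections import Counter


def drop_carry_overs(changes):
    """Cancel delete/insert pairs whose content is identical within a version."""
    def key(row):
        return (row["version"], row["row_id"], row["value"])

    deletes = Counter(key(r) for r in changes if r["change_type"] == "delete")
    inserts = Counter(key(r) for r in changes if r["change_type"] == "insert")

    # For each key, the number of matched pairs that must be cancelled.
    cancel_deletes = {k: min(n, inserts[k]) for k, n in deletes.items()}
    cancel_inserts = dict(cancel_deletes)

    kept = []
    for r in changes:
        k = key(r)
        bucket = cancel_deletes if r["change_type"] == "delete" else cancel_inserts
        if bucket.get(k, 0) > 0:
            bucket[k] -= 1  # this row is one half of a carry-over pair
            continue
        kept.append(r)
    return kept


# One file rewritten at version 11; only row 2 really changed.
raw = [
    {"version": 11, "row_id": 1, "change_type": "delete", "value": "a"},
    {"version": 11, "row_id": 2, "change_type": "delete", "value": "b"},
    {"version": 11, "row_id": 3, "change_type": "delete", "value": "c"},
    {"version": 11, "row_id": 1, "change_type": "insert", "value": "a"},
    {"version": 11, "row_id": 2, "change_type": "insert", "value": "B"},
    {"version": 11, "row_id": 3, "change_type": "insert", "value": "c"},
]
kept = drop_carry_overs(raw)
```

Six rows are read to keep two. At Parquet-file scale the same ratio is what
worries me, unless the connector can prune carry-overs using its own row
lineage.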

3. Audit Fidelity: The design explicitly targets computing "net changes."
Collapsing intermediate states breaks enterprise audit and compliance
workflows that require full transactional history. The SQL grammar needs an
explicit ALL CHANGES execution path.
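
A small sketch of why this matters, again in plain Python under my own
assumptions: each change is a hypothetical (version, row_id, before, after)
tuple, and net_changes is an illustrative helper, not the SPIP's algorithm.
A value that is changed and then changed back disappears entirely from the
net view, even though two commits touched it.

```python
def net_changes(changes):
    """Collapse a change history to one (first_before, last_after) per row_id."""
    first_before = {}
    last_after = {}
    for version, row_id, before, after in sorted(changes, key=lambda c: c[0]):
        first_before.setdefault(row_id, before)  # keep the earliest "before"
        last_after[row_id] = after               # keep the latest "after"
    return {
        row_id: (first_before[row_id], last_after[row_id])
        for row_id in first_before
        if first_before[row_id] != last_after[row_id]
    }


history = [
    (11, 7, 100, 999),   # row 7 changed from 100 to 999
    (12, 7, 999, 100),   # and changed back one commit later
    (13, 8, None, 50),   # row 8 inserted
]
net = net_changes(history)
```

Here net contains only the insert of row 8. An auditor asking what happened
to row 7 between versions 10 and 20 needs the full history, not the
collapsed view.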

I fully support unifying CDC, and this SPIP is the right direction, but
abstracting it at the cost of storage-native optimizations and audit
fidelity is a dangerous trade-off. We need to clarify how physical planning
will handle these bottlenecks before formally ratifying the proposal.

Regards,
Viquar Khan

On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:

> +1 (non-binding)
>
> Thanks,
> Cheng Pan
>
>
>
> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>
> +1 (non-binding)
>
> Thanks for the contribution!
>
>
> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]> wrote:
>
>> +1!
>>
>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]> wrote:
>>
>>> +1, look forward to it (non binding)
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <[email protected]>
>>> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>> [email protected]> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Dr Mich Talebzadeh,
>>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>>>> Analytics
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]> wrote:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> I'd like to call a vote on the SPIP*: Change Data Capture (CDC)
>>>>>> Support*
>>>>>>
>>>>>> *Summary:*
>>>>>>
>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL clause
>>>>>> and corresponding DataFrame/DataStream APIs that work across DSv2
>>>>>> connectors.
>>>>>>
>>>>>> 1. Standardized User API
>>>>>> SQL:
>>>>>>
>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>
>>>>>> -- Streaming: Continuously process changes
>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>
>>>>>> DataFrame API:
>>>>>> spark.read
>>>>>>   .option("startingVersion", "10")
>>>>>>   .option("endingVersion", "20")
>>>>>>   .changes("my_table")
>>>>>>
>>>>>> 2. Engine-Level Post Processing
>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer will take
>>>>>> over the CDC post-processing, including:
>>>>>>
>>>>>>    - Filtering out copy-on-write carry-over rows.
>>>>>>    - Deriving pre-image/post-image updates from raw insert/delete
>>>>>>    pairs.
>>>>>>    - Computing net changes.
>>>>>>
>>>>>>
>>>>>> *Relevant Links:*
>>>>>>
>>>>>>    - *SPIP Doc: *
>>>>>>    https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>    - *Discuss Thread: *
>>>>>>    https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>    - *JIRA: *https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>
>>>>>>
>>>>>> *The vote will be open for at least 72 hours. *Please vote:
>>>>>>
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>
>>>>>> [ ] +0
>>>>>>
>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>
>>>>>> Thanks,
>>>>>> Gengliang Wang
>>>>>>
>>>>>
>
> --
> John Zhuge
>
>
>
