Thanks Gengliang for the detailed follow-up. The mechanics you laid out make sense on paper, but I'm still concerned about how they will actually play out in production.
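To make the capability suggestion in item 1 below concrete, here is a rough sketch of the shape I have in mind. Note this is purely illustrative: the Changelog trait below is a stand-in for the SPIP's DSv2 interface, and supportsNativePrePostImages() is my proposed name, not anything in the current design.

```java
// Sketch only: this Changelog is a stand-in for the SPIP's DSv2 interface,
// and supportsNativePrePostImages() is my proposed addition, not existing API.
interface Changelog {
    // Capability flags described in the SPIP (default values illustrative):
    default boolean containsCarryoverRows() { return true; }
    default boolean representsUpdateAsDeleteAndInsert() { return true; }

    // Proposed: an explicit signal that the scan output already carries
    // pre/post images, so Catalyst can skip the window-function rewrite.
    default boolean supportsNativePrePostImages() { return false; }
}

// A connector that materializes pre/post images natively (a Delta-style CDF
// source) would declare the capability directly, instead of the planner
// inferring an optimization from the representation flag.
class NativeImageConnector implements Changelog {
    @Override public boolean representsUpdateAsDeleteAndInsert() { return false; }
    @Override public boolean supportsNativePrePostImages() { return true; }
}

public class CapabilityCheck {
    public static void main(String[] args) {
        Changelog c = new NativeImageConnector();
        // The planner-side decision becomes unambiguous:
        System.out.println("skip pre/post reconstruction: "
                + c.supportsNativePrePostImages());
    }
}
```

The point is that the planner branches on a capability that says what it means, rather than reading an optimization opportunity into a flag that describes the change-row format.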
1. Capability Pushdown vs. Format Flag

Returning representsUpdateAsDeleteAndInsert() = false only signals that the
connector doesn't use raw delete/insert pairs. It doesn't explicitly tell
Catalyst, "I already computed the pre/post images natively, trust my output
and skip the window function entirely." Those are semantically different
statements. A dedicated supportsNativePrePostImages() capability method would
close this gap much more cleanly than overloading the format flag.

2. CoW I/O is a Table-Level Binary

The ScanBuilder delegation is a fair point, but containsCarryoverRows() is
still a table-level binary flag. For massive partitioned CoW tables with
carry-overs in some partitions but not others, this interface forces Spark to
apply carry-over removal globally or not at all. A partition-level or
scan-level hint is a necessary improvement for mixed-mode CoW tables.

3. Audit Discoverability

I agree deduplicationMode = 'none' is functionally correct, but my concern is
discoverability. A compliance engineer or DBA writing SQL shouldn't need
institutional knowledge of a hidden WITH-clause option string to get
audit-safe output. An explicit ALL CHANGES modifier in the grammar is crucial
for enterprise adoption and auditing.

I am highly supportive of the core architecture, but items 1 and 3 are real
production blockers for enterprise workloads that I'd like addressed in the
SPIP document. Item 2 is a real limitation but could reasonably be tracked as
a follow-on improvement. Happy to cast my +1 once 1 and 3 are clarified.

Regards,
Viquar Khan

On Wed, 4 Mar 2026 at 00:37, Gengliang Wang <[email protected]> wrote:

> Hi Viquar,
>
> Thanks for the detailed review — all three concerns are already accounted
> for in the current SPIP design (Appendix B.2 and B.6).
>
> 1. Capability Pushdown: The Changelog interface already exposes
> declarative capability methods — containsCarryoverRows(),
> containsIntermediateChanges(), and representsUpdateAsDeleteAndInsert().
> The ResolveChangelogTable rule only injects post-processing when the
> connector declares it is needed. If Delta Lake already materializes
> pre/post-images natively, it returns representsUpdateAsDeleteAndInsert()
> = false and Spark skips that work entirely. Catalyst never reconstructs
> what the storage layer already provides.
>
> 2. CoW I/O Bottlenecks: Carry-over removal is already gated on
> containsCarryoverRows() = true. If a connector eliminates carry-over rows
> at the scan level, it returns false and Spark does nothing. The connector
> also retains full control over scan planning via its ScanBuilder, so I/O
> optimization stays in the storage layer.
>
> 3. Audit Fidelity: The deduplicationMode option already supports none,
> dropCarryovers, and netChanges. Setting deduplicationMode = 'none'
> returns the raw, unmodified change stream with every intermediate state
> preserved. Net change collapsing happens when explicitly requested by the
> user.
>
> On Tue, Mar 3, 2026 at 10:27 PM Yuming Wang <[email protected]> wrote:
>
>> +1, really looking forward to this feature.
>>
>> On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> Sorry for the late response. I know the vote is actively underway, but
>>> reviewing the SPIP's Catalyst post-processing mechanics raised a few
>>> systemic design concerns we need to clarify to avoid severe performance
>>> regressions down the line.
>>>
>>> 1. Capability Pushdown: The proposal has Catalyst deriving
>>> pre/post-images from raw insert/delete pairs. Storage layers like Delta
>>> Lake already materialize these natively. If the Changelog interface
>>> lacks state pushdown, Catalyst will burn CPU and memory reconstructing
>>> what the storage layer already solved.
>>>
>>> 2.
CoW I/O Bottlenecks: Mandating Catalyst to filter "carry-over" rows
>>> for CoW tables is highly problematic. Without strict connector-level row
>>> lineage, we will be dragging massive, unmodified Parquet files across the
>>> network, forcing Spark into heavy distributed joins just to discard
>>> unchanged data.
>>>
>>> 3. Audit Fidelity: The design explicitly targets computing "net
>>> changes." Collapsing intermediate states breaks enterprise audit and
>>> compliance workflows that require full transactional history. The SQL
>>> grammar needs an explicit ALL CHANGES execution path.
>>>
>>> I fully support unifying CDC and this SPIP is the right direction, but
>>> abstracting it at the cost of storage-native optimizations and audit
>>> fidelity is a dangerous trade-off. We need to clarify how physical
>>> planning will handle these bottlenecks before formally ratifying the
>>> proposal.
>>>
>>> Regards,
>>> Viquar Khan
>>>
>>> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Thanks,
>>>> Cheng Pan
>>>>
>>>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> Thanks for the contribution!
>>>>
>>>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]> wrote:
>>>>
>>>>> +1!
>>>>>
>>>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]> wrote:
>>>>>
>>>>>> +1, look forward to it (non-binding)
>>>>>>
>>>>>> Thanks,
>>>>>> Szehon
>>>>>>
>>>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <[email protected]> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <[email protected]> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics
>>>>>>>> & Metadata Analytics | Transaction Reconstruction | Audit &
>>>>>>>> Evidence-Based Analytics
>>>>>>>>
>>>>>>>> view my LinkedIn profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Spark devs,
>>>>>>>>>
>>>>>>>>> I'd like to call a vote on the SPIP: *Change Data Capture (CDC)
>>>>>>>>> Support*
>>>>>>>>>
>>>>>>>>> *Summary:*
>>>>>>>>>
>>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work
>>>>>>>>> across DSv2 connectors.
>>>>>>>>>
>>>>>>>>> 1. Standardized User API
>>>>>>>>>
>>>>>>>>> SQL:
>>>>>>>>>
>>>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>>
>>>>>>>>> -- Streaming: Continuously process changes
>>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>>
>>>>>>>>> DataFrame API:
>>>>>>>>>
>>>>>>>>> spark.read
>>>>>>>>>   .option("startingVersion", "10")
>>>>>>>>>   .option("endingVersion", "20")
>>>>>>>>>   .changes("my_table")
>>>>>>>>>
>>>>>>>>> 2. Engine-Level Post-Processing
>>>>>>>>>
>>>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>>>> interface for DSv2 connectors.
>>>>>>>>> Spark's Catalyst optimizer will take over the CDC post-processing,
>>>>>>>>> including:
>>>>>>>>>
>>>>>>>>> - Filtering out copy-on-write carry-over rows.
>>>>>>>>> - Deriving pre-image/post-image updates from raw insert/delete pairs.
>>>>>>>>> - Computing net changes.
>>>>>>>>>
>>>>>>>>> *Relevant Links:*
>>>>>>>>>
>>>>>>>>> - *SPIP Doc:*
>>>>>>>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>>> - *Discuss Thread:*
>>>>>>>>> https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>>>> - *JIRA:* https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>>
>>>>>>>>> *The vote will be open for at least 72 hours.* Please vote:
>>>>>>>>>
>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>
>>>>>>>>> [ ] +0
>>>>>>>>>
>>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Gengliang Wang
>>>>
>>>> --
>>>> John Zhuge
