Sure, I've updated the SPIP doc with the semantic guarantee note in
Appendix B.2 and expanded the deduplicationMode descriptions in A.1 to
clarify all three modes.
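
Since the three modes keep coming up downthread, here is a toy,
connector-agnostic Python sketch of what they conceptually do to a change
stream. The mode names (none, dropCarryovers, netChanges) come from the
SPIP; the tuple data model and helper functions below are illustrative
only, not Spark APIs:

```python
from collections import Counter

# Toy CDC stream: (key, change_type, value) tuples in commit order.
# deduplicationMode='none' would return this stream unmodified.

def drop_carryovers(changes):
    """deduplicationMode='dropCarryovers', conceptually: remove
    copy-on-write carry-over pairs, i.e. a delete and an insert of the
    identical row that exist only because an unmodified row was rewritten."""
    deletes = Counter((k, v) for (k, t, v) in changes if t == "delete")
    inserts = Counter((k, v) for (k, t, v) in changes if t == "insert")
    carryovers = deletes & inserts  # rows on both sides, value unchanged
    dropped = Counter()
    out = []
    for k, t, v in changes:
        if dropped[(k, v, t)] < carryovers[(k, v)]:
            dropped[(k, v, t)] += 1  # drop this half of a carry-over pair
        else:
            out.append((k, t, v))
    return out

def net_changes(changes):
    """deduplicationMode='netChanges', conceptually: collapse all
    intermediate states per key into the net effect over the window."""
    pre, cur = {}, {}
    for k, t, v in changes:
        if k not in pre:
            # First event reveals the pre-window state: a delete means the
            # key existed before the window; an insert means it did not.
            pre[k] = v if t == "delete" else None
        if t == "delete":
            cur.pop(k, None)
        else:
            cur[k] = v
    out = []
    for k in sorted(pre):
        before, after = pre[k], cur.get(k)
        if before is None and after is not None:
            out.append((k, "insert", after))
        elif before is not None and after is None:
            out.append((k, "delete", before))
        elif before != after:
            out.append((k, "update", after))  # post-image only, in this toy
    return out

stream = [
    ("a", "delete", 1), ("a", "insert", 2),  # genuine update of a: 1 -> 2
    ("b", "delete", 5), ("b", "insert", 5),  # CoW carry-over: b unchanged
]
print(drop_carryovers(stream))  # keeps only a's delete/insert pair
print(net_changes(stream))      # collapses a to a single net update
```

Real connectors would of course do this at scan time with row lineage
rather than pairwise matching; the sketch is only meant to pin down the
observable semantics of each mode.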
Regarding the warning for incomplete change logs — that is
connector-specific behavior, so it's best left to each connector's
implementation rather than prescribed at the Spark level.
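
For anyone skimming the thread, a toy sketch of the capability-gated
planning that Gengliang and Anton describe below: Spark injects a
post-processing step only when the connector declares it is needed. The
class, step names, and option defaults here are illustrative stand-ins of
my own, not the SPIP's actual interfaces:

```python
from dataclasses import dataclass

# Illustrative stand-in for the SPIP's Changelog capability surface; the
# real interface is a DSv2 Java interface, not this Python class.
@dataclass
class ToyChangelog:
    contains_carryover_rows: bool
    represents_update_as_delete_and_insert: bool

def plan_post_processing(changelog, deduplication_mode="dropCarryovers",
                         compute_updates=False):
    """Toy planner: list the post-processing steps Spark would inject,
    based on what the connector declares. Names are guesses for
    illustration."""
    steps = []
    # Carry-over removal is gated on the connector declaring carry-overs,
    # and is skipped entirely for the raw 'none' mode.
    if deduplication_mode != "none" and changelog.contains_carryover_rows:
        steps.append("drop_carryovers")
    # Pre/post-image derivation runs only if the connector represents
    # updates as raw delete/insert pairs; otherwise images are native.
    if compute_updates and changelog.represents_update_as_delete_and_insert:
        steps.append("derive_pre_post_images")
    if deduplication_mode == "netChanges":
        steps.append("net_changes")
    return steps

# A connector that materializes everything natively: Spark does nothing,
# which is the "Catalyst never reconstructs what the storage layer already
# provides" behavior discussed downthread.
native = ToyChangelog(contains_carryover_rows=False,
                      represents_update_as_delete_and_insert=False)
print(plan_post_processing(native, "none"))  # no steps injected
```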

On Wed, Mar 4, 2026 at 11:06 AM vaquar khan <[email protected]> wrote:

> Thanks Gengliang for the continued engagement, and thank you Anton for the
> important clarification on containsCarryoverRows().
>
> 1. Capability Naming — Documentation Clarification Ask
> I respect the DSv2 naming consistency. My ask is narrower: please add a
> note in Appendix B.2 explicitly stating that returning false carries the
> semantic guarantee that pre/post-images are fully materialized by the
> connector, not lazily computed at scan time. No new method needed, just
> documentation clarity to prevent incorrect connector implementations.
>
> 2. CoW I/O — Fully Withdrawn
> Anton's clarification changes my position here entirely. If TableCatalog
> loads Changelog with awareness of the specific range being scanned, and the
> connector can inspect the actual commit history for that range to determine
> whether CoW operations occurred, then containsCarryoverRows() is
> effectively a range-scoped, commit-aware signal, not a coarse table-level
> binary. That fully addresses my concern. I'm withdrawing Item 2 entirely,
> not just as a blocker.
>
> 3. Audit Discoverability — Revised Ask
> You're right that ALL CHANGES could mislead compliance engineers when
> change logs are partially vacuumed. I withdraw that request.
> My revised ask:
> - Clearly state in the SQL documentation that deduplicationMode='none' is
> the right way to get a full audit trail
> - Show a warning if a user queries a table where some of the old change
> logs have already been deleted
>
> With items 1 and 3 addressed in the SPIP text, count my +1.
>
> Regards,
> Viquar Khan
>
> On Wed, 4 Mar 2026 at 12:35, Anton Okolnychyi <[email protected]>
> wrote:
>
>> To add to Gengliang's point, TableCatalog would load Changelog
>> knowing the range that is being scanned. This allows the connector to
>> traverse the commit history and detect whether it had any CoW operation or
>> not. In other words, it is not a blind flag at the table level. It is
>> specific to the changelog range that is being requested.
>>
>> On Wed, Mar 4, 2026 at 09:17 Gengliang Wang <[email protected]> wrote:
>>
>>> Thanks for the follow-up — appreciate the rigor.
>>>
>>> *1.* *Capability Naming*: The naming is intentional —
>>> representsUpdateAsDeleteAndInsert() mirrors the existing
>>> SupportsDelta.representUpdateAsDeleteAndInsert()
>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/SupportsDelta.java#L45>
>>> in the DSv2 API. When it returns false, it means the connector's change
>>> data already distinguishes updates from raw delete/insert pairs, so there
>>> is nothing for Catalyst to derive.
>>>
>>> *2.* *Partition-Level CoW Hints*: A table-level flag is sufficient for
>>> the common case. If a connector has partitions with mixed CoW behavior and
>>> needs finer-grained control, it can simply return containsCarryoverRows() =
>>> false and handle carry-over removal internally within its ScanBuilder — the
>>> interface already supports this. There is no need to complicate the
>>> Spark-level API for an edge case that connectors can solve themselves.
>>>
>>> *3. Audit Discoverability*: The SPIP proposes only two options in the
>>> WITH clause (deduplicationMode and computeUpdates) — this is a small,
>>> well-documented surface, not a hidden knob. Adding an ALL CHANGES grammar
>>> modifier introduces its own discoverability problem: it implies the table
>>> retains a complete history of all changes, which is not guaranteed — most
>>> formats discard old change data after vacuum/expiration. A SQL keyword that
>>> suggests completeness but silently returns partial results is arguably
>>> worse for compliance engineers than an explicit option with clear
>>> documentation.
>>>
>>>
>>>
>>> On Tue, Mar 3, 2026 at 11:17 PM vaquar khan <[email protected]>
>>> wrote:
>>>
>>>> Thanks Gengliang for the detailed follow-up. The mechanics you laid
>>>> out make sense on paper, but I'm looking at how this will actually play
>>>> out in production.
>>>>
>>>> 1. Capability Pushdown vs. Format Flag
>>>> Returning representsUpdateAsDeleteAndInsert() = false just signals that
>>>> the connector doesn't use raw delete/insert pairs. It doesn't explicitly
>>>> tell Catalyst, "I already computed the pre/post images natively, trust my
>>>> output and skip the window function entirely." Those are semantically
>>>> different. A dedicated supportsNativePrePostImages() capability method
>>>> would close this gap much more cleanly than overloading the format flag.
>>>>
>>>> 2. CoW I/O is a Table-Level Binary
>>>> The ScanBuilder delegation is a fair point, but containsCarryoverRows()
>>>> is still a table-level binary flag. For massive, partitioned CoW tables
>>>> that have carry-overs in some partitions but not others, this interface
>>>> forces Spark to apply carry-over removal globally or not at all. A
>>>> partition-level or scan-level hint is a necessary improvement for
>>>> mixed-mode CoW tables.
>>>>
>>>> 3. Audit Discoverability
>>>> I agree deduplicationMode='none' is functionally correct, but my
>>>> concern is discoverability. A compliance engineer or DBA writing SQL
>>>> shouldn't need institutional knowledge of a hidden WITH clause option
>>>> string to get audit-safe output. Having an explicit ALL CHANGES modifier in
>>>> the grammar is crucial for enterprise adoption and auditing.
>>>>
>>>> I am highly supportive of the core architecture, but these are real
>>>> production concerns for enterprise workloads. Items 1 and 3 are
>>>> production blockers I'd like addressed in the SPIP document; Item 2 is
>>>> a real limitation but could reasonably be tracked as a follow-on
>>>> improvement. Happy to cast my +1 once 1 and 3 are clarified.
>>>>
>>>> Regards,
>>>> Viquar Khan
>>>>
>>>> On Wed, 4 Mar 2026 at 00:37, Gengliang Wang <[email protected]> wrote:
>>>>
>>>>> Hi Viquar,
>>>>>
>>>>> Thanks for the detailed review — all three concerns are already
>>>>> accounted for in the current SPIP design (Appendix B.2 and B.6).
>>>>>
>>>>> 1. Capability Pushdown: The Changelog interface already exposes
>>>>> declarative capability methods — containsCarryoverRows(),
>>>>> containsIntermediateChanges(), and representsUpdateAsDeleteAndInsert().
>>>>> The ResolveChangelogTable rule only injects post-processing when the
>>>>> connector declares it is needed. If Delta Lake already materializes
>>>>> pre/post-images natively, it returns representsUpdateAsDeleteAndInsert()
>>>>> = false and Spark skips that work entirely. Catalyst never reconstructs
>>>>> what the storage layer already provides.
>>>>>
>>>>> 2. CoW I/O Bottlenecks: Carry-over removal is already gated on
>>>>> containsCarryoverRows() = true. If a connector eliminates carry-over
>>>>> rows at the scan level, it returns false and Spark does nothing. The
>>>>> connector also retains full control over scan planning via its
>>>>> ScanBuilder, so I/O optimization stays in the storage layer.
>>>>>
>>>>> 3. Audit Fidelity: The deduplicationMode option already supports none,
>>>>> dropCarryovers, and netChanges. Setting deduplicationMode = 'none'
>>>>> returns the raw, unmodified change stream with every intermediate state
>>>>> preserved. Net change collapsing happens when explicitly requested by
>>>>> the user.
>>>>>
>>>>> On Tue, Mar 3, 2026 at 10:27 PM Yuming Wang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1, really looking forward to this feature.
>>>>>>
>>>>>> On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Sorry for the late response, I know the vote is actively underway,
>>>>>>> but reviewing the SPIP's Catalyst post-processing mechanics raised a few
>>>>>>> systemic design concerns we need to clarify to avoid severe performance
>>>>>>> regressions down the line.
>>>>>>>
>>>>>>> 1. Capability Pushdown: The proposal has Catalyst deriving
>>>>>>> pre/post-images from raw insert/delete pairs. Storage layers like Delta
>>>>>>> Lake already materialize these natively. If the Changelog interface 
>>>>>>> lacks
>>>>>>> state pushdown, Catalyst will burn CPU and memory reconstructing what 
>>>>>>> the
>>>>>>> storage layer already solved.
>>>>>>>
>>>>>>> 2. CoW I/O Bottlenecks: Mandating Catalyst to filter "carry-over"
>>>>>>> rows for CoW tables is highly problematic. Without strict 
>>>>>>> connector-level
>>>>>>> row lineage, we will be dragging massive, unmodified Parquet files 
>>>>>>> across
>>>>>>> the network, forcing Spark into heavy distributed joins just to discard
>>>>>>> unchanged data.
>>>>>>>
>>>>>>> 3. Audit Fidelity: The design explicitly targets computing "net
>>>>>>> changes." Collapsing intermediate states breaks enterprise audit and
>>>>>>> compliance workflows that require full transactional history. The SQL
>>>>>>> grammar needs an explicit ALL CHANGES execution path.
>>>>>>>
>>>>>>> I fully support unifying CDC, and this SPIP is the right direction,
>>>>>>> but abstracting it at the cost of storage-native optimizations and
>>>>>>> audit fidelity is a dangerous trade-off. We need to clarify how
>>>>>>> physical planning will handle these bottlenecks before formally
>>>>>>> ratifying the proposal.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Viquar Khan
>>>>>>>
>>>>>>> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Cheng Pan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> Thanks for the contribution!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1!
>>>>>>>>>
>>>>>>>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1, look forward to it (non binding)
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Szehon
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>>>>> Data Scientist | Distributed Systems (Spark) | Financial
>>>>>>>>>>>> Forensics & Metadata Analytics | Transaction Reconstruction | 
>>>>>>>>>>>> Audit &
>>>>>>>>>>>> Evidence-Based Analytics
>>>>>>>>>>>>
>>>>>>>>>>>>    view my Linkedin profile
>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Spark devs,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to call a vote on the SPIP*: Change Data Capture
>>>>>>>>>>>>> (CDC) Support*
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Summary:*
>>>>>>>>>>>>>
>>>>>>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work 
>>>>>>>>>>>>> across DSv2
>>>>>>>>>>>>> connectors.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Standardized User API
>>>>>>>>>>>>> SQL:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Streaming: Continuously process changes
>>>>>>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>>>>>>
>>>>>>>>>>>>> DataFrame API:
>>>>>>>>>>>>> spark.read
>>>>>>>>>>>>>   .option("startingVersion", "10")
>>>>>>>>>>>>>   .option("endingVersion", "20")
>>>>>>>>>>>>>   .changes("my_table")
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Engine-Level Post Processing
>>>>>>>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>>>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer will
>>>>>>>>>>>>> take over the CDC post-processing, including:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Filtering out copy-on-write carry-over rows.
>>>>>>>>>>>>>    - Deriving pre-image/post-image updates from raw
>>>>>>>>>>>>>    insert/delete pairs.
>>>>>>>>>>>>>    - Computing net changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Relevant Links:*
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - *SPIP Doc: *
>>>>>>>>>>>>>    
>>>>>>>>>>>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>>>>>>>    - *Discuss Thread: *
>>>>>>>>>>>>>    
>>>>>>>>>>>>> https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>>>>>>>>    - *JIRA: *https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *The vote will be open for at least 72 hours. *Please vote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>>>>>
>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>>
>>>>>>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Gengliang Wang
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> John Zhuge
>>>>>>>>
>>>>>>>>
>>>>>>>>
