+1

On Wed, Mar 4, 2026 at 12:47 PM vaquar khan <[email protected]> wrote:

> Thanks for updating the SPIP (B.2 and A.1).
>
> +1
>
> Regards,
>
> Viquar Khan
>
>
> On Wed, 4 Mar 2026 at 14:03, Gengliang Wang <[email protected]> wrote:
>
>> Sure, I've updated the SPIP doc with the semantic guarantee note in
>> Appendix B.2 and expanded the deduplicationMode descriptions in A.1 to
>> clarify all three modes.
>> Regarding the warning for incomplete change logs — that is
>> connector-specific behavior, so it's best left to each connector's
>> implementation rather than prescribed at the Spark level.
>>
>> On Wed, Mar 4, 2026 at 11:06 AM vaquar khan <[email protected]>
>> wrote:
>>
>>> Thanks Gengliang for the continued engagement, and thank you Anton for
>>> the important clarification on containsCarryoverRows().
>>>
>>> 1. Capability Naming — Documentation Clarification Ask
>>> I respect the DSv2 naming consistency. My ask is narrower: please add a
>>> note in Appendix B.2 explicitly stating that returning false carries the
>>> semantic guarantee that pre/post-images are fully materialized by the
>>> connector, not lazily computed at scan time. No new method is needed, just
>>> documentation clarity to prevent incorrect connector implementations.
>>>
>>> 2. CoW I/O — Fully Withdrawn
>>> Anton's clarification changes my position here entirely. If TableCatalog
>>> loads Changelog with awareness of the specific range being scanned, and the
>>> connector can inspect the actual commit history for that range to determine
>>> whether CoW operations occurred, then containsCarryoverRows() is
>>> effectively a range-scoped, commit-aware signal, not a coarse table-level
>>> binary. That fully addresses my concern. I'm withdrawing Item 2 entirely,
>>> not just as a blocker.
>>>
>>> 3. Audit Discoverability — Revised Ask
>>> You're right that ALL CHANGES could mislead compliance engineers when
>>> change logs are partially vacuumed. I withdraw that request.
>>> My revised ask:
>>> - Clearly state in the SQL documentation that deduplicationMode='none'
>>> is the right way to get a full audit trail
>>> - Show a warning if a user queries a table where some of the old change
>>> logs have already been deleted
>>>
>>> With items 1 and 3 addressed in the SPIP text, count my +1.
>>>
>>> Regards,
>>> Viquar Khan
>>>
>>> On Wed, 4 Mar 2026 at 12:35, Anton Okolnychyi <[email protected]>
>>> wrote:
>>>
>>>> To add to Gengliang's point, TableCatalog would load Changelog
>>>> knowing the range that is being scanned. This allows the connector to
>>>> traverse the commit history and detect whether it had any CoW operation or
>>>> not. In other words, it is not a blind flag at the table level. It is
>>>> specific to the changelog range that is being requested.
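Anton's point can be sketched as follows. Only containsCarryoverRows() comes from the SPIP; the CommitKind enum, the commit-history representation, and the scanning logic are invented here to illustrate how a connector might answer for a specific version range rather than for the whole table.

```java
import java.util.List;

// Toy illustration: the changelog is loaded for a specific version range,
// so the connector can walk just the commits in that range and report
// carry-over rows only when a copy-on-write commit is actually present.
class RangeScopedChangelog {
    enum CommitKind { APPEND, COPY_ON_WRITE_REWRITE, MERGE_ON_READ_DELETE }

    private final List<CommitKind> commitHistory; // one entry per version
    private final int startVersion;
    private final int endVersion;

    RangeScopedChangelog(List<CommitKind> commitHistory,
                         int startVersion, int endVersion) {
        this.commitHistory = commitHistory;
        this.startVersion = startVersion;
        this.endVersion = endVersion;
    }

    /** True only if a commit inside the requested range rewrote files (CoW). */
    boolean containsCarryoverRows() {
        for (int v = startVersion; v <= endVersion; v++) {
            if (commitHistory.get(v) == CommitKind.COPY_ON_WRITE_REWRITE) {
                return true;
            }
        }
        return false;
    }
}
```

A scan over an append-only range would return false even if other versions of the table had CoW rewrites, which is exactly the range-scoped behavior described above.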
>>>>
>>>> ср, 4 бер. 2026 р. о 09:17 Gengliang Wang <[email protected]> пише:
>>>>
>>>>> Thanks for the follow-up — appreciate the rigor.
>>>>>
>>>>> *1.* *Capability Naming*: The naming is intentional —
>>>>> representsUpdateAsDeleteAndInsert() mirrors the existing
>>>>> SupportsDelta.representUpdateAsDeleteAndInsert()
>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/SupportsDelta.java#L45>
>>>>> in the DSv2 API. When it returns false, it means the connector's change
>>>>> data already distinguishes updates from raw delete/insert pairs, so there
>>>>> is nothing for Catalyst to derive.
>>>>>
>>>>> *2.* *Partition-Level CoW Hints*: A table-level flag is sufficient
>>>>> for the common case. If a connector has partitions with mixed CoW behavior
>>>>> and needs finer-grained control, it can simply return
>>>>> containsCarryoverRows() = false and handle carry-over removal internally
>>>>> within its ScanBuilder — the interface already supports this. There is no
>>>>> need to complicate the Spark-level API for an edge case that connectors 
>>>>> can
>>>>> solve themselves.
>>>>>
>>>>> *3. Audit Discoverability*: The SPIP proposes only two options in the
>>>>> WITH clause (deduplicationMode and computeUpdates) — this is a small,
>>>>> well-documented surface, not a hidden knob. Adding an ALL CHANGES grammar
>>>>> modifier introduces its own discoverability problem: it implies the table
>>>>> retains a complete history of all changes, which is not guaranteed — most
>>>>> formats discard old change data after vacuum/expiration. A SQL keyword 
>>>>> that
>>>>> suggests completeness but silently returns partial results is arguably
>>>>> worse for compliance engineers than an explicit option with clear
>>>>> documentation.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 3, 2026 at 11:17 PM vaquar khan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks Gengliang for the detailed follow-up. While the mechanics
>>>>>> you laid out make sense on paper, I'm looking at how this will actually
>>>>>> play out in production.
>>>>>>
>>>>>> 1. Capability Pushdown vs. Format Flag
>>>>>> Returning representsUpdateAsDeleteAndInsert() = false just signals
>>>>>> that the connector doesn't use raw delete/insert pairs. It doesn't
>>>>>> explicitly tell Catalyst, "I already computed the pre/post images 
>>>>>> natively,
>>>>>> trust my output and skip the window function entirely." Those are
>>>>>> semantically different. A dedicated supportsNativePrePostImages()
>>>>>> capability method would close this gap much more cleanly than overloading
>>>>>> the format flag.
>>>>>>
>>>>>> 2. CoW I/O is a Table-Level Binary
>>>>>> The ScanBuilder delegation is a fair point, but
>>>>>> containsCarryoverRows() is still a table-level binary flag. For massive,
>>>>>> partitioned CoW tables that have carry-overs in some partitions but not
>>>>>> others, this interface forces Spark to apply carry-over removal globally 
>>>>>> or
>>>>>> not at all. A partition-level or scan-level hint is a necessary 
>>>>>> improvement
>>>>>> for mixed-mode CoW tables.
>>>>>>
>>>>>> 3. Audit Discoverability
>>>>>> I agree deduplicationMode='none' is functionally correct, but my
>>>>>> concern is discoverability. A compliance engineer or DBA writing SQL
>>>>>> shouldn't need institutional knowledge of a hidden WITH clause option
>>>>>> string to get audit-safe output. Having an explicit ALL CHANGES modifier 
>>>>>> in
>>>>>> the grammar is crucial for enterprise adoption and auditing.
>>>>>>
>>>>>> I am highly supportive of the core architecture, but these are real
>>>>>> concerns for enterprise workloads. Items 1 and 3 are production blockers
>>>>>> I'd like addressed in the SPIP document. Item 2 is a real limitation but
>>>>>> could reasonably be tracked as a follow-on improvement. Happy to cast my
>>>>>> +1 once 1 and 3 are clarified.
>>>>>>
>>>>>> Regards,
>>>>>> Viquar Khan
>>>>>>
>>>>>> On Wed, 4 Mar 2026 at 00:37, Gengliang Wang <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Viquar,
>>>>>>>
>>>>>>> Thanks for the detailed review — all three concerns are already
>>>>>>> accounted for in the current SPIP design (Appendix B.2 and B.6).
>>>>>>>
>>>>>>> 1. Capability Pushdown: The Changelog interface already exposes
>>>>>>> declarative capability methods — containsCarryoverRows(),
>>>>>>> containsIntermediateChanges(), and representsUpdateAsDeleteAndInsert().
>>>>>>> The ResolveChangelogTable rule only injects post-processing when the
>>>>>>> connector declares it is needed. If Delta Lake already materializes
>>>>>>> pre/post-images natively, it returns
>>>>>>> representsUpdateAsDeleteAndInsert() = false and Spark skips that work
>>>>>>> entirely. Catalyst never reconstructs what the storage layer already
>>>>>>> provides.
>>>>>>>
>>>>>>> 2. CoW I/O Bottlenecks: Carry-over removal is already gated on
>>>>>>> containsCarryoverRows() = true. If a connector eliminates carry-over
>>>>>>> rows at the scan level, it returns false and Spark does nothing. The
>>>>>>> connector also retains full control over scan planning via its
>>>>>>> ScanBuilder, so I/O optimization stays in the storage layer.
>>>>>>>
>>>>>>> 3. Audit Fidelity: The deduplicationMode option already supports none,
>>>>>>> dropCarryovers, and netChanges. Setting deduplicationMode = 'none'
>>>>>>> returns the raw, unmodified change stream with every intermediate state
>>>>>>> preserved. Net change collapsing happens only when explicitly requested
>>>>>>> by the user.
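The three deduplicationMode values can be modeled in miniature as follows. Only the mode names and their described semantics come from the SPIP; the Change record and the exact pairing logic are invented for illustration and are much simpler than what a distributed implementation would do.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Set;

// Toy model of deduplicationMode = 'none' | 'dropCarryovers' | 'netChanges'.
class DedupModes {
    record Change(String key, String op, String value) {}

    // 'none': the raw change stream, untouched; every intermediate state kept.
    static List<Change> none(List<Change> raw) {
        return raw;
    }

    // 'dropCarryovers': remove DELETE/INSERT pairs carrying the same key and
    // value (rows copied unmodified into rewritten copy-on-write files).
    static List<Change> dropCarryovers(List<Change> raw) {
        Set<String> deleted = new HashSet<>();
        Set<String> inserted = new HashSet<>();
        for (Change c : raw) {
            String sig = c.key() + "=" + c.value();
            if (c.op().equals("DELETE")) deleted.add(sig);
            if (c.op().equals("INSERT")) inserted.add(sig);
        }
        List<Change> out = new ArrayList<>();
        for (Change c : raw) {
            String sig = c.key() + "=" + c.value();
            boolean carryover = deleted.contains(sig) && inserted.contains(sig)
                    && (c.op().equals("DELETE") || c.op().equals("INSERT"));
            if (!carryover) out.add(c);
        }
        return out;
    }

    // 'netChanges': collapse to the last change per key across the range.
    static List<Change> netChanges(List<Change> raw) {
        LinkedHashMap<String, Change> last = new LinkedHashMap<>();
        for (Change c : raw) last.put(c.key(), c);
        return new ArrayList<>(last.values());
    }
}
```

This also makes the audit argument concrete: 'none' preserves the full stream a compliance workflow needs, while the other two modes discard information by design.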
>>>>>>>
>>>>>>> On Tue, Mar 3, 2026 at 10:27 PM Yuming Wang <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1, really looking forward to this feature.
>>>>>>>>
>>>>>>>> On Wed, Mar 4, 2026 at 1:57 PM vaquar khan <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> Sorry for the late response; I know the vote is actively underway,
>>>>>>>>> but reviewing the SPIP's Catalyst post-processing mechanics raised a
>>>>>>>>> few systemic design concerns that we need to clarify to avoid severe
>>>>>>>>> performance regressions down the line.
>>>>>>>>>
>>>>>>>>> 1. Capability Pushdown: The proposal has Catalyst deriving
>>>>>>>>> pre/post-images from raw insert/delete pairs. Storage layers like 
>>>>>>>>> Delta
>>>>>>>>> Lake already materialize these natively. If the Changelog interface 
>>>>>>>>> lacks
>>>>>>>>> state pushdown, Catalyst will burn CPU and memory reconstructing what 
>>>>>>>>> the
>>>>>>>>> storage layer already solved.
>>>>>>>>>
>>>>>>>>> 2. CoW I/O Bottlenecks: Mandating Catalyst to filter "carry-over"
>>>>>>>>> rows for CoW tables is highly problematic. Without strict 
>>>>>>>>> connector-level
>>>>>>>>> row lineage, we will be dragging massive, unmodified Parquet files 
>>>>>>>>> across
>>>>>>>>> the network, forcing Spark into heavy distributed joins just to 
>>>>>>>>> discard
>>>>>>>>> unchanged data.
>>>>>>>>>
>>>>>>>>> 3. Audit Fidelity: The design explicitly targets computing "net
>>>>>>>>> changes." Collapsing intermediate states breaks enterprise audit and
>>>>>>>>> compliance workflows that require full transactional history. The SQL
>>>>>>>>> grammar needs an explicit ALL CHANGES execution path.
>>>>>>>>>
>>>>>>>>> I fully support unifying CDC, and this SPIP is the right direction,
>>>>>>>>> but abstracting it at the cost of storage-native optimizations and 
>>>>>>>>> audit
>>>>>>>>> fidelity is a dangerous trade-off. We need to clarify how physical 
>>>>>>>>> planning
>>>>>>>>> will handle these bottlenecks before formally ratifying the proposal.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Viquar Khan
>>>>>>>>>
>>>>>>>>> On Tue, 3 Mar 2026 at 20:09, Cheng Pan <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Cheng Pan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mar 4, 2026, at 09:59, John Zhuge <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>
>>>>>>>>>> Thanks for the contribution!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 3, 2026 at 5:50 PM Burak Yavuz <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:48 PM Szehon Ho <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1, look forward to it (non binding)
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Szehon
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:37 PM Anton Okolnychyi <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 3, 2026 at 5:07 PM Mich Talebzadeh <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>>>>>>> Data Scientist | Distributed Systems (Spark) | Financial
>>>>>>>>>>>>>> Forensics & Metadata Analytics | Transaction Reconstruction | 
>>>>>>>>>>>>>> Audit &
>>>>>>>>>>>>>> Evidence-Based Analytics
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    view my Linkedin profile
>>>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 4 Mar 2026 at 00:57, Gengliang Wang <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Spark devs,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd like to call a vote on the SPIP*: Change Data Capture
>>>>>>>>>>>>>>> (CDC) Support*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Summary:*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This SPIP proposes a unified approach by adding a CHANGES SQL
>>>>>>>>>>>>>>> clause and corresponding DataFrame/DataStream APIs that work 
>>>>>>>>>>>>>>> across DSv2
>>>>>>>>>>>>>>> connectors.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Standardized User API
>>>>>>>>>>>>>>> SQL:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Batch: What changed between version 10 and 20?
>>>>>>>>>>>>>>> SELECT * FROM my_table CHANGES FROM VERSION 10 TO VERSION 20;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Streaming: Continuously process changes
>>>>>>>>>>>>>>> CREATE STREAMING TABLE cdc_sink AS
>>>>>>>>>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 0;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> DataFrame API:
>>>>>>>>>>>>>>> spark.read
>>>>>>>>>>>>>>>   .option("startingVersion", "10")
>>>>>>>>>>>>>>>   .option("endingVersion", "20")
>>>>>>>>>>>>>>>   .changes("my_table")
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. Engine-Level Post-Processing
>>>>>>>>>>>>>>> Under the hood, this proposal introduces a minimal Changelog
>>>>>>>>>>>>>>> interface for DSv2 connectors. Spark's Catalyst optimizer will
>>>>>>>>>>>>>>> take over the CDC post-processing, including:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - Filtering out copy-on-write carry-over rows.
>>>>>>>>>>>>>>>    - Deriving pre-image/post-image updates from raw
>>>>>>>>>>>>>>>    insert/delete pairs.
>>>>>>>>>>>>>>>    - Computing net changes.
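The second post-processing step listed above, deriving pre/post-image updates, can be sketched in miniature. The row layout, the pairing-by-key-within-a-commit rule, and all names below are illustrative assumptions; the SPIP's actual rule operates inside Catalyst on connector-provided change data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: a DELETE and an INSERT for the same key within the same
// commit version are paired into one update carrying a pre-image and a
// post-image. Assumes at most one DELETE/INSERT pair per key per version.
class UpdateDerivation {
    record RawChange(long version, String key, String op, String value) {}
    record Update(long version, String key, String preImage, String postImage) {}

    static List<Update> deriveUpdates(List<RawChange> raw) {
        List<Update> updates = new ArrayList<>();
        for (RawChange del : raw) {
            if (!del.op().equals("DELETE")) continue;
            for (RawChange ins : raw) {
                // Same key, same commit: together they describe one update.
                if (ins.op().equals("INSERT")
                        && ins.version() == del.version()
                        && ins.key().equals(del.key())) {
                    updates.add(new Update(del.version(), del.key(),
                            del.value(), ins.value()));
                }
            }
        }
        return updates;
    }
}
```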
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Relevant Links:*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - *SPIP Doc: *
>>>>>>>>>>>>>>>    
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing
>>>>>>>>>>>>>>>    - *Discuss Thread: *
>>>>>>>>>>>>>>>    
>>>>>>>>>>>>>>> https://lists.apache.org/thread/dhxx6pohs7fvqc3knzhtoj4tbcgrwxts
>>>>>>>>>>>>>>>    - *JIRA: *
>>>>>>>>>>>>>>>    https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *The vote will be open for at least 72 hours. *Please vote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Gengliang Wang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> John Zhuge
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
