I would suggest framing Auto CDC not only around MERGE/mutation, but also around immutability and auditability.
In many real-world cases, the ability to reconstruct what happened (ordering of changes, intermediate states) is as important as computing the latest state. From this perspective, Auto CDC should treat immutable changelog data as a first-class abstraction, queryable via Spark SQL across storage backends. This enables:

- state reconstruction (point-in-time views)
- audit / forensic analysis
- validation of upstream processes

MERGE then becomes just one materialisation strategy, rather than the core abstraction. From a connector standpoint, exposing complete, ordered changelogs is more fundamental than supporting MERGE alone.

HTH

Dr Mich Talebzadeh,
Data Scientist | Distributed Systems (Spark) | Financial Forensics & Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based Analytics

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


On Tue, 31 Mar 2026 at 18:09, Anton Okolnychyi <[email protected]> wrote:

> I think this is going to be a great feature if it naturally extends the
> current capabilities of Spark and complements / builds on top of changelogs
> and MERGE.
>
> I would be really worried otherwise.
>
> On Tue, Mar 31, 2026 at 6:02 AM Andreas Neumann <[email protected]> wrote:
>
>> Hi Mich,
>>
>> I agree that it is, in theory, possible to implement this for connectors
>> that do not support MERGE. But it is our intention to rely on MERGE support
>> for this feature. If there is large interest in it, we could follow up
>> with an implementation that does not require MERGE, but I would want
>> to evaluate first whether the additional complexity is justified.
>>
>> Note also that we will not require changelogs. That is a requirement for
>> the source that produces the CDC feed, as addressed by this existing SPIP:
>> SPIP: Change Data Capture (CDC) Support
>> <https://docs.google.com/document/d/1-4rCS3vsGIyhwnkAwPsEaqyUDg-AuVkdrYLotFPw0U0/edit?usp=sharing>.
>> I actually expect that many use cases will come from sources that do not
>> support MERGE but do support change feeds.
>>
>> Cheers -Andreas
>>
>> On Mon, Mar 30, 2026 at 2:09 PM Mich Talebzadeh <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> Auto CDC may compile to a MERGE-like row-level plan, but the SPIP
>>> should describe that as one implementation strategy, *not necessarily
>>> the only one*. Connectors do not just need changelogs and MERGE; they
>>> need changelog semantics on the read side and row-level mutation capability
>>> on the write side, plus keys and usually sequencing.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh,
>>> Data Scientist | Distributed Systems (Spark) | Financial Forensics &
>>> Metadata Analytics | Transaction Reconstruction | Audit & Evidence-Based
>>> Analytics
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> On Mon, 30 Mar 2026 at 16:49, Anton Okolnychyi <[email protected]> wrote:
>>>
>>>> Will auto CDC compile into or use MERGE under the hood? If yes, can we
>>>> include a sketch of what that rewrite will look like in the SPIP? What
>>>> exactly do connectors need to support to benefit from auto CDC? Just
>>>> support changelogs and MERGE?
>>>>
>>>> - Anton
>>>>
>>>> On Mon, 30 Mar 2026 at 07:44, Andreas Neumann <[email protected]> wrote:
>>>>
>>>>> Hi Vaquar,
>>>>>
>>>>> I responded to most of your comments on the document itself.
>>>>> Additional comments inline.
>>>>>
>>>>> Cheers -Andreas
>>>>>
>>>>> On Sat, Mar 28, 2026 at 4:42 PM vaquar khan <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for the SPIP. I fully support the goal: abstracting CDC merge
>>>>>> logic is a huge win for the community. However, looking at the current
>>>>>> Spark versions, there are significant architectural gaps between
>>>>>> Databricks Lakeflow's proprietary implementation and OSS Spark.
>>>>>>
>>>>>> A few technical blockers need clarification before we move forward:
>>>>>>
>>>>>> - OSS Compatibility: Databricks documentation explicitly states that
>>>>>> the AUTO CDC APIs are not supported by Apache Spark Declarative
>>>>>> Pipelines <https://docs.databricks.com/gcp/en/ldp/cdc>.
>>>>>>
>>>>> That will change with the implementation of this SPIP.
>>>>>
>>>>>> - Streaming MERGE: The proposed flow requires continuous
>>>>>> upsert/delete semantics, but Dataset.mergeInto() currently does not
>>>>>> support streaming queries. Does this SPIP introduce an entirely new
>>>>>> execution path to bypass this restriction?
>>>>>>
>>>>> This works with foreachBatch.
>>>>>
>>>>>> - Tombstone Garbage Collection: Handling stream deletes safely
>>>>>> requires state store tombstone retention (e.g., configuring
>>>>>> pipelines.cdc.tombstoneGCThresholdInSeconds) to prevent late-arriving
>>>>>> data from resurrecting deleted keys. How will this be implemented
>>>>>> natively in OSS Spark state stores?
>>>>>>
>>>>> That's an interesting question. Tombstones could be modeled in the
>>>>> state store, but we are thinking that they will be modeled as an explicit
>>>>> output of the flow: either as records in the output table with a "deleted
>>>>> at" marker, possibly with a view on top to project away these rows, or in
>>>>> a separate output that contains only the tombstones. The exact design is
>>>>> not finalized; that is part of the first phase of the project.
>>>>>
>>>>>> - Sequencing Constraints: SEQUENCE BY enforces strict ordering, and
>>>>>> NULL sequencing values are explicitly not supported. How will the engine
>>>>>> handle malformed or non-monotonic upstream sequences compared to our
>>>>>> existing time-based watermarks?
>>>>>>
>>>>> I think malformed change events should, at least in the first
>>>>> iteration, fail the stream. Otherwise there is a risk of writing
>>>>> incorrect data.
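[Editor's note: to make the KEYS / SEQUENCE BY / tombstone semantics discussed above concrete, here is a minimal pure-Python sketch. It is not PySpark and not part of the SPIP; all names (apply_changes, deleted_at, the event shape) are hypothetical. It simulates SCD Type 1 apply logic with out-of-order events and deletes modeled as "deleted at" marker rows, as Andreas describes.]

```python
def apply_changes(target, events, key="userId", seq="sequenceNum"):
    """Apply CDC events in arrival order; SEQUENCE BY resolves out-of-order data."""
    for event in events:
        if event[seq] is None:
            # NULL sequencing values are not supported: fail the stream.
            raise ValueError(f"NULL {seq} in change event: {event}")
        current = target.get(event[key])
        if current is not None and current[seq] >= event[seq]:
            continue  # late-arriving event, older than applied state: skip it
        row = {k: v for k, v in event.items() if k != "operation"}
        # Deletes keep a marker row instead of physically removing the key,
        # so late data cannot resurrect deleted keys; a view on top could
        # project away rows where deleted_at is set.
        row["deleted_at"] = event[seq] if event["operation"] == "DELETE" else None
        target[event[key]] = row
    return target

state = apply_changes({}, [
    {"userId": 1, "name": "a", "operation": "INSERT", "sequenceNum": 1},
    {"userId": 1, "name": "b", "operation": "UPDATE", "sequenceNum": 3},
    {"userId": 1, "name": "x", "operation": "UPDATE", "sequenceNum": 2},  # late: skipped
    {"userId": 2, "name": "c", "operation": "INSERT", "sequenceNum": 1},
    {"userId": 2, "operation": "DELETE", "sequenceNum": 4},
    {"userId": 2, "name": "y", "operation": "UPDATE", "sequenceNum": 3},  # late: skipped
])
```

This is only a single-process illustration of the ordering semantics; in Spark the same logic would run per key inside a stateful streaming operator or a foreachBatch MERGE.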
>>>>>
>>>>>> - Given the massive surface area (new SQL DDL, streaming MERGE paths,
>>>>>> SCD Type 1/2 state logic, tombstone GC), a phased delivery plan would be
>>>>>> very helpful. It would also clarify exactly which Lakeflow components are
>>>>>> being contributed to open source versus what needs to be rebuilt from
>>>>>> scratch.
>>>>>>
>>>>>> Best regards,
>>>>>> Vaquar Khan
>>>>>>
>>>>>> On Sat, 28 Mar 2026 at 08:35, 陈 小健 <[email protected]> wrote:
>>>>>>
>>>>>>> unsubscribe
>>>>>>>
>>>>>>> Get Outlook for Android <https://aka.ms/AAb9ysg>
>>>>>>> ------------------------------
>>>>>>> *From:* Andreas Neumann <[email protected]>
>>>>>>> *Sent:* Saturday, March 28, 2026 2:43:54 AM
>>>>>>> *To:* [email protected] <[email protected]>
>>>>>>> *Subject:* Re: SPIP: Auto CDC support for Apache Spark
>>>>>>>
>>>>>>> Hi Vaibhav,
>>>>>>>
>>>>>>> The goal of this proposal is not to replace MERGE but to provide a
>>>>>>> simple abstraction for the common use case of CDC.
>>>>>>> MERGE itself is a very powerful operator, and there will always be
>>>>>>> use cases outside of CDC that will require MERGE.
>>>>>>>
>>>>>>> And thanks for spotting the typo in the SPIP. It is fixed now!
>>>>>>>
>>>>>>> Cheers -Andreas
>>>>>>>
>>>>>>> On Fri, Mar 27, 2026 at 10:53 AM Vaibhav Kumar <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Andreas,
>>>>>>>
>>>>>>> Thanks for sharing the SPIP. Does that mean the MERGE statement
>>>>>>> would be deprecated? Also, I think there was a small typo, which I
>>>>>>> have suggested in the doc.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Vaibhav
>>>>>>>
>>>>>>> On Fri, Mar 27, 2026 at 10:15 AM DB Tsai <[email protected]> wrote:
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>>>>
>>>>>>> On Mar 26, 2026, at 6:08 PM, Andreas Neumann <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'd like to start a discussion on a new SPIP to introduce Auto CDC
>>>>>>> support to Apache Spark.
>>>>>>>
>>>>>>> - SPIP Document:
>>>>>>> https://docs.google.com/document/d/1Hp5BGEYJRHbk6J7XUph3bAPZKRQXKOuV1PEaqZMMRoQ/
>>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-55668
>>>>>>>
>>>>>>> Motivation
>>>>>>>
>>>>>>> With the upcoming introduction of standardized CDC support
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-55668>, Spark will soon
>>>>>>> have a unified way to produce change data feeds. However, consuming
>>>>>>> these feeds and applying them to a target table remains a significant
>>>>>>> challenge.
>>>>>>>
>>>>>>> Common patterns like SCD Type 1 (maintaining a 1:1 replica) and SCD
>>>>>>> Type 2 (tracking full change history) often require hand-crafted,
>>>>>>> complex MERGE logic. In distributed systems, these implementations
>>>>>>> are frequently error-prone when handling deletions or out-of-order
>>>>>>> data.
>>>>>>>
>>>>>>> Proposal
>>>>>>>
>>>>>>> This SPIP proposes a new "Auto CDC" flow type for Spark. It
>>>>>>> encapsulates the complex logic for SCD types and out-of-order data,
>>>>>>> allowing data engineers to configure a declarative flow instead of
>>>>>>> writing manual MERGE statements. This feature will be available in
>>>>>>> both Python and SQL.
>>>>>>> Example SQL:
>>>>>>>
>>>>>>> -- Produce a change feed
>>>>>>> CREATE STREAMING TABLE cdc_data.users AS
>>>>>>> SELECT * FROM STREAM my_table CHANGES FROM VERSION 10;
>>>>>>>
>>>>>>> -- Consume the change feed
>>>>>>> CREATE FLOW flow
>>>>>>> AS AUTO CDC INTO target
>>>>>>> FROM stream(cdc_data.users)
>>>>>>> KEYS (userId)
>>>>>>> APPLY AS DELETE WHEN operation = "DELETE"
>>>>>>> SEQUENCE BY sequenceNum
>>>>>>> COLUMNS * EXCEPT (operation, sequenceNum)
>>>>>>> STORED AS SCD TYPE 2
>>>>>>> TRACK HISTORY ON * EXCEPT (city);
>>>>>>>
>>>>>>> Please review the full SPIP for the technical details. Looking
>>>>>>> forward to your feedback and discussion!
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Andreas
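[Editor's note: as a rough illustration of what STORED AS SCD TYPE 2 with TRACK HISTORY ON * EXCEPT (city) in the example above implies, here is a hedged pure-Python simulation. The helper scd2_apply and the __START_AT / __END_AT column names are hypothetical, not defined by the SPIP: changes to tracked columns close the current version and open a new one, while changes to untracked columns (city) update the current row in place.]

```python
def scd2_apply(history, events, key="userId", seq="sequenceNum",
               untracked=("city",)):
    """Maintain SCD Type 2 history (list of versioned rows) from CDC events."""
    for e in sorted(events, key=lambda x: x[seq]):
        data = {k: v for k, v in e.items() if k not in (seq, "operation")}
        open_rows = [r for r in history
                     if r[key] == e[key] and r["__END_AT"] is None]
        current = open_rows[0] if open_rows else None
        if current is None:
            # First version of this key.
            history.append({**data, "__START_AT": e[seq], "__END_AT": None})
        elif any(current[k] != v for k, v in data.items() if k not in untracked):
            # A tracked column changed: close the old version, open a new one.
            current["__END_AT"] = e[seq]
            history.append({**data, "__START_AT": e[seq], "__END_AT": None})
        else:
            # Only untracked columns changed: update in place, no new version.
            current.update(data)
    return history

history = scd2_apply([], [
    {"userId": 1, "name": "a", "city": "NYC", "sequenceNum": 1},
    {"userId": 1, "name": "a", "city": "SF",  "sequenceNum": 2},  # untracked change
    {"userId": 1, "name": "b", "city": "SF",  "sequenceNum": 3},  # tracked change
])
```

Under these assumptions, the second event rewrites the open row's city without creating history, and the third closes that row at sequence 3 and opens a new version, which matches the intent of TRACK HISTORY ON * EXCEPT (city).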
